
How-To Tutorials - Data

1204 Articles

IoT and Decision Science

Packt
13 Oct 2016
10 min read
In this article by Jojo Moolayil, author of the book Smarter Decisions - The Intersection of Internet of Things and Decision Science, you will learn that the Internet of Things (IoT) and Decision Science have been among the hottest topics in the industry for a while now. You may have heard about IoT and wanted to learn more about it, only to come across multiple names and definitions on the Internet with hazy differences between them. Decision Science, meanwhile, has grown from a nascent domain into one of the fastest-growing and most widespread horizontals in the industry in recent years. With the ever-increasing volume, variety, and veracity of data, decision science has become more and more valuable to the industry. Using data to uncover latent patterns and insights to solve business problems has made it easier for businesses to take actions with better impact and accuracy.

Data is the new oil for the industry, and with the boom of IoT, we are in a world where more and more devices are getting connected to the Internet, with sensors capturing ever more vital and granular details that had never been touched earlier. IoT is a game changer: with a plethora of devices connected to each other, the industry is eagerly attempting to tap the huge potential that it can deliver. The true value and impact of IoT is delivered with the help of Decision Science. IoT has inherently generated an ocean of data in which you can swim to gather insights and take smarter decisions at the intersection of Decision Science and IoT. In this book, you will learn about IoT and Decision Science in detail by solving real-life IoT business problems using a structured approach. In this article, we will begin by understanding the fundamental basics of IoT and Decision Science problem solving. You will learn the following concepts:

Understanding IoT and demystifying Machine to Machine (M2M), IoT, Internet of Everything (IoE), and Industrial IoT (IIoT)
Digging deeper into the logical stack of IoT
Studying the problem life cycle
Exploring the problem landscape
The art of problem solving
The problem-solving framework

It is highly recommended that you explore this article in depth. It focuses on the basics and concepts required to build problems and use cases.

Understanding the IoT

To get started with the IoT, let's first try to understand it using the simplest constructs. Internet and Things: we have two simple words here that help us understand the entire concept. So what is the Internet? It is basically a network of computing devices. Similarly, what is a Thing? It could be any real-life entity featuring Internet connectivity. So what do we decipher from IoT? It is a network of connected Things that can transmit and receive data from other Things once connected to the network. This is the Internet of Things in a nutshell. Now, let's take a glance at the definition. IoT can be defined as the ever-growing network of Things (entities) that feature Internet connectivity, and the communication that occurs between them and other Internet-enabled devices and systems. The Things in IoT are equipped with sensors that capture vital information from the device during its operations, and the device features Internet connectivity that helps it transfer and communicate with other devices and the network.
Today, when we discuss IoT, many similar terms come into the picture, such as Industrial Internet, M2M, IoE, and a few more, and it can be difficult to understand the differences between them. Before we begin delineating the differences between these hazy terms and understand how IoT evolved in the industry, let's first take a simple real-life scenario to understand what exactly IoT looks like.

IoT in a real-life scenario

Let's take a simple example to understand how IoT works. Consider a scenario where you are a father in a family with a working mother and a 10-year-old son studying in school. You and your wife work in different offices. Your house is equipped with quite a few smart devices, say, a smart microwave, smart refrigerator, and smart TV. You are currently in the office and you get notified on your smartphone that your son, Josh, has reached home from school. (He used his personal smart key to open the door.) You then use your smartphone to turn on the microwave at home to heat the sandwiches kept in it. Your son gets notified on the smart home controller that you have hot sandwiches ready for him. He quickly finishes them and starts preparing for a math test at school, and you resume your work. After a while, you get notified again that your wife has also reached home (she also uses a similar smart key), and you suddenly realize that you need to reach home to help your son with his math test. You again use your smartphone to change the air conditioner settings for three people and set the refrigerator to defrost using the app. In another 15 minutes, you are home and the air conditioning temperature is well set for three people. You then grab a can of juice from the refrigerator and discuss some math problems with your son on the couch.

Intuitive, isn't it? How did all this happen, and how did you access and control everything right from your phone? Well, this is how IoT works! Devices can talk to each other and also take actions based on the signals received (the original figure, The IoT scenario, illustrates this).

Let's take a closer look at the same scenario. You were sitting in the office and you could access the air conditioner, microwave, refrigerator, and home controller through your smartphone. Yes, the devices feature Internet connectivity, and once connected to the network, they can send and receive data from other devices and take actions based on signals. A simple protocol helps these devices understand and send data and signals to a plethora of heterogeneous devices connected to the network. We will get into the details of the protocol and how these devices talk to each other soon. However, before that, we will get into some details of how this technology started and why we have so many different names for IoT today.

Demystifying M2M, IoT, IIoT, and IoE

Now that we have a general understanding of what IoT is, let's try to understand how it all started. A few questions that we will try to answer are: Is IoT very new in the market? When did this start? How did this start? What's the difference between M2M, IoT, IoE, and all those different names? And so on. If we look at the fundamentals of IoT, that is, machines or devices connected to each other in a network, this isn't something really new or radically challenging, so what is the buzz all about? The buzz about machines talking to each other started long before most of us thought of it, and back then it was called Machine to Machine (M2M) data.
In the early 1950s, a lot of the machinery deployed for aerospace and military operations required automated communication and remote access for service and maintenance. Telemetry is where it all started. It is a process in which highly automated communication is established: data is collected by making measurements at remote or inaccessible geographical locations and then sent to a receiver through a cellular or wired network, where it is monitored for further actions. To understand this better, let's take the example of a manned space shuttle sent for space exploration. A huge number of sensors are installed in such a space shuttle to monitor the physical condition of the astronauts, the environment, and also the condition of the space shuttle itself. The data collected through these sensors is then sent back to the substation located on Earth, where a team uses it to analyze and take further actions. Around the same time, the industrial revolution peaked and a huge number of machines were deployed in various industries. Some of these industries, where failures could be catastrophic, also saw the rise of machine-to-machine communication and remote monitoring (see the original figure, Telemetry).

Thus, machine-to-machine data, a.k.a. M2M, was born, mainly through telemetry. Unfortunately, it didn't scale to the extent it was expected to, largely because of the era it was developed in. Back then, cellular connectivity was neither widespread nor affordable, and installing sensors and developing the infrastructure to gather data from them was very expensive. Therefore, only a small set of business and military use cases leveraged it. As time passed, a lot changed. The Internet was born and flourished exponentially. The number of devices that got connected to the Internet was colossal. Computing power, storage capacities, and communication and technology infrastructure scaled massively. Additionally, the need to connect devices to other devices evolved, and the cost of setting up the infrastructure for this became affordable and agile. Thus came the IoT.

The major difference between M2M and IoT initially was that the latter used the Internet (IPv4/6) as the medium, whereas the former used cellular or wired connections for communication. However, this was mainly a consequence of the times they evolved in. Today, heavy engineering industries have deployed machinery that communicates over the IPv4/6 network, and this is called Industrial IoT or sometimes M2M. The difference between the two is minimal, and there are enough cases where both are used interchangeably. Therefore, even though M2M was actually the ancestor of IoT, today both are pretty much the same. M2M and IIoT are nowadays aggressively used to market IoT disruptions in the industrial sector.

IoE, or the Internet of Everything, is a term that surfaced in the media and on the Internet fairly recently. The term was coined by Cisco with a very intuitive definition. It emphasizes humans as one dimension in the ecosystem. It is a more organized way of defining IoT: IoE has logically broken down the IoT ecosystem into smaller components and simplified the ecosystem in an innovative way that was very much needed. IoE divides its ecosystem into four logical units, as follows:

People
Processes
Data
Devices

Built on the foundation of IoT, IoE is defined as "the networked connection of People, Data, Processes, and Things."
Overall, these different terms in the IoT fraternity have more similarities than differences, and at the core they are the same: devices connecting to each other over a network. The names are then stylized to give a more intrinsic connotation of the business they refer to, such as Industrial IoT or Machine to Machine for B2B heavy engineering, manufacturing, and energy verticals, Consumer IoT for B2C industries, and so on.

Summary

In this article, we got started with the IoT: a network of connected Things, where a Thing can be any real-life entity featuring Internet connectivity, that can transmit and receive data once connected to the network. We also saw how the field evolved from telemetry and M2M, and how M2M, IIoT, and IoE relate to IoT.

Resources for Article:

Further resources on this subject:
Machine Learning Tasks [article]
Welcome to Machine Learning Using the .NET Framework [article]
Why Big Data in the Financial Sector? [article]


Reconstructing 3D Scenes

Packt
13 Oct 2016
25 min read
In this article by Robert Laganiere, author of the book OpenCV 3 Computer Vision Application Programming Cookbook, Third Edition, we cover the following recipes: Calibrating a camera, and Recovering camera pose.

Digital image formation

Let us now redraw a new version of the figure describing the pin-hole camera model. More specifically, we want to demonstrate the relation between a point in 3D at position (X,Y,Z) and its image (x,y) on a camera, specified in pixel coordinates. Note the changes that have been made to the original figure. First, a reference frame was positioned at the center of projection; then, the Y-axis was aligned to point downwards to get a coordinate system that is compatible with the usual convention, which places the image origin at the upper-left corner of the image. Finally, we have identified a special point on the image plane by considering the line coming from the focal point that is orthogonal to the image plane. The point (u0,v0) is the pixel position at which this line pierces the image plane and is called the principal point. It would be logical to assume that this principal point is at the center of the image plane, but in practice, it might be off by a few pixels depending on the precision of the camera. Since we are dealing with digital images, the number of pixels on the image plane (its resolution) is another important characteristic of a camera. We learned previously that a 3D point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z). Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by the pixel width (px) and height (py), respectively. Note that by dividing the focal length given in world units (generally millimeters) by px, we obtain the focal length expressed in (horizontal) pixels. We therefore define this term as fx. Similarly, fy = f/py is the focal length expressed in vertical pixel units. We know that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. Also, the physical size of a pixel can be obtained by dividing the size of the image sensor (generally in millimeters) by the number of pixels (horizontally or vertically). In modern sensors, pixels are generally square, that is, they have the same horizontal and vertical size. These equations can be rewritten in matrix form to give the complete projective equation in its most general form; it is written out explicitly a little further below.

Calibrating a camera

Camera calibration is the process by which the different camera parameters (that is, the ones appearing in the projective equation) are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. Accurate calibration information can, however, be obtained by undertaking an appropriate camera calibration step. An active camera calibration procedure proceeds by showing known patterns to the camera and analyzing the obtained images. An optimization process then determines the optimal parameter values that explain the observations. This is a complex process that has been made easy by the availability of OpenCV calibration functions.

How to do it...

To calibrate a camera, the idea is to show it a set of scene points for which the 3D positions are known.
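Before going further, it is worth writing the projective equation out explicitly, since the original figures that carried it are not reproduced here. The following is a reconstruction in the standard pinhole form, consistent with the definitions of fx, fy, u0, and v0 above and with the rotation entries r1 to r9 and translation t1, t2, t3 referred to later in this recipe:

$$
s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
=
\begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r_1 & r_2 & r_3 & t_1 \\ r_4 & r_5 & r_6 & t_2 \\ r_7 & r_8 & r_9 & t_3 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
$$

Here s is an arbitrary scale factor, the first matrix holds the intrinsic parameters, and the second holds the extrinsic (rotation and translation) parameters. With this notation in mind, let's continue with the calibration procedure.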
Then, you need to observe where these points project on the image. With the knowledge of a sufficient number of 3D points and associated 2D image points, the exact camera parameters can be inferred from the projective equation. Obviously, for accurate results, we need to observe as many points as possible. One way to achieve this would be to take one picture of a scene with many known 3D points, but in practice this is rarely feasible. A more convenient way is to take several images of a set of 3D points from different viewpoints. This approach is simpler, but it requires you to compute the position of each camera view in addition to the internal camera parameters, which is fortunately feasible.

OpenCV proposes that you use a chessboard pattern to generate the set of 3D scene points required for calibration. This pattern creates points at the corners of each square, and since the pattern is flat, we can freely assume that the board is located at Z=0, with the X and Y axes well aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. The original text shows an example of a calibration pattern image made of 7x5 inner corners as captured during the calibration step.

The good thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (the number of horizontal and vertical inner corner points). The function will return the positions of these chessboard corners on the image. If the function fails to find the pattern, it simply returns false, as shown:

// output vectors of image points
std::vector<cv::Point2f> imageCorners;
// number of inner corners on the chessboard
cv::Size boardSize(7,5);
// Get the chessboard corners
bool found = cv::findChessboardCorners(image,         // image of chessboard pattern
                                       boardSize,     // size of pattern
                                       imageCorners); // list of detected corners

The output parameter, imageCorners, will simply contain the pixel coordinates of the detected inner corners of the shown pattern. Note that this function accepts additional parameters if you need to tune the algorithm, which are not discussed here. There is also a special function that draws the detected corners on the chessboard image, with lines connecting them in a sequence:

// Draw the corners
cv::drawChessboardCorners(image, boardSize,
                          imageCorners,
                          found); // corners have been found

In the resulting image, the lines that connect the points show the order in which the points are listed in the vector of detected image points.

To perform a calibration, we now need to specify the corresponding 3D points. You can specify these points in the units of your choice (for example, in centimeters or in inches); however, the simplest option is to assume that each square represents one unit. In that case, the coordinates of the first point would be (0,0,0) (assuming that the board is located at a depth of Z=0), the coordinates of the second point would be (1,0,0), and so on, the last point being located at (6,4,0). There are a total of 35 points in this pattern, which is too few to obtain an accurate calibration. To get more points, you need to show more images of the same calibration pattern from various points of view. To do so, you can either move the pattern in front of the camera or move the camera around the board; from a mathematical point of view, this is completely equivalent.
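The recipe's code is in C++. Purely as an illustration (this sketch is mine, not from the book), the same chessboard-based procedure can be written compactly with OpenCV's Python bindings; the image path below is a placeholder:

import glob
import numpy as np
import cv2

board_size = (7, 5)  # inner corners per row and column

# 3D reference points on the board: (i, j, 0), one unit per square
objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)

object_points, image_points = [], []  # one entry per successfully processed view
image_size = None

for filename in glob.glob("chessboards/*.jpg"):  # placeholder path
    gray = cv2.imread(filename, cv2.IMREAD_GRAYSCALE)
    image_size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, board_size)
    if not found:
        continue
    # refine the detected corners to subpixel accuracy
    criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS, 30, 0.1)
    corners = cv2.cornerSubPix(gray, corners, (5, 5), (-1, -1), criteria)
    object_points.append(objp)
    image_points.append(corners)

# returns the re-projection error, the camera matrix, the distortion
# coefficients, and one rotation/translation vector per view
rms, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, image_size, None, None)

The CameraCalibrator class developed next wraps the same steps (corner detection, subpixel refinement, and the call to the calibration function) in C++.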
The OpenCV calibration function assumes that the reference frame is fixed on the calibration pattern and will calculate the rotation and translation of the camera with respect to this reference frame. Let's now encapsulate the calibration process in a CameraCalibrator class. The attributes of this class are as follows:

// input points:
// the points in world coordinates
// (each square is one unit)
std::vector<std::vector<cv::Point3f>> objectPoints;
// the image point positions in pixels
std::vector<std::vector<cv::Point2f>> imagePoints;
// output Matrices
cv::Mat cameraMatrix;
cv::Mat distCoeffs;
// flag to specify how calibration is done
int flag;

Note that the input vectors of scene and image points are in fact made of std::vector of point instances; each vector element is the vector of points from one view. Here, we decided to add the calibration points by specifying a vector of chessboard image filenames as input; the method takes care of extracting the point coordinates from the images:

// Open chessboard images and extract corner points
int CameraCalibrator::addChessboardPoints(
    const std::vector<std::string>& filelist, // list of filenames
    cv::Size& boardSize) {                    // calibration board size

  // the points on the chessboard
  std::vector<cv::Point2f> imageCorners;
  std::vector<cv::Point3f> objectCorners;

  // 3D Scene Points:
  // Initialize the chessboard corners
  // in the chessboard reference frame
  // The corners are at 3D location (X,Y,Z) = (i,j,0)
  for (int i=0; i<boardSize.height; i++) {
    for (int j=0; j<boardSize.width; j++) {
      objectCorners.push_back(cv::Point3f(i, j, 0.0f));
    }
  }

  // 2D Image points:
  cv::Mat image; // to contain chessboard image
  int successes = 0;
  // for all viewpoints
  for (int i=0; i<filelist.size(); i++) {
    // Open the image
    image = cv::imread(filelist[i], 0);
    // Get the chessboard corners
    bool found = cv::findChessboardCorners(image,         // image of chessboard pattern
                                           boardSize,     // size of pattern
                                           imageCorners); // list of detected corners
    // Get subpixel accuracy on the corners
    if (found) {
      cv::cornerSubPix(image, imageCorners,
                       cv::Size(5, 5),   // half size of search window
                       cv::Size(-1, -1),
                       cv::TermCriteria(cv::TermCriteria::MAX_ITER +
                                        cv::TermCriteria::EPS,
                                        30,    // max number of iterations
                                        0.1)); // min accuracy
      // If we have a good board, add it to our data
      if (imageCorners.size() == boardSize.area()) {
        // Add image and scene points from one view
        addPoints(imageCorners, objectCorners);
        successes++;
      }
    }
  }
  return successes;
}

The first loop inputs the 3D coordinates of the chessboard, and the corresponding image points are the ones provided by the cv::findChessboardCorners function; this is done for all the available viewpoints. Moreover, in order to obtain a more accurate image point location, the cv::cornerSubPix function can be used; as the name suggests, the image points are then localized at subpixel accuracy. The termination criterion specified by the cv::TermCriteria object defines the maximum number of iterations and the minimum accuracy in subpixel coordinates; the first of these two conditions to be reached stops the corner refinement process. When a set of chessboard corners has been successfully detected, the points are added to the vectors of image and scene points using our addPoints method.
Once a sufficient number of chessboard images have been processed (and, consequently, a large number of 3D scene point / 2D image point correspondences are available), we can initiate the computation of the calibration parameters as shown:

// Calibrate the camera
// returns the re-projection error
double CameraCalibrator::calibrate(cv::Size &imageSize) {
  // Output rotations and translations
  std::vector<cv::Mat> rvecs, tvecs;
  // start calibration
  return calibrateCamera(objectPoints, // the 3D points
                         imagePoints,  // the image points
                         imageSize,    // image size
                         cameraMatrix, // output camera matrix
                         distCoeffs,   // output distortion matrix
                         rvecs, tvecs, // Rs, Ts
                         flag);        // set options
}

In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. These will be described in the next section.

How it works...

In order to explain the result of the calibration, we need to go back to the projective equation presented in the introduction of this article. This equation describes the transformation of a 3D point into a 2D point through the successive application of two matrices. The first matrix includes all of the camera parameters, which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that explicitly returns the values of the intrinsic parameters given a calibration matrix. The second matrix is there to have the input points expressed in camera-centric coordinates. It is composed of a rotation matrix (a 3x3 matrix) and a translation vector (a 3x1 matrix). Remember that in our calibration example, the reference frame was placed on the chessboard. Therefore, there is a rigid transformation (made of a rotation component represented by the matrix entries r1 to r9 and a translation represented by t1, t2, and t3) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration, and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system.

The calibration results provided by cv::calibrateCamera are obtained through an optimization process. This process aims to find the intrinsic and extrinsic parameters that minimize the difference between the predicted image point positions, as computed from the projection of the 3D scene points, and the actual image point positions, as observed on the image. The sum of this difference over all the points specified during the calibration is called the re-projection error. The intrinsic parameters of our test camera, obtained from a calibration based on 27 chessboard images, are fx=409 pixels, fy=408 pixels, u0=237, and v0=171. Our calibration images have a size of 536x356 pixels. From the calibration results, you can see that, as expected, the principal point is close to the center of the image, yet off by a few pixels. The calibration images were taken using a Nikon D500 camera with an 18mm lens. Looking at the manufacturer specifications, we find that the sensor size of this camera is 23.5mm x 15.7mm, which gives us a pixel size of 0.0438mm.
The estimated focal length is expressed in pixels, so multiplying the result by the pixel size gives us an estimated focal length of 17.8mm, which is consistent with the actual lens we used.

Let us now turn our attention to the distortion parameters. So far, we have mentioned that under the pin-hole camera model, we can neglect the effect of the lens. However, this is only possible if the lens used to capture an image does not introduce significant optical distortions. Unfortunately, this is not the case with lower-quality lenses or with lenses that have a very short focal length. Even the lens we used in this experiment introduced some distortion, that is, the edges of the rectangular board are curved in the image. Note that this distortion becomes more important as we move away from the center of the image. This is a typical distortion observed with a fish-eye lens and is called radial distortion. It is possible to compensate for these deformations by introducing an appropriate distortion model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be reverted in order to undo the distortions visible in the image. Fortunately, the exact parameters of the transformation that will correct the distortions can be obtained together with the other camera parameters during the calibration phase. Once this is done, any image from the newly calibrated camera can be undistorted. Therefore, we have added an additional method to our calibration class:

// remove distortion in an image (after calibration)
cv::Mat CameraCalibrator::remap(const cv::Mat &image) {
  cv::Mat undistorted;
  if (mustInitUndistort) { // called once per calibration
    cv::initUndistortRectifyMap(cameraMatrix, // computed camera matrix
                                distCoeffs,   // computed distortion matrix
                                cv::Mat(),    // optional rectification (none)
                                cv::Mat(),    // camera matrix to generate undistorted
                                image.size(), // size of undistorted
                                CV_32FC1,     // type of output map
                                map1, map2);  // the x and y mapping functions
    mustInitUndistort = false;
  }
  // Apply mapping functions
  cv::remap(image, undistorted, map1, map2,
            cv::INTER_LINEAR); // interpolation type
  return undistorted;
}

Running this code on one of our calibration images results in an undistorted image. To correct the distortion, OpenCV uses a polynomial function that is applied to the image points in order to move them to their undistorted positions. By default, five coefficients are used; a model made of eight coefficients is also available. Once these coefficients are obtained, it is possible to compute two cv::Mat mapping functions (one for the x coordinate and one for the y coordinate) that give the new undistorted position of an image point on a distorted image. This is computed by the cv::initUndistortRectifyMap function, and the cv::remap function remaps all the points of an input image to a new image. Note that, because of the nonlinear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you then obtain output pixels that have no values in the input image (they will be displayed as black pixels).

There's more...

More options are available when it comes to camera calibration.

Calibration with known intrinsic parameters

When a good estimate of the camera's intrinsic parameters is known, it could be advantageous to input them in the cv::calibrateCamera function.
They will then be used as initial values in the optimization process. To do so, you just need to add the cv::CALIB_USE_INTRINSIC_GUESS flag and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (cv::CALIB_FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (cv::CALIB_FIX_RATIO), in which case you assume that the pixels have a square shape.

Using a grid of circles for calibration

Instead of the usual chessboard pattern, OpenCV also offers the possibility to calibrate a camera by using a grid of circles. In this case, the centers of the circles are used as calibration points. The corresponding function is very similar to the function we used to locate the chessboard corners, for example:

cv::Size boardSize(7,7);
std::vector<cv::Point2f> centers;
bool found = cv::findCirclesGrid(image, boardSize, centers);

See also

A flexible new technique for camera calibration, by Z. Zhang, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000, is a classic paper on the problem of camera calibration.

Recovering camera pose

When a camera is calibrated, it becomes possible to relate the captured images with the outside world. If the 3D structure of an object is known, then one can predict how the object will be imaged on the sensor of the camera. The process of image formation is indeed completely described by the projective equation that was presented at the beginning of this article. When most of the terms of this equation are known, it becomes possible to infer the values of the other elements (2D or 3D) through the observation of some images. In this recipe, we will look at the camera pose recovery problem when a known 3D structure is observed.

How to do it...

Let's consider a simple object here, a bench in a park. We took an image of it using the camera/lens system calibrated in the previous recipe. We have manually identified 8 distinct image points on the bench that we will use for our camera pose estimation. Having access to this object makes it possible to take some physical measurements. The bench is composed of a seat of size 242.5cm x 53.5cm x 9cm and a back of size 242.5cm x 24cm x 9cm that is fixed 12cm above the seat. Using this information, we can easily derive the 3D coordinates of the eight identified points in an object-centric reference frame (here we fixed the origin at the left extremity of the intersection between the two planes). We can then create a vector of cv::Point3f containing these coordinates:

// Input object points
std::vector<cv::Point3f> objectPoints;
objectPoints.push_back(cv::Point3f(0, 45, 0));
objectPoints.push_back(cv::Point3f(242.5, 45, 0));
objectPoints.push_back(cv::Point3f(242.5, 21, 0));
objectPoints.push_back(cv::Point3f(0, 21, 0));
objectPoints.push_back(cv::Point3f(0, 9, -9));
objectPoints.push_back(cv::Point3f(242.5, 9, -9));
objectPoints.push_back(cv::Point3f(242.5, 9, 44.5));
objectPoints.push_back(cv::Point3f(0, 9, 44.5));

The question now is where the camera was with respect to these points when the picture was taken. Since the coordinates of the images of these known points on the 2D image plane are also known, it becomes easy to answer this question using the cv::solvePnP function.
Here, the correspondence between the 3D and the 2D points has been established manually, but as a reader of this book, you should be able to come up with methods that would allow you to obtain this information automatically:

// Input image points
std::vector<cv::Point2f> imagePoints;
imagePoints.push_back(cv::Point2f(136, 113));
imagePoints.push_back(cv::Point2f(379, 114));
imagePoints.push_back(cv::Point2f(379, 150));
imagePoints.push_back(cv::Point2f(138, 135));
imagePoints.push_back(cv::Point2f(143, 146));
imagePoints.push_back(cv::Point2f(381, 166));
imagePoints.push_back(cv::Point2f(345, 194));
imagePoints.push_back(cv::Point2f(103, 161));

// Get the camera pose from 3D/2D points
cv::Mat rvec, tvec;
cv::solvePnP(objectPoints, imagePoints,      // corresponding 3D/2D pts
             cameraMatrix, cameraDistCoeffs, // calibration
             rvec, tvec);                    // output pose

// Convert to 3D rotation matrix
cv::Mat rotation;
cv::Rodrigues(rvec, rotation);

This function computes the rigid transformation (rotation and translation) that brings the object coordinates into the camera-centric reference frame (that is, the one that has its origin at the focal point). It is also important to note that the rotation computed by this function is given in the form of a 3D vector. This is a compact representation in which the rotation to apply is described by a unit vector (an axis of rotation) around which the object is rotated by a certain angle; this axis-angle representation is related to the rotation matrix through Rodrigues' rotation formula. In OpenCV, the angle of rotation corresponds to the norm of the output rotation vector, which is aligned with the axis of rotation. This is why the cv::Rodrigues function is used to obtain the 3D rotation matrix that appears in our projective equation.

The pose recovery procedure described here is simple, but how do we know that we obtained the right camera/object pose information? We can visually assess the quality of the results using the cv::viz module, which gives us the ability to visualize 3D information. The use of this module is explained in the last section of this recipe. For now, let's display a simple 3D representation of our object and the camera that captured it. It might be difficult to judge the quality of the pose recovery just by looking at a static image, but if you test the example of this recipe on your computer, you will be able to move this representation in 3D using your mouse, which should give you a better sense of the solution obtained.

How it works...

In this recipe, we assumed that the 3D structure of the object was known, as well as the correspondence between sets of object points and image points. The camera's intrinsic parameters were also known through calibration. If you look at our projective equation presented at the end of the Digital image formation section in the introduction of this article, this means that we have points for which the coordinates (X,Y,Z) and (x,y) are known. We also know the elements of the first matrix (the intrinsic parameters). Only the second matrix is unknown; this is the one that contains the extrinsic parameters of the camera, that is, the camera/object pose information. Our objective is to recover these unknown parameters from the observation of 3D scene points. This problem is known as the Perspective-n-Point problem, or PnP problem. Rotation has three degrees of freedom (for example, the angles of rotation around the three axes) and translation also has three degrees of freedom. We therefore have a total of 6 unknowns.
For each object point/image point correspondence, the projective equation gives us three algebraic equations, but since the projective equation is defined up to a scale factor, we only get 2 independent equations. A minimum of three points is therefore required to solve this system of equations. Obviously, more points provide a more reliable estimate. In practice, many different algorithms have been proposed to solve this problem, and OpenCV proposes a number of different implementations in its cv::solvePnP function. The default method consists of optimizing what is called the reprojection error. Minimizing this type of error is considered to be the best strategy to get accurate 3D information from camera images. In our problem, it corresponds to finding the optimal camera position that minimizes the 2D distance between the projected 3D points (as obtained by applying the projective equation) and the observed image points given as input. Note that OpenCV also has a cv::solvePnPRansac function. As the name suggests, this function uses the RANSAC algorithm to solve the PnP problem. This means that some of the object point/image point correspondences may be wrong, and the function will return which ones have been identified as outliers. This is very useful when these correspondences have been obtained through an automatic process that can fail for some points.

There's more...

When working with 3D information, it is often difficult to validate the solutions obtained. To this end, OpenCV offers a simple yet powerful visualization module that facilitates the development and debugging of 3D vision algorithms. It allows you to insert points, lines, cameras, and other objects into a virtual 3D environment that you can interactively visualize from various points of view.

cv::Viz, a 3D Visualizer module

cv::Viz is an extra module of the OpenCV library that is built on top of the open source VTK library. This Visualization Toolkit (VTK) is a powerful framework used for 3D computer graphics. With cv::viz, you create a 3D virtual environment to which you can add a variety of objects. A visualization window is created that displays the environment from a given point of view. You saw in this recipe an example of what can be displayed in a cv::viz window. This window responds to mouse events that are used to navigate inside the environment (through rotations and translations). This section describes the basic use of the cv::viz module.

The first thing to do is to create the visualization window. Here we use a white background:

// Create a viz window
cv::viz::Viz3d visualizer("Viz window");
visualizer.setBackgroundColor(cv::viz::Color::white());

Next, you create your virtual objects and insert them into the scene. There is a variety of predefined objects. One of them is particularly useful for us; it is the one that creates a virtual pin-hole camera:

// Create a virtual camera
cv::viz::WCameraPosition cam(cMatrix, // matrix of intrinsics
                             image,   // image displayed on the plane
                             30.0,    // scale factor
                             cv::viz::Color::black());
// Add the virtual camera to the environment
visualizer.showWidget("Camera", cam);

The cMatrix variable is a cv::Matx33d (that is, a cv::Matx<double,3,3>) instance containing the intrinsic camera parameters as obtained from calibration. By default, this camera is inserted at the origin of the coordinate system. To represent the bench, we used two rectangular cuboid objects.
// Create a virtual bench from cuboids
cv::viz::WCube plane1(cv::Point3f(0.0, 45.0, 0.0),
                      cv::Point3f(242.5, 21.0, -9.0),
                      true,  // show wire frame
                      cv::viz::Color::blue());
plane1.setRenderingProperty(cv::viz::LINE_WIDTH, 4.0);
cv::viz::WCube plane2(cv::Point3f(0.0, 9.0, -9.0),
                      cv::Point3f(242.5, 0.0, 44.5),
                      true,  // show wire frame
                      cv::viz::Color::blue());
plane2.setRenderingProperty(cv::viz::LINE_WIDTH, 4.0);
// Add the virtual objects to the environment
visualizer.showWidget("top", plane1);
visualizer.showWidget("bottom", plane2);

This virtual bench is also added at the origin; it then needs to be moved to its camera-centric position as found by our cv::solvePnP function. It is the responsibility of the setWidgetPose method to perform this operation. This method simply applies the rotation and translation components of the estimated motion:

cv::Mat rotation;
// convert vector-3 rotation
// to a 3x3 rotation matrix
cv::Rodrigues(rvec, rotation);

// Move the bench
cv::Affine3d pose(rotation, tvec);
visualizer.setWidgetPose("top", pose);
visualizer.setWidgetPose("bottom", pose);

The final step is to create a loop that keeps displaying the visualization window. The 1ms pause is there to listen to mouse events:

// visualization loop
while (cv::waitKey(100) == -1 && !visualizer.wasStopped()) {
  visualizer.spinOnce(1,     // pause 1ms
                      true); // redraw
}

This loop will stop when the visualization window is closed or when a key is pressed over an OpenCV image window. Try to apply some motion to an object inside this loop (using setWidgetPose); this is how animation can be created.

See also

Model-based object pose in 25 lines of code, by D. DeMenthon and L. S. Davis, in European Conference on Computer Vision, 1992, pp. 335-343, is a famous method for recovering camera pose from scene points.

Summary

This article has taught us how, under specific conditions, the 3D structure of a scene and the 3D pose of the cameras that captured it can be recovered. We have seen how a good understanding of projective geometry concepts allows us to devise methods that enable 3D reconstruction.

Resources for Article:

Further resources on this subject:
OpenCV: Image Processing using Morphological Filters [article]
Learn computer vision applications in Open CV [article]
Cardboard is Virtual Reality for Everyone [article]


Solving an NLP Problem with Keras, Part 2

Sasank Chilamkurthy
13 Oct 2016
6 min read
In this two-part post series, we are solving a Natural Language Processing (NLP) problem with Keras. In Part 1, we covered the problem and the ATIS dataset we are using. We also went over word embeddings (mapping words to vectors) along with Recurrent Neural Networks, which solve complicated word tagging problems. We passed the word embedding sequence as input into the RNN and then started coding that up. Now, it is time to start loading the data.

Loading Data

Let's load the data using data.load.atisfull(). It will download the data the first time it is run. Words and labels are encoded as indexes into a vocabulary. This vocabulary is stored in w2idx and labels2idx.

import numpy as np
import data.load

train_set, valid_set, dicts = data.load.atisfull()
w2idx, labels2idx = dicts['words2idx'], dicts['labels2idx']

train_x, _, train_label = train_set
val_x, _, val_label = valid_set

# Create index to word/label dicts
idx2w = {w2idx[k]:k for k in w2idx}
idx2la = {labels2idx[k]:k for k in labels2idx}

# For conlleval script
words_train = [ list(map(lambda x: idx2w[x], w)) for w in train_x]
labels_train = [ list(map(lambda x: idx2la[x], y)) for y in train_label]
words_val = [ list(map(lambda x: idx2w[x], w)) for w in val_x]
labels_val = [ list(map(lambda x: idx2la[x], y)) for y in val_label]

n_classes = len(idx2la)
n_vocab = len(idx2w)

Let's print an example sentence and label:

print("Example sentence : {}".format(words_train[0]))
print("Encoded form: {}".format(train_x[0]))
print()
print("It's label : {}".format(labels_train[0]))
print("Encoded form: {}".format(train_label[0]))

Here is the output:

Example sentence : ['i', 'want', 'to', 'fly', 'from', 'boston', 'at', 'DIGITDIGITDIGIT', 'am', 'and', 'arrive', 'in', 'denver', 'at', 'DIGITDIGITDIGITDIGIT', 'in', 'the', 'morning']
Encoded form: [232 542 502 196 208 77 62 10 35 40 58 234 137 62 11 234 481 321]

It's label : ['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-depart_time.time', 'I-depart_time.time', 'O', 'O', 'O', 'B-toloc.city_name', 'O', 'B-arrive_time.time', 'O', 'O', 'B-arrive_time.period_of_day']
Encoded form: [126 126 126 126 126 48 126 35 99 126 126 126 78 126 14 126 126 12]

Keras model

Next, we define the Keras model. Keras has an inbuilt Embedding layer for word embeddings. It expects integer indices. SimpleRNN is the recurrent neural network layer described in Part 1. We will have to use TimeDistributed to pass the output of the RNN, o_t, at each time step t to a fully connected layer; otherwise, only the output at the final time step would be passed on to the next layer.

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN
from keras.layers.core import Dense, Dropout
from keras.layers.wrappers import TimeDistributed
from keras.layers import Convolution1D

model = Sequential()
model.add(Embedding(n_vocab, 100))
model.add(Dropout(0.25))
model.add(SimpleRNN(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile('rmsprop', 'categorical_crossentropy')

Training

Now, let's start training our model. We will pass each sentence as a batch to the model. We cannot use model.fit() because it expects all of the sentences to be the same size. We will therefore use model.train_on_batch(). Training is very fast, since the dataset is relatively small. Each epoch takes about 20 seconds on my MacBook Air.
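One detail worth pausing on before the training loop: the integer labels are converted to one-hot vectors with np.eye inside the loop that follows. As a quick standalone illustration of that idiom (this snippet is mine, not part of the original post):

import numpy as np

n_classes = 5
label = np.array([2, 0, 4])         # integer label for each word in a sentence

# np.eye(n_classes) is the 5x5 identity matrix; indexing its rows with the
# label array picks out one one-hot row per word
one_hot = np.eye(n_classes)[label]  # shape: (3, 5)

# the extra leading axis turns the sentence into a batch of size 1,
# which is what model.train_on_batch expects
batch = one_hot[np.newaxis, :]      # shape: (1, 3, 5)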
import progressbar

n_epochs = 30

for i in range(n_epochs):
    print("Training epoch {}".format(i))
    bar = progressbar.ProgressBar(max_value=len(train_x))
    for n_batch, sent in bar(enumerate(train_x)):
        label = train_label[n_batch]
        # Make labels one hot
        label = np.eye(n_classes)[label][np.newaxis, :]
        # View each sentence as a batch
        sent = sent[np.newaxis, :]
        if sent.shape[1] > 1:  # ignore 1-word sentences
            model.train_on_batch(sent, label)

Evaluation

To measure the accuracy of the model, we use model.predict_on_batch() and metrics.accuracy.conlleval().

from metrics.accuracy import conlleval

labels_pred_val = []

bar = progressbar.ProgressBar(max_value=len(val_x))
for n_batch, sent in bar(enumerate(val_x)):
    label = val_label[n_batch]
    label = np.eye(n_classes)[label][np.newaxis, :]
    sent = sent[np.newaxis, :]

    pred = model.predict_on_batch(sent)
    pred = np.argmax(pred, -1)[0]
    labels_pred_val.append(pred)

labels_pred_val = [ list(map(lambda x: idx2la[x], y)) for y in labels_pred_val]
con_dict = conlleval(labels_pred_val, labels_val, words_val, 'measure.txt')

print('Precision = {}, Recall = {}, F1 = {}'.format(
    con_dict['r'], con_dict['p'], con_dict['f1']))

With this model, I get a 92.36 F1 score:

Precision = 92.07, Recall = 92.66, F1 = 92.36

Note that for the sake of brevity, I've not shown the logging part of the code. Logging losses and accuracies is an important part of coding up a model. An improved model (described in the next section) with logging is at main.py. You can run it as:

$ python main.py

Improvements

One drawback of our current model is that there is no lookahead, that is, the output o_t depends only on the current and previous words, but not on the words that come after it. You can imagine that the next word also holds clues about the properties of the current word. Lookahead can easily be implemented by adding a convolutional layer between the word embeddings and the RNN:

from keras.layers.recurrent import GRU  # the improved model uses a GRU layer

model = Sequential()
model.add(Embedding(n_vocab, 100))
model.add(Convolution1D(128, 5, border_mode='same', activation='relu'))
model.add(Dropout(0.25))
model.add(GRU(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile('rmsprop', 'categorical_crossentropy')

With this improved model, I get a 94.90 F1 score!

Conclusion

In this two-part post series, you learned about word embeddings and RNNs. We applied these to an NLP problem: ATIS. We also made an improvement to our model. To improve the model further, you can try using word embeddings learned on a large corpus like Wikipedia. Also, there are variants of RNNs, such as LSTM or GRU, that can be experimented with.

About the author

Sasank Chilamkurthy works at Fractal Analytics. His work involves deep learning on medical images obtained from radiology and pathology. He is mainly interested in computer vision.


Basics of Image Histograms in OpenCV

Packt
12 Oct 2016
11 min read
In this article by Samyak Datta, author of the book Learning OpenCV 3 Application Development, we are going to focus our attention on a different style of processing pixel values. The output of the techniques that comprise our study in the current article will not be images, but other forms of representation for images, namely image histograms.

We have seen that a two-dimensional grid of intensity values is one of the default forms of representing images in digital systems for processing as well as storage. However, such representations are not at all easy to scale. So, for an image with a reasonably low spatial resolution, say 512 x 512 pixels, working with a two-dimensional grid might not pose any serious issues. However, as the dimensions increase, the corresponding increase in the size of the grid may start to adversely affect the performance of the algorithms that work with the images. A primary advantage that an image histogram has to offer is that the size of a histogram is a constant that is independent of the dimensions of the image. As a consequence, we are guaranteed that irrespective of the spatial resolution of the images we are dealing with, the algorithms that power our solutions will have to deal with a constant amount of data if they are working with image histograms.

Each descriptor captures some particular aspect or feature of the image to construct its own form of representation. One of the common pitfalls of using histograms as a form of image representation, compared with the native form of using the entire two-dimensional grid of values, is loss of information. A full-fledged image representation using pixel intensity values for all pixel locations naturally contains all the information you would need to reconstruct the digital image. However, the same cannot be said about histograms. When we study image histograms in detail, we'll see exactly what information we stand to lose. And this loss of information is prevalent across all forms of image descriptors.

The basics of histograms

At the outset, we will briefly explain the concept of a histogram. Most of you might already know this from your lessons on basic statistics. However, we will reiterate it for the sake of completeness. A histogram is a form of data representation that relies on an aggregation of data points. The data is aggregated into a set of predefined bins that are represented along the x axis, and the number of data points that fall within each of the bins makes up the corresponding counts on the y axis. For example, let's assume that our data looks something like the following:

D = {2, 7, 1, 5, 6, 9, 14, 11, 8, 10, 13}

If we define three bins, namely Bin_1 (1 - 5), Bin_2 (6 - 10), and Bin_3 (11 - 15), then the histogram corresponding to our data would look something like this:

Bins             Frequency
Bin_1 (1 - 5)    3
Bin_2 (6 - 10)   5
Bin_3 (11 - 15)  3

What this histogram tells us is that we have three values between 1 and 5, five between 6 and 10, and three again between 11 and 15. Note that it doesn't tell us what the values are, just that some n values exist in a given bin. A more familiar visual representation of the histogram in question is a bar plot, with the bins along the x axis and their corresponding frequencies along the y axis. Now, in the context of images, how is a histogram computed? Well, it's not that difficult to deduce.
Since the data that we have comprises pixel intensity values, an image histogram is computed by plotting a histogram using the intensity values of all its constituent pixels. What this essentially means is that the sequence of pixel intensity values in our image becomes the data. This is in fact the simplest kind of histogram that you can compute using the information available to you from the image.

Now, coming back to image histograms, there are some basic terminologies (pertaining to histograms in general) that you need to be aware of before you can dip your hands into code. We have explained them in detail here:

Histogram size: The histogram size refers to the number of bins in the histogram.

Range: The range of a histogram is the range of data that we are dealing with. The range of data as well as the histogram size are both important parameters that define a histogram.

Dimensions: Simply put, dimensions refer to the number of the types of items whose values we aggregate in the histogram bins. For example, consider a grayscale image. We might want to construct a histogram using the pixel intensity values for such an image. This would be an example of a single-dimensional histogram because we are just interested in aggregating the pixel intensity values and nothing else. The data, in this case, is spread over a range of 0 to 255. On account of being one-dimensional, such histograms can be represented graphically as 2D plots: one-dimensional data (pixel intensity values) plotted on the x axis (in the form of bins) along with the corresponding frequency counts along the y axis. We have already seen an example of this before. Now, imagine a color image with three channels: red, green, and blue. Let's say that we want to plot a histogram for the intensities in the red and green channels combined. This means that our data now becomes a pair of values (r, g). A histogram that is plotted for such data will have a dimensionality of 2. The plot for such a histogram will be a 3D plot, with the data bins covering the x and y axes and the frequency counts plotted along the z axis.

Now that we have discussed the theoretical aspects of image histograms in detail, let's start thinking along the lines of code. We will start with the simplest (and in fact the most ubiquitous) design of image histograms. The range of our data will be from 0 to 255 (both inclusive), which means that all our data points will be integers that fall within the specified range. Also, the number of data points will equal the number of pixels that make up our input image. The simplicity in design comes from the fact that we fix the size of the histogram (the number of bins) as 256. Now, take a moment to think about what this means. There are 256 different possible values that our data points can take, and we have a separate bin corresponding to each one of those values. So such an image histogram will essentially depict the 256 possible intensity values along with the counts of the number of pixels in the image that are colored with each of the different intensities. Before taking a peek at what OpenCV has to offer, let's try to implement such a histogram on our own! We define a function named computeHistogram() that takes the grayscale image as an input argument and returns the image histogram. From our earlier discussions, it is evident that the histogram must contain 256 entries (for the 256 bins): one for each integer between 0 and 255.
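As a quick aside (this snippet is mine, not from the original article), the same 256-bin design can be sketched in a couple of lines with OpenCV's Python bindings and NumPy; the file name is a placeholder:

import numpy as np
import cv2

# read the image as a single-channel grayscale array of uint8 values
image = cv2.imread("lena.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# flatten to a 1D sequence of intensities and count occurrences of each
# value from 0 to 255; minlength guarantees all 256 bins are present
histogram = np.bincount(image.ravel(), minlength=256)

# the same histogram via OpenCV's own function (a 256x1 float array)
hist_cv = cv2.calcHist([image], [0], None, [256], [0, 256])

The C++ implementation that follows does the same counting by hand with a Mat object.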
The value stored in the histogram corresponding to each of the 256 entries will be the count of the image pixels that have that particular intensity value. So, conceptually, we can use an array for our implementation such that the value stored in histogram[i] (for 0 ≤ i ≤ 255) will be the count of the number of pixels in the image having the intensity i. However, instead of using a C++ array, we will comply with the rules and standards followed by OpenCV and represent the histogram as a Mat object. We have already seen that a Mat object is nothing but a multidimensional array store. The implementation is outlined in the following code snippet:

Mat computeHistogram(Mat input_image) {
    Mat histogram = Mat::zeros(256, 1, CV_32S);
    for (int i = 0; i < input_image.rows; ++i) {
        for (int j = 0; j < input_image.cols; ++j) {
            int binIdx = (int) input_image.at<uchar>(i, j);
            histogram.at<int>(binIdx, 0) += 1;
        }
    }
    return histogram;
}

As you can see, we have chosen to represent the histogram as a 256-element column-vector Mat object. We iterate over all the pixels in the input image and keep incrementing the corresponding counts in the histogram (which had been initialized to 0). As per our description of the image histogram properties, it is easy to see that the intensity value of any pixel is the same as the bin index that is used to index into the appropriate histogram bin to increment the count. Having such an implementation ready, let's test it out with the help of an actual image. The following code demonstrates a main() function that reads an input image, calls the computeHistogram() function that we have just defined, and displays the contents of the histogram that is returned as a result:

int main() {
    Mat input_image = imread("/home/samyak/Pictures/lena.jpg", IMREAD_GRAYSCALE);
    Mat histogram = computeHistogram(input_image);
    cout << "Histogram...\n";
    for (int i = 0; i < histogram.rows; ++i)
        cout << i << " : " << histogram.at<int>(i, 0) << "\n";
    return 0;
}

We have used the fact that the histogram returned from the function will be a single-column Mat object. This makes the code that displays the contents of the histogram much cleaner.

Histograms in OpenCV

We have just seen the implementation of a very basic and minimalistic histogram using first principles in OpenCV. The image histogram was basic in the sense that all the bins were uniform in size and comprised only a single pixel intensity. This made our lives simple when we designed our code for the implementation; there wasn't any need to explicitly check the membership of a data point (the intensity value of a pixel) against all the bins of our histogram. However, we know that a histogram can have bins whose sizes span more than one value. Can you think of the changes that we might need to make in the code we have just written to accommodate bin sizes larger than 1? If this change seems doable to you, try to figure out how to incorporate the possibility of non-uniform bin sizes or multidimensional histograms. By now, things might have started to get a little overwhelming. No need to worry. As always, OpenCV has you covered! The developers of OpenCV have provided a calcHist() function whose sole purpose is to calculate histograms for a given set of arrays.
By arrays, we refer to the images represented as Mat objects, and we use the term set because the function has the capability to compute multidimensional histograms from the given data:

Mat computeHistogram(Mat input_image)
{
    Mat histogram;
    int channels[] = { 0 };
    int histSize[] = { 256 };
    float range[] = { 0, 256 };
    const float* ranges[] = { range };
    calcHist(&input_image, 1, channels, Mat(), histogram, 1, histSize, ranges, true, false);
    return histogram;
}

Before we move on to an explanation of the different parameters involved in the calcHist() function call, I want to bring your attention to the abundant use of arrays in the preceding code snippet. Even arguments as simple as the histogram size are passed to the function in the form of arrays rather than integer values, which at first glance seems quite unnecessary and counter-intuitive. The usage of arrays is due to the fact that the implementation of calcHist() is equipped to handle multidimensional histograms as well, and when we are dealing with such multidimensional histogram data, we require multiple parameters to be passed, one for each dimension. This will become clearer once we demonstrate an example of calculating multidimensional histograms using the calcHist() function. For the time being, we just wanted to clear the immediate confusion that might have popped up in your mind upon seeing the array parameters. Here is a detailed list of the arguments in the calcHist() function call:

Source images
Number of source images
Channel indices
Mask
Dimensions (dims)
Histogram size
Ranges
Uniform flag
Accumulate flag

The last couple of arguments (the uniform and accumulate flags) have default values of true and false, respectively. Hence, the function call that you have just seen can very well be written as follows:

calcHist(&input_image, 1, channels, Mat(), histogram, 1, histSize, ranges);

Summary

Thus, in this article, we studied the fundamentals of using histograms in OpenCV for image processing.

Resources for Article:

Further resources on this subject:
Remote Sensing and Histogram [article]
OpenCV: Image Processing using Morphological Filters [article]
Learn computer vision applications in Open CV [article]
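As a quick aside before moving on to the next article: the same computation is also exposed through OpenCV's Python bindings, which can be handy for quick experiments. The following is a minimal sketch and not part of the article's C++ code; it assumes the cv2 and numpy modules are installed and that lena.jpg is any image readable as grayscale on disk.

import cv2

# Read the image as grayscale, mirroring the C++ example above
img = cv2.imread('lena.jpg', cv2.IMREAD_GRAYSCALE)

# One channel (index 0), no mask, 256 bins covering the range [0, 256)
hist = cv2.calcHist([img], [0], None, [256], [0, 256])

# hist is a 256 x 1 float32 array; print intensity/count pairs
for i, count in enumerate(hist.flatten()):
    print(i, ':', int(count))

The parameters map one-to-one onto the C++ call discussed above: the list of images, the channel indices, the mask, the histogram size, and the ranges.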
Solving an NLP Problem with Keras, Part 1

Sasank Chilamkurthy
12 Oct 2016
5 min read
In a previous two-part post series on Keras, I introduced Convolutional Neural Networks (CNNs) and the Keras deep learning framework. We used them to solve a Computer Vision (CV) problem involving traffic sign recognition. Now, in this two-part post series, we will solve a Natural Language Processing (NLP) problem with Keras. Let's begin.

The Problem and the Dataset

The problem we are going to tackle is Natural Language Understanding. The aim is to extract the meaning of speech utterances. This is still an unsolved problem in general, so we break it down into the solvable, practical problem of understanding the speaker in a limited context. In particular, we want to identify the intent of a speaker asking for information about flights. The dataset we are going to use is the Airline Travel Information System (ATIS) dataset. This dataset was collected by DARPA in the early 90s. ATIS consists of spoken queries on flight-related information. An example utterance is "I want to go from Boston to Atlanta on Monday." Understanding this is then reduced to identifying arguments like Destination and Departure Day. This task is called slot-filling. Here is an example sentence and its labels. You will observe that labels are encoded in an Inside Outside Beginning (IOB) representation:

| Words  | Show | flights | from | Boston | to | New   | York  | today  |
| Labels | O    | O       | O    | B-dept | O  | B-arr | I-arr | B-date |

The ATIS official split contains 4,978/893 sentences for a total of 56,590/9,198 words (average sentence length is 15) in the train/test set. The number of classes (different slots) is 128, including the O label (NULL). Unseen words in the test set are encoded by the <UNK> token, and each digit is replaced with the string DIGIT; that is, 20 is converted to DIGITDIGIT. Our approach to the problem is to use:

Word embeddings
Recurrent neural networks

I'll talk about these briefly in the following sections.

Word Embeddings

Word embeddings map words to vectors in a high-dimensional space. These word embeddings can actually learn the semantic and syntactic information of words; for instance, similar words are close to each other in this space and dissimilar words are far apart. This can be learned either using large amounts of text like Wikipedia, or specifically for a given problem. We will take the second approach for this problem. As an illustration, here are the nearest neighbors in the word embedding space for some of the words. This embedding space was learned by the model that we'll define later in the post; each column lists the neighbors of the word in its first row:

sunday      delta        california    boston       august     time        car
wednesday   continental  colorado      nashville    september  schedule    rental
saturday    united       florida       toronto      july       times       limousine
friday      american     ohio          chicago      june       schedules   rentals
monday      eastern      georgia       phoenix      december   dinnertime  cars
tuesday     northwest    pennsylvania  cleveland    november   ord         taxi
thursday    us           north         atlanta      april      f28         train
wednesdays  nationair    tennessee     milwaukee    october    limo        limo
saturdays   lufthansa    minnesota     columbus     january    departure   ap
sundays     midwest      michigan      minneapolis  may        sfo         later

Recurrent Neural Networks

Convolutional layers can be a great way to pool local information, but they do not really capture the sequentiality of data. Recurrent Neural Networks (RNNs) help us tackle sequential information like natural language. If we are going to predict properties of the current word, we had better remember the words before it too.
An RNN has an internal state/memory that stores a summary of the sequence it has seen so far. This allows us to use RNNs to solve complicated word tagging problems such as Part Of Speech (POS) tagging or slot filling, as in our case. The following diagram illustrates the internals of an RNN (source: Nature). Let's briefly go through it:

x_1, x_2, ..., x_(t-1), x_t, x_(t+1), ... is the input sequence to the RNN.
s_t is the hidden state of the RNN at step t. It is computed from the state at step t-1 as s_t = f(U*x_t + W*s_(t-1)), where f is a nonlinearity such as tanh or ReLU.
o_t is the output at step t, computed as o_t = f(V*s_t).
U, V, and W are the learnable parameters of the RNN.

For our problem, we will pass a sequence of word embeddings as the input to the RNN.

Putting it all together

Now that we've set up the problem and have an understanding of the basic building blocks, let's code it up. Since we are using the IOB representation for labels, it's not simple to calculate the scores of our model. We therefore use the conlleval perl script to compute the F1 scores. I've adapted the code from here for the data preprocessing and score calculation. The complete code is available at GitHub:

$ git clone https://github.com/chsasank/ATIS.keras.git
$ cd ATIS.keras

I recommend using jupyter notebook to run and experiment with the snippets from the tutorial:

$ jupyter notebook

Conclusion

In part 2, we will load the data using data.load.atisfull(). We will also define the Keras model, and then we will train the model. To measure the accuracy of the model, we'll use model.predict_on_batch() and metrics.accuracy.conlleval(). And finally, we will improve our model to achieve better results.

About the author

Sasank Chilamkurthy works at Fractal Analytics. His work involves deep learning on medical images obtained from radiology and pathology. He is mainly interested in computer vision.
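To make the upcoming part 2 more concrete, here is a minimal sketch of what an embedding-plus-RNN slot-filling model can look like in Keras. This is not the exact model from part 2; vocab_size, n_classes, and the training arrays X_train and y_train are hypothetical placeholders for whatever the ATIS loading code produces.

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, TimeDistributed, Dense

vocab_size = 572   # hypothetical vocabulary size
n_classes = 128    # number of slot labels, including O

model = Sequential()
# Map each word index to a 100-dimensional embedding
model.add(Embedding(vocab_size, 100))
# A simple recurrent layer that returns an output at every time step
model.add(SimpleRNN(100, return_sequences=True))
# Predict a slot-label distribution for every word in the sequence
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# X_train: (n_sentences, sentence_length) word indices
# y_train: (n_sentences, sentence_length, n_classes) one-hot labels
# model.fit(X_train, y_train, batch_size=32)

A gated recurrent layer such as GRU or LSTM is a common drop-in replacement for SimpleRNN when longer dependencies matter.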
Thinking Probabilistically

Packt
04 Oct 2016
16 min read
In this article by Osvaldo Martin, the author of the book Bayesian Analysis with Python, we will learn that Bayesian statistics has been developing for more than 250 years now. During this time, it has enjoyed as much recognition and appreciation as disdain and contempt. In the last few decades, it has gained an increasing amount of attention from people in the field of statistics and almost all the other sciences, engineering, and even outside the walls of the academic world. This revival has been possible due to theoretical and computational developments; modern Bayesian statistics is mostly computational statistics. The necessity for flexible and transparent models and more intuitive interpretation of the results of a statistical analysis has only contributed to the trend. (For more resources related to this topic, see here.) Here, we will adopt a pragmatic approach to Bayesian statistics and we will not care too much about other statistical paradigms and their relationship with Bayesian statistics. The aim of this book is to learn how to do Bayesian statistics with Python; philosophical discussions are interesting but they have already been discussed elsewhere in a much richer way than we could discuss in these pages. We will use a computational and modeling approach, and we will learn to think in terms of probabilistic models and apply Bayes' theorem to derive the logical consequences of our models and data. Models will be coded using Python and PyMC3, a great library for Bayesian statistics that hides most of the mathematical details of Bayesian analysis from the user. Bayesian statistics is theoretically grounded in probability theory, and hence it is no wonder that many books about Bayesian statistics are full of mathematical formulas requiring a certain level of mathematical sophistication. Nevertheless, programming allows us to learn and do Bayesian statistics with only modest mathematical knowledge. This is not to say that learning the mathematical foundations of statistics is useless; don't get me wrong, that could certainly help you build better models and gain an understanding of problems, models, and results. In this article, we will cover the following topics: Statistical modeling Probabilities and uncertainty Statistical modeling Statistics is about collecting, organizing, analyzing, and interpreting data, and hence statistical knowledge is essential for data analysis. Another useful skill when analyzing data is knowing how to write code in a programming language such as Python. Manipulating data is usually necessary given that we live in a messy world with even more messy data, and coding helps to get things done. Even if your data is clean and tidy, programming will still be very useful since, as will see, modern Bayesian statistics is mostly computational statistics. Most introductory statistical courses, at least for non-statisticians, are taught as a collection of recipes that more or less go like this; go to the the statistical pantry, pick one can and open it, add data to taste and stir until obtaining a consisting p-value, preferably under 0.05 (if you don't know what a p-value is, don't worry; we will not use them in this book). The main goal in this type of course is to teach you how to pick the proper can. We will take a different approach: we will also learn some recipes, but this will be home-made food rather than canned food; we will learn hot to mix fresh ingredients that will suit different gastronomic occasions. 
But before we can cook we must learn some statistical vocabulary and also some concepts. Exploratory data analysis Data is an essential ingredient of statistics. Data comes from several sources, such as experiments, computer simulations, surveys, field observations, and so on. If we are the ones that will be generating or gathering the data, it is always a good idea to first think carefully about the questions we want to answer and which methods we will use, and only then proceed to get the data. In fact, there is a whole branch of statistics dealing with data collection known as experimental design. In the era of data deluge, we can sometimes forget that getting data is not always cheap. For example, while it is true that the Large Hadron Collider (LHC) produces hundreds of terabytes a day, its construction took years of manual and intellectual effort. In this book we will assume that we already have collected the data and also that the data is clean and tidy, something rarely true in the real world. We will make these assumptions in order to focus on the subject of this book. If you want to learn how to use Python for cleaning and manipulating data and also a primer on statistics and machine learning, you should probably read Python Data Science Handbook by Jake VanderPlas. OK, so let's assume we have our dataset; usually, a good idea is to explore and visualize it in order to get some idea of what we have in our hands. This can be achieved through what is known as Exploratory Data Analysis (EDA), which basically consists of the following: Descriptive statistics Data visualization The first one, descriptive statistics, is about how to use some measures (or statistics) to summarize or characterize the data in a quantitative manner. You probably already know that you can describe data using the mean, mode, standard deviation, interquartile ranges, and so forth. The second one, data visualization, is about visually inspecting the data; you probably are familiar with representations such as histograms, scatter plots, and others. While EDA was originally thought of as something you apply to data before doing any complex analysis or even as an alternative to complex model-based analysis, through the book we will learn that EDA is also applicable to understanding, interpreting, checking, summarizing, and communicating the results of Bayesian analysis. Inferential statistics Sometimes, plotting our data and computing simple numbers, such as the average of our data, is all what we need. Other times, we want to go beyond our data to understand the underlying mechanism that could have generated the data, or maybe we want to make predictions for future data, or we need to choose among several competing explanations for the same data. That's the job of inferential statistics. To do inferential statistics we will rely on probabilistic models. There are many types of model and most of science, and I will add all of our understanding of the real world, is through models. The brain is just a machine that models reality (whatever reality might be) http://www.tedxriodelaplata.org/videos/m%C3%A1quina-construye-realidad. What are models? Models are a simplified descriptions of a given system (or process). 
Those descriptions are purposely designed to capture only the most relevant aspects of the system, and hence, most models do not try to pretend they are able to explain everything; on the contrary, if we have a simple and a complex model and both models explain the data well, we will generally prefer the simpler one. Model building, no matter which type of model you are building, is an iterative process following more or less the same basic rules. We can summarize the Bayesian modeling process using three steps: Given some data and some assumptions on how this data could have been generated, we will build models. Most of the time, models will be crude approximations, but most of the time this is all we need. Then we will use Bayes' theorem to add data to our models and derive the logical consequences of mixing the data and our assumptions. We say we are conditioning the model on our data. Lastly, we will check that the model makes sense according to different criteria, including our data and our expertise on the subject we are studying. In general, we will find ourselves performing these three steps in a non-linear iterative fashion. Sometimes we will retrace our steps at any given point: maybe we made a silly programming mistake, maybe we found a way to change the model and improve it, maybe we need to add more data. Bayesian models are also known as probabilistic models because they are built using probabilities. Why probabilities? Because probabilities are the correct mathematical tool for dealing with uncertainty in our data and models, so let's take a walk through the garden of forking paths. Probabilities and uncertainty While probability theory is a mature and well-established branch of mathematics, there is more than one interpretation of what probabilities are. To a Bayesian, a probability is a measure that quantifies the uncertainty level of a statement. If we know nothing about coins and we do not have any data about coin tosses, it is reasonable to think that the probability of a coin landing heads could take any value between 0 and 1; that is, in the absence of information, all values are equally likely, our uncertainty is maximum. If we know instead that coins tend to be balanced, then we may say that the probability of acoin landing is exactly 0.5 or may be around 0.5 if we admit that the balance is not perfect. If we collect data, we can update these prior assumptions and hopefully reduce the uncertainty about the bias of the coin. Under this definition of probability, it is totally valid and natural to ask about the probability of life on Mars, the probability of the mass of the electron being 9.1 x 10-31 kg, or the probability of the 9th of July of 1816 being a sunny day. Notice for example that life on Mars exists or not; it is a binary outcome, but what we are really asking is how likely is it to find life on Mars given our data and what we know about biology and the physical conditions on that planet? The statement is about our state of knowledge and not, directly, about a property of nature. We are using probabilities because we can not be sure about the events, not because the events are necessarily random. Since this definition of probability is about our epistemic state of mind, sometimes it is referred to as the subjective definition of probability, explaining the slogan of subjective statistics often attached to the Bayesian paradigm. 
Nevertheless, this definition does not mean all statements should be treated as equally valid and so anything goes; this definition is about acknowledging that our understanding of the world is imperfect and conditioned by the data and models we have made. There is no such thing as a model-free or theory-free understanding of the world; even if it were possible to free ourselves from our social preconditioning, we would end up with a biological limitation: our brain, subject to the evolutionary process, has been wired with models of the world. We are doomed to think like humans and we will never think like bats or anything else! Moreover, the universe is an uncertain place and all we can do is make probabilistic statements about it. Notice that it does not matter whether the underlying reality of the world is deterministic or stochastic; we are using probability as a tool to quantify uncertainty.

Logic is about thinking without making mistakes. In Aristotelian or classical logic, we can only have statements that are true or false. In the Bayesian definition of probability, certainty is just a special case: a true statement has a probability of 1 and a false one has a probability of 0. We would assign a probability of 1 to life on Mars only after having conclusive data indicating that something is growing and reproducing and doing other activities we associate with living organisms. Notice, however, that assigning a probability of 0 is harder, because we can always think that there is some Martian spot that is unexplored, or that we have made mistakes with some experiment, or several other reasons that could lead us to falsely believe life is absent on Mars when it is not. Interestingly enough, Cox mathematically proved that if we want to extend logic to contemplate uncertainty, we must use probabilities and probability theory, from which Bayes' theorem is just a logical consequence, as we will see soon. Hence, another way of thinking about Bayesian statistics is as an extension of logic for dealing with uncertainty, something that clearly has nothing to do with subjective reasoning in the pejorative sense. Now that we know the Bayesian interpretation of probability, let's see some of the mathematical properties of probabilities. For a more detailed study of probability theory, you can read Introduction to Probability by Joseph K. Blitzstein and Jessica Hwang.

Probabilities are numbers in the interval [0, 1], that is, numbers between 0 and 1, including both extremes. Probabilities follow some rules; one of these rules is the product rule:

p(A, B) = p(A|B) p(B)

We read this as follows: the probability of A and B is equal to the probability of A given B, multiplied by the probability of B. The expression p(A|B) is used to indicate a conditional probability; the name refers to the fact that the probability of A is conditioned on knowing B. For example, the probability that a pavement is wet is different from the probability that the pavement is wet given that we know it is raining. In that example the conditional probability is larger, but in general a conditional probability can be larger than, smaller than, or equal to the unconditioned probability. If knowing B does not provide us with information about A, then p(A|B) = p(A); that is, A and B are independent of each other. On the contrary, if knowing B gives us useful information about A, then p(A|B) differs from p(A); in the wet-pavement example, p(A|B) > p(A). Conditional probabilities are a key concept in statistics, and understanding them is crucial to understanding Bayes' theorem, as we will see soon. Let's try to understand them from a different perspective.
If we reorder the equation for the product rule, we get the following:

p(A|B) = p(A, B) / p(B)

Hence, p(A|B) is the probability that both A and B happen, relative to the probability of B happening. Why do we divide by p(B)? Knowing B is equivalent to saying that we have restricted the space of possible events to B, and thus, to find the conditional probability, we take the favorable cases and divide them by the total number of events. It is important to realize that all probabilities are indeed conditional; there is no such thing as an absolute probability floating in a vacuum. There is always some model, assumption, or condition, even if we don't notice or know them. The probability of rain is not the same if we are talking about Earth, Mars, or some other place in the Universe, in the same way that the probability of a coin landing heads or tails depends on our assumptions about the coin being biased in one way or another. Now that we are more familiar with the concept of probability, let's jump to the next topic, probability distributions.

Probability distributions

A probability distribution is a mathematical object that describes how likely different events are. In general, these events are restricted somehow to a set of possible events. A common and useful conceptualization in statistics is to think that data was generated from some probability distribution with unobserved parameters. Since the parameters are unobserved and we only have data, we will use Bayes' theorem to invert the relationship, that is, to go from the data to the parameters. Probability distributions are the building blocks of Bayesian models; by combining them in proper ways we can get useful complex models. We will meet several probability distributions throughout the book; every time we discover one we will take a moment to try to understand it. Probably the most famous of all of them is the Gaussian or normal distribution. A variable x follows a Gaussian distribution if its values are dictated by the following formula:

p(x | μ, σ) = 1 / (σ √(2π)) · exp(-(x - μ)² / (2σ²))

In the formula, μ and σ are the parameters of the distribution. The first one, μ, can take any real value, that is, μ ∈ (-∞, ∞), and dictates the mean of the distribution (and also the median and mode, which are all equal). The second one, σ, is the standard deviation, which can only be positive and dictates the spread of the distribution. Since there is an infinite number of possible combinations of μ and σ values, there is an infinite number of instances of the Gaussian distribution, and all of them belong to the same Gaussian family. Mathematical formulas are concise and unambiguous, and some people say even beautiful, but we must admit that meeting them can be intimidating; a good way to break the ice is to use Python to explore them. Let's see what the Gaussian distribution family looks like:

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns

mu_params = [-1, 0, 1]
sd_params = [0.5, 1, 1.5]
x = np.linspace(-7, 7, 100)
f, ax = plt.subplots(len(mu_params), len(sd_params), sharex=True, sharey=True)
for i in range(3):
    for j in range(3):
        mu = mu_params[i]
        sd = sd_params[j]
        y = stats.norm(mu, sd).pdf(x)
        ax[i,j].plot(x, y)
        ax[i,j].set_ylim(0, 1)
        ax[i,j].plot(0, 0, label="$\mu$ = {:3.2f}\n$\sigma$ = {:3.2f}".format(mu, sd), alpha=0)
        ax[i,j].legend()
ax[2,1].set_xlabel('$x$')
ax[1,0].set_ylabel('$pdf(x)$')

The output of the preceding code is a 3 x 3 grid of plots, one Gaussian curve for each combination of μ and σ. A variable, such as x, that comes from a probability distribution is called a random variable. This does not mean that the variable can take any possible value.
On the contrary, the values are strictly dictated by the probability distribution; the randomness arises from the fact that we cannot predict which value the variable will take, only the probability of observing each of those values. A common notation used to say that a variable is distributed as a Gaussian or normal distribution with parameters μ and σ is as follows:

x ~ N(μ, σ)

The symbol ~ is read as "is distributed as". There are two types of random variable: continuous and discrete. Continuous variables can take any value from some interval (we can use Python floats to represent them), and discrete variables can take only certain values (we can use Python integers to represent them). Many models assume that successive values of a random variable are all sampled from the same distribution and that those values are independent of each other. In such a case, we will say that the variables are independently and identically distributed, or iid variables for short. Using mathematical notation, two variables x and y are independent if, for every value of x and y:

p(x, y) = p(x) p(y)

A common example of non-iid variables is time series, where the temporal dependency in the random variable is a key feature that should be taken into account.

Summary

In this article, we took a practical approach to Bayesian statistics and saw how to implement Bayesian statistics with Python. Here we learned to think of problems in terms of probability and uncertainty and to apply Bayes' theorem to derive their results.

Resources for Article:

Further resources on this subject:
Python Data Science Up and Running [article]
Mining Twitter with Python – Influence and Engagement [article]
Exception Handling in MySQL for Python [article]
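To give a small taste of how the coin-flipping discussion above turns into code, here is a minimal sketch of a Beta-Bernoulli model written with PyMC3, the library the article says will be used throughout the book. The data array is a made-up example, and the priors and sampler settings are illustrative assumptions, not the book's exact example.

import numpy as np
import pymc3 as pm

# Pretend we observed 4 heads in 10 tosses (1 = heads, 0 = tails)
data = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

with pm.Model() as coin_model:
    # Uniform prior over the bias of the coin: maximum initial uncertainty
    theta = pm.Beta('theta', alpha=1, beta=1)
    # Likelihood of the observed tosses given the bias
    y = pm.Bernoulli('y', p=theta, observed=data)
    # Draw posterior samples; the posterior quantifies our updated uncertainty
    trace = pm.sample(1000)

print(trace['theta'].mean())

The posterior mean of theta will sit near the observed proportion of heads but remain spread out, reflecting that ten tosses only partially reduce our uncertainty about the bias.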
Supervised Machine Learning

Packt
04 Oct 2016
13 min read
In this article by Anshul Joshi, the author of the book Julia for Data Science, we will learn that data science involves understanding data, gathering data, munging data, taking the meaning out of that data, and then machine learning if needed. Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. (For more resources related to this topic, see here.) The key features offered by Julia are: A general purpose high-level dynamic programming language designed to be effective for numerical and scientific computing A Low-Level Virtual Machine (LLVM) based Just-in-Time (JIT) compiler that enables Julia to approach the performance of statically-compiled languages like C/C++ What is machine learning? Generally, when we talk about machine learning, we get into the idea of us fighting wars with intelligent machines that we created but went out of control. These machines are able to outsmart the human race and become a threat to human existence. These theories are nothing but created for our entertainment. We are still very far away from such machines. So, the question is: what is machine learning? Tom M. Mitchell gave a formal definition- "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." It says that machine learning is teaching computers to generate algorithms using data without programming them explicitly. It transforms data into actionable knowledge. Machine learning has close association with statistics, probability, and mathematical optimization. As technology grew, there is one thing that grew with it exponentially—data. We have huge amounts of unstructured and structured data growing at a very great pace. Lots of data is generated by space observatories, meteorologists, biologists, fitness sensors, surveys, and so on. It is not possible to manually go through this much amount of data and find patterns or gain insights. This data is very important for scientists, domain experts, governments, health officials, and even businesses. To gain knowledge out of this data, we need self-learning algorithms that can help us in decision making. Machine learning evolved as a subfield of artificial intelligence, which eliminates the need to manually analyze large amounts of data. Instead of using machine learning, we make data-driven decisions by gaining knowledge using self-learning predictive models. Machine learning has become important in our daily lives. Some common use cases include search engines, games, spam filters, and image recognition. Self-driving cars also use machine learning. Some basic terminologies used in machine learning: Features: Distinctive characteristics of the data point or record Training set: This is the dataset that we feed to train the algorithm that helps us to find relationships or build a model Testing set: The algorithm generated using the training dataset is tested on the testing dataset to find the accuracy Feature vector: An n-dimensional vector that contains the features defining an object Sample: An item from the dataset or the record Uses of machine learning Machine learning in one way or another is used everywhere. Its applications are endless. 
Let's discuss some very common use cases: E-mail spam filtering: Every major e-mail service provider uses machine learning to filter out spam messages from the Inbox to the Spam folder. Predicting storms and natural disasters: Machine learning is used by meteorologists and geologists to predict the natural disasters using weather data, which can help us to take preventive measures. Targeted promotions/campaigns and advertising: On social sites, search engines, and maybe in mailboxes, we see advertisements that somehow suit our taste. This is made feasible using machine learning on the data from our past searches, our social profile or the e-mail contents. Self-driving cars: Technology giants are currently working on self driving cars. This is made possible using machine learning on the feed of the actual data from human drivers, image and sound processing, and various other factors. Machine learning is also used by businesses to predict the market. It can also be used to predict the outcomes of elections and the sentiment of voters towards a particular candidate. Machine learning is also being used to prevent crime. By understanding the pattern of the different criminals, we can predict a crime that can happen in future and can prevent it. One case that got a huge amount of attention was of a big retail chain in the United States using machine learning to identify pregnant women. The retailer thought of the strategy to give discounts on multiple maternity products, so that they would become loyal customers and will purchase items for babies which have a high profit margin. The retailer worked on the algorithm to predict the pregnancy using useful patterns in purchases of different products which are useful for pregnant women. Once a man approached the retailer and asked for the reason that his teenage daughter is receiving discount coupons for maternity items. The retail chain offered an apology but later the father himself apologized when he got to know that his daughter was indeed pregnant. This story may or may not be completely true, but retailers indeed analyze their customers' data routinely to find out patterns and for targeted promotions, campaigns, and inventory management. Machine learning and ethics Let's see where machine learning is used very frequently: Retailers: In the previous example, we mentioned how retail chains use data for machine learning to increase their revenue as well as to retain their customers. Spam filtering: E-mails are processed using various machine learning algorithms for spam filtering. Targeted advertisements: In our mailbox, social sites, or search engines, we see advertisements of our liking. These are only some of the actual use cases that are implemented in the world today. One thing that is common between them is the user data. In the first example, retailers are using the history of transactions done by the user for targeted promotions and campaigns and for inventory management, among other things. Retail giants do this by providing users a loyalty or sign-up card. In the second example, the e-mail service provider uses trained machine learning algorithms to detect and flag spam. It does by going through the contents of e-mail/attachments and classifying the sender of the e-mail. In the third example, again the e-mail provider, social network, or search engine will go through our cookies, our profile, or our mails to do the targeted advertising. 
In all of these examples, it is mentioned in the terms and conditions of the agreement when we sign up with the retailer, e-mail provider, or social network that the user's data will be used but privacy will not be violated. It is really important that before using data that is not publicly available, we take the required permissions. Also, our machine learning models shouldn't do discrimination on the basis of region, race, and sex or of any other kind. The data provided should not be used for purposes not mentioned in the agreement or illegal in the region or country of existence. Machine learning – the process Machine learning algorithms are trained in keeping with the idea of how the human brain works. They are somewhat similar. Let's discuss the whole process. The machine learning process can be described in three steps: Input Abstraction Generalization These three steps are the core of how the machine learning algorithm works. Although the algorithm may or may not be divided or represented in such a way, this explains the overall approach. The first step concentrates on what data should be there and what shouldn't. On the basis of that, it gathers, stores, and cleans the data as per the requirements. The second step involves that the data be translated to represent the bigger class of data. This is required as we cannot capture everything and our algorithm should not be applicable for only the data that we have. The third step focuses on the creation of the model or an action that will use this abstracted data, which will be applicable for the broader mass. So, what should be the flow of approaching a machine learning problem? In this particular figure, we see that the data goes through the abstraction process before it can be used to create the machine learning algorithm. This process itself is cumbersome. The process follows the training of the model, which is fitting the model into the dataset that we have. The computer does not pick up the model on its own, but it is dependent on the learning task. The learning task also includes generalizing the knowledge gained on the data that we don't have yet. Therefore, training the model is on the data that we currently have and the learning task includes generalization of the model for future data. It depends on our model how it deduces knowledge from the dataset that we currently have. We need to make such a model that can gather insights into something that wasn't known to us before and how it is useful and can be linked to the future data. Different types of machine learning Machine learning is divided mainly into three categories: Supervised learning Unsupervised learning Reinforcement learning In supervised learning, the model/machine is presented with inputs and the outputs corresponding to those inputs. The machine learns from these inputs and applies this learning in further unseen data to generate outputs. Unsupervised learning doesn't have the required outputs; therefore it is up to the machine to learn and find patterns that were previously unseen. In reinforcement learning, the machine continuously interacts with the environment and learns through this process. This includes a feedback loop. Understanding decision trees Decision tree is a very good example of divide and conquer. It is one of the most practical and widely used methods for inductive inference. It is a supervised learning method that can be used for both classification and regression. 
It is non-parametric and its aim is to learn by inferring simple decision rules from the data and to create a model that can predict the value of the target variable. Before taking a decision, we weigh the pros and cons of the different options that we have. Let's say we want to purchase a phone and we have multiple choices in the price segment. Each of the phones has something really good about it, maybe better than the others. To make a choice, we start by considering the most important feature that we want, and like this, we create a series of feature checks that a phone has to pass to become the final choice. In this section, we will learn about:

Decision trees
Entropy measures
Random forests

We will also learn about famous decision tree learning algorithms such as ID3 and C5.0.

Decision tree learning algorithms

There are various decision tree learning algorithms that are actually variations of the core algorithm. The core algorithm is a top-down, greedy search through all possible trees. We are going to discuss two families of algorithms:

ID3
C4.5 and C5.0

The first algorithm, Iterative Dichotomiser 3 (ID3), was developed by Ross Quinlan in 1986. The algorithm proceeds by creating a multiway tree, where it uses greedy search to find, for each node, the feature that yields the maximum information gain for the categorical targets. As trees can grow to their maximum size, which can result in over-fitting of the data, pruning is used to obtain a more generalized model. C4.5 came after ID3 and eliminated the restriction that all features must be categorical. It does this by dynamically defining a discrete attribute from the numerical variables, partitioning the continuous attribute values into a discrete set of intervals. C4.5 creates sets of if-then rules from the trained trees of the ID3 algorithm. C5.0 is the latest version; it builds smaller rule sets and uses comparatively less memory.

An example

Let's apply what we've learned to create a decision tree using Julia. We will be using the example available for Python on scikit-learn.org, adapted to ScikitLearn.jl by Cedric St-Jean. We will first have to add the required packages:

julia> Pkg.update()
julia> Pkg.add("DecisionTree")
julia> Pkg.add("ScikitLearn")
julia> Pkg.add("PyPlot")

ScikitLearn provides the Julia interface to Python's well-known machine learning library:

julia> using ScikitLearn
julia> using DecisionTree
julia> using PyPlot

After adding the required packages, we will create the dataset that we will be using in our example:

julia> # Create a random dataset
julia> srand(100)
julia> X = sort(5 * rand(80))
julia> XX = reshape(X, 80, 1)
julia> y = sin(X)
julia> y[1:5:end] += 3 * (0.5 - rand(16))

This will generate a 16-element Array{Float64,1}. Now we will create instances of two different models: in one model we will not limit the depth of the tree, and in the other model we will prune the decision tree on the basis of purity. We will then fit both models to the dataset that we have. In the first model, our decision tree has 25 leaf nodes and a depth of 8. In the second model, where we prune the decision tree, it has six leaf nodes and a depth of 4. Now we will use the models to predict on the test dataset:

julia> # Predict
julia> X_test = 0:0.01:5.0
julia> y_1 = predict(regr_1, hcat(X_test))
julia> y_2 = predict(regr_2, hcat(X_test))

This creates a 501-element Array{Float64,1}.
To better understand the results, let's plot both models on the dataset that we have:

julia> # Plot the results
julia> scatter(X, y, c="k", label="data")
julia> plot(X_test, y_1, c="g", label="no pruning", linewidth=2)
julia> plot(X_test, y_2, c="r", label="pruning_purity_threshold=0.05", linewidth=2)
julia> xlabel("data")
julia> ylabel("target")
julia> title("Decision Tree Regression")
julia> legend(prop=Dict("size"=>10))

Decision trees tend to overfit the data, so pruning is required to make the model more generalized. But if we prune more than required, it may lead to an inaccurate model, so we need to find the most suitable pruning level. It is quite evident that the first decision tree overfits our dataset, whereas the second decision tree model is comparatively more generalized.

Summary

In this article, we learned about machine learning and its uses. Providing computers with the ability to learn and improve has far-reaching uses in this world. It is used in predicting disease outbreaks, predicting the weather, games, robots, self-driving cars, personal assistants, and a lot more. There are three different types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. We also learned about decision trees.

Resources for Article:

Further resources on this subject:
Specialized Machine Learning Topics [article]
Basics of Programming in Julia [article]
More about Julia [article]
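The section above lists entropy measures among its topics but does not spell out the formulas, so it is worth recording the standard textbook definitions that ID3-style algorithms rely on (these are the usual forms, not quoted from the book). The entropy of a set of samples S with c classes is

H(S) = - \sum_{i=1}^{c} p_i \log_2 p_i

where p_i is the proportion of samples in S belonging to class i. The information gain of splitting S on an attribute A is then

IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)

where S_v is the subset of S for which attribute A takes the value v. At each node, ID3 greedily picks the attribute with the largest information gain.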
Parallel Computing

Packt
30 Sep 2016
9 min read
In this article written by Jalem Raj Rohit, author of the book Julia Cookbook, cover the following recipes: Basic concepts of parallel computing Data movement Parallel map and loop operations Channels (For more resources related to this topic, see here.) Introduction In this article, you will learn about performing parallel computing and using it to handle big data. So, some concepts like data movements, sharded arrays, and the map-reduce framework are important to know in order to handle large amounts of data by computing on it using parallelized CPUs. So, all the concepts discussed in this article will help you build good parallel computing and multiprocessing basics, including efficient data handling and code optimization. Basic concepts of parallel computing Parallel computing is a way of dealing with data in a parallel way. This can be done by connecting multiple computers as a cluster and using their CPUs for carrying out the computations. This style of computation is used when handling large amounts of data and also while running complex algorithms over significantly large data. The computations are executed faster due to the availability of multiple CPUs running them in parallel as well as the direct availability of RAM to each of them. Getting ready Julia has an in-built support for parallel computing and multiprocessing. So, these computations rarely require any external libraries for the task. How to do it… Julia can be started in your local computer using multiple cores of your CPU. So, we will now have multiple workers for the process. This is how you can fire up Julia in the multi-processing mode in your terminal. This creates two worker process in the machine, which means it uses twwo CPU cores for the purpose julia -p 2 The output looks something like this. It might differ for different operating systems and different machines: Now, we will look at the remotecall() function. It takes in multiple arguments, the first one being the process which we want to assign the task to. The next argument would be the function which we want to execute. The subsequent arguments would be the parameters or the arguments of that function which we want to execute. In this example, we will create a 2 x 2 random matrix and assign it to the process number 2. This can be done as follows: task = remotecall(2, rand, 2, 2) The preceding command gives the following output: Now that the remotecall() function for remote referencing has been executed, we will fetch the results of the function through the fetch() function. This can be done as follows: fetch(task) The preceding command gives the following output: Now, to perform some mathematical operations on the generated matrix, we can use the @spawnat macro, which takes in the mathematical operation and the fetch() function. The @spawnat macro actually wraps the expression 5 .+ fetch(task) into an anonymous function and runs it on the second machine This can be done as follows: task2 = @spawnat 5 .+ fetch(task) There is also a function that eliminates the need of using two different functions: remotecall() and fetch(). The remotecall_fetch() function takes in multiple arguments. The first one being the process that the task is being assigned. The next argument is the function which you want to be executed. The subsequent arguments would be the arguments or the parameters of the function that you want to execute. Now, we will use the remote call_fetch() function to fetch an element of the task matrix for a particular index. 
This can be done as follows: remotecall_fetch(2, getindex, task2, 1, 1) How it works… Julia can be started in the multiprocessing mode by specifying the number of processes needed while starting up the REPL. In this example, we started Julia as a two process mode. The maximum number of processes depends on the number of cores available in the CPU. The remotecall() function helps in selecting a particular process from the running processes in order to run a function or, in fact, any computation for us. The fetch() function is used to fetch the results of the remotecall() function from a common data resource (or the process) for all the running processes. The details of the data source would be covered in the later sections. The results of the fetch() function can also be used for further computations, which can be carried out with the @spawnat macro along with the results of fetch(). This would assign a process for the computation. The remotecall_fetch() function further eliminates the need for the fetch function in case of a direct execution. This has both the remotecall() and fetch() operations built into it. So, it acts as a combination of both the second and third points in this section. Data movement In parallel computing, data movements are quite common and are also a thing to be minimized due to the time and the network overhead due to the movements. In this recipe, we will see how that can be optimized to avoid latency as much as we can. Getting ready To get ready for this recipe, you need to have the Julia REPL started in the multiprocessing mode. This is explained in the Getting ready section of the preceding recipe. How to do it… Firstly, we will see how to do a matrix computation using the @spawn macro, which helps in data movement. So, we construct a matrix of shape 200 x 200 and then try to square it using the @spawn macro. This can be done as follows: mat = rand(200, 200) exec_mat = @spawn mat^2 fetch(exec_mat) The preceding command gives the following output: Now, we will look at an another way to achieve the same. This time, we will use the @spawn macro directly instead of the initialization step. We will discuss the advantages and drawbacks of each method in the How it works… section. So, this can be done as follows: mat = @spawn rand(200, 200)^2 fetch(mat) The preceding command gives the following output: How it works… In this example, we try to construct a 200X200 matrix and then used the @spawn macro to spawn a process in the CPU to execute the same for us. The @spawn macro spawns one of the two processes running, and it uses one of them for the computation. In the second example, you learned how to use the @spawn macro directly without an extra initialization part. The fetch() function helps us fetch the results from a common data resource of the processes. More on this will be covered in the following recipes. Parallel maps and loop operations In this recipe, you will learn a bit about the famous Map Reduce framework and why it is one of the most important ideas in the domains of big data and parallel computing. You will learn how to parallelize loops and use reducing functions on them through the several CPUs and machines and the concept of parallel computing, which you learned about in the previous recipes. Getting ready Just like the previous sections, Julia just needs to be running in the multiprocessing mode to follow along the following examples. This can be done through the instructions given in the first section. 
How to do it… Firstly, we will write a function that takes and adds n random bits. The writing of this function has nothing to do with multiprocessing. So, it has simple Julia functions and loops. This function can be written as follows: Now, we will use the @spawn macro, which we learned previously to run the count_heads() function as separate processes. The count_heads()function needs to be in the same directory for this to work. This can be done as follows: require("count_heads") a = @spawn count_heads(100) b = @spawn count_heads(100) fetch(a) + fetch(b) However, we can use the concept of multi-processing and parallelize the loop directly as well as take the sum. The parallelizing part is called mapping, and the addition of the parallelized bits is called reduction. Thus, the process constitutes the famous Map-Reduce framework. This can be made possible using the @parallel macro, as follows: nheads = @parallel (+) for i = 1:200 Int(rand(Bool)) end How it works… The first function is a simple Julia function that adds random bits with every loop iteration. It was created just for the demonstration of Map-Reduce operations. In the second point, we spawn two separate processes for executing the function and then fetch the results of both of them and add them up. However, that is not really a neat way to carry out parallel computation of functions and loops. Instead, the @parallel macro provides a better way to do it, which allows the user to parallelize the loop and then reduce the computations through an operator, which together would be called the Map-Reduce operation. Channels Channels are like the background plumbing for parallel computing in Julia. They are like the reservoirs from where the individual processes access their data from. Getting ready The requisite is similar to the previous sections. This is mostly a theoretical section, so you just need to run your experiments on your own. For that, you need to run your Julia REPL in a multiprocessing mode. How to do it… Channels are shared queues with a fixed length. They are common data reservoirs for the processes which are running. The channels are like common data resources, which multiple readers or workers can access. They can access the data through the fetch() function, which we already discussed in the previous sections. The workers can also write to the channel through the put!() function. This means that the workers can add more data to the resource, which can be accessed by all the workers running a particular computation. Closing a channel after usage is a good practice to avoid data corruption and unnecessary memory usage. It can be done using the close() function. Summary In this article we covered the basic concepts of parallel computing and data movement that takes place in the network. We also learned about parallel maps and loop operations along with the famous Map Reduce framework. At the end we got a brief understanding of channels and how individual processes access their data from channels. Resources for Article: Further resources on this subject: More about Julia [article] Basics of Programming in Julia [article] Simplifying Parallelism Complexity in C# [article]
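One compact way to summarize the Map-Reduce pattern described above: if f is the work done by one loop iteration and ⊕ is the reduction operator supplied to @parallel, the construct computes

\mathrm{result} = f(1) \oplus f(2) \oplus \cdots \oplus f(n) = \bigoplus_{i=1}^{n} f(i)

with the individual f(i) evaluations distributed across the available worker processes (the map step) and the ⊕ applications combining their partial results (the reduce step). In the coin-flipping example, f(i) is Int(rand(Bool)) and ⊕ is +, so the result is simply the total number of heads.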
Deep Learning with Torch

Preetham Sreenivas
29 Sep 2016
10 min read
Torch is a scientific computing framework built on top of Lua[JIT]. The nn package and the ecosystem around it provide a very powerful framework for building deep learning models, striking a perfect balance between speed and flexibility. It is used at Facebook AI Research (FAIR), Twitter Cortex, DeepMind, Yann LeCun's group at NYU, Fei-Fei Li's at Stanford, and many more industrial and academic labs. If you are like me, and don't like writing equations for backpropagation every time you want to try a simple model, Torch is a great solution. With Torch, you can also do pretty much anything you can imagine, whether that is writing custom loss functions, dreaming up an arbitrary acyclic graph network, using multiple GPUs, or loading models pre-trained on ImageNet from the caffe model zoo (yes, you can load models trained in caffe with a single line). Without further ado, let's jump right into the awesome world of deep learning.

Prerequisites

Some knowledge of deep learning—A Primer, Bengio's deep learning book, Hinton's Coursera course.
A bit of Lua. Its syntax is very C-like and can be picked up fairly quickly if you know Python or JavaScript—Learn Lua in 15 minutes, Torch For Numpy Users.
A machine with Torch installed, since this is intended to be hands-on.

On Ubuntu 12+ and Mac OS X, installing Torch looks like this:

# in a terminal, run the commands WITHOUT sudo
$ git clone https://github.com/torch/distro.git ~/torch --recursive
$ cd ~/torch; bash install-deps;
$ ./install.sh
# On Linux with bash
$ source ~/.bashrc
# On OSX or in Linux with no bash.
$ source ~/.profile

Once you've installed Torch, you can run a Torch script using:

$ th script.lua
# alternatively you can fire up a terminal torch interpreter using th -i
$ th -i
# and run multiple scripts one by one, the variables will be accessible to other scripts
> dofile 'script1.lua'
> dofile 'script2.lua'
> print(variable) -- variable from either of these scripts

The sections below are very code intensive, but you can run these commands from Torch's terminal interpreter:

$ th -i

Building a Model: The Basics

A module is the basic building block of any Torch model. It has forward and backward methods for the forward and backward passes of backpropagation. You can combine modules using containers, and of course, calling forward and backward on containers propagates inputs and gradients correctly.

-- A simple mlp model with sigmoids
require 'nn'
linear1 = nn.Linear(100,10) -- A linear layer Module
linear2 = nn.Linear(10,2)
-- You can combine modules using containers, sequential is the most used one
model = nn.Sequential() -- A container
model:add(linear1)
model:add(nn.Sigmoid())
model:add(linear2)
model:add(nn.Sigmoid())

-- the forward step
input = torch.rand(100)
target = torch.rand(2)
output = model:forward(input)

Here, output is the 2-element prediction of the full model, since the forward call runs through the whole container. Now we need a criterion to measure how well our model is performing, in other words, a loss function. nn.Criterion is the abstract class that all loss functions inherit. It provides forward and backward methods, computing the loss and the gradients respectively. Torch provides most of the commonly used criterions out of the box. It isn't much of an effort to write your own either.
criterion = nn.MSECriterion() -- mean squared error criterion
loss = criterion:forward(output, target)
gradientsAtOutput = criterion:backward(output, target)
-- To perform the backprop step, we need to pass these gradients
-- to the backward method of the model
gradAtInput = model:backward(input, gradientsAtOutput)
lr = 0.1 -- learning rate for our model
model:updateParameters(lr) -- updates the parameters using the lr parameter

The updateParameters method just subtracts the gradients, scaled by the learning rate, from the model parameters. This is vanilla stochastic gradient descent. Typically, the updates we do are more complex. For example, if we want to use momentum, we need to keep track of the updates we did in the previous epoch. There are a lot more fancy optimization schemes such as RMSProp, adam, adagrad, and L-BFGS that do more complex things like adapting the learning rate, momentum factor, and so on. The optim package provides these optimization routines out of the box.

Dataset

We'll use the German Traffic Sign Recognition Benchmark (GTSRB) dataset. This dataset has 43 classes of traffic signs of varying sizes, illuminations, and occlusions. There are 39,000 training images and 12,000 test images. Traffic signs in each of the images are not centered and they have a 10% border around them. I have included a shell script for downloading the data along with the code for this tutorial in this github repo.[1]

git clone https://github.com/preethamsp/tutorial.gtsrb.torch.git
cd tutorial.gtsrb.torch/datasets
bash download_gtsrb.sh

[1] The code in the repo is much more polished than the snippets in the tutorial. It is modular and allows you to change the model and/or datasets easily.

Model

Let's build a downsized VGG-style model with what we've learned:

function createModel()
  require 'nn'
  nbClasses = 43
  local net = nn.Sequential()

  --[[building block: adds a convolution layer, batch norm layer
      and a relu activation to the net]]--
  function ConvBNReLU(nInputPlane, nOutputPlane)
    -- kernel size = (3,3), stride = (1,1), padding = (1,1)
    net:add(nn.SpatialConvolution(nInputPlane, nOutputPlane, 3,3, 1,1, 1,1))
    net:add(nn.SpatialBatchNormalization(nOutputPlane,1e-3))
    net:add(nn.ReLU(true))
  end

  ConvBNReLU(3,32)
  ConvBNReLU(32,32)
  net:add(nn.SpatialMaxPooling(2,2,2,2))
  net:add(nn.Dropout(0.2))

  ConvBNReLU(32,64)
  ConvBNReLU(64,64)
  net:add(nn.SpatialMaxPooling(2,2,2,2))
  net:add(nn.Dropout(0.2))

  ConvBNReLU(64,128)
  ConvBNReLU(128,128)
  net:add(nn.SpatialMaxPooling(2,2,2,2))
  net:add(nn.Dropout(0.2))

  net:add(nn.View(128*6*6))
  net:add(nn.Dropout(0.5))
  net:add(nn.Linear(128*6*6,512))
  net:add(nn.BatchNormalization(512))
  net:add(nn.ReLU(true))
  net:add(nn.Linear(512,nbClasses))
  net:add(nn.LogSoftMax())

  return net
end

The first layer contains three input channels because we're going to pass RGB images (three channels). For grayscale images, the first layer has one input channel. I encourage you to play around and modify the network.[2] There are a bunch of new modules that need some elaboration. The Dropout module randomly deactivates a neuron with some probability. It is known to help generalization by preventing co-adaptation between neurons; that is, a neuron should now depend less on its peers, forcing it to learn a bit more. BatchNormalization is a very recent development. It is known to speed up convergence by normalizing the outputs of a layer to a unit Gaussian using the statistics of a batch. Let's use this model and train it. In the interest of brevity, I'll use these constructs directly.
The code describing these constructs is in datasets/gtsrb.lua. DataGen:trainGenerator(batchSize) DataGen:valGenerator(batchSize) These provide iterators over batches of train and test data respectively. You'll find that the model code (models/vgg_small.lua) in the repo is different. It is designed to allow you to experiment quickly. Using optim to train the model Using a stochastic gradient descent (sgd) from the optim package to minimize a function f looks like this: optim.sgd(feval, params, optimState) Where: feval: A user-defined function that respects the API: f, df/params = feval(params) params: The current parameter vector (a 1D torch.Tensor) optimState: A table of parameters, and state variables, dependent upon the algorithm Since we are optimizing the loss of the neural network, parameters should be the weights and other parameters of the network. We get these as a flattened 1D tensor using model:getParameters. It also returns a tensor containing the gradients of these parameters. This is useful in creating the feval function above. model = createModel() criterion = nn.ClassNLLCriterion() -- criterion we are optimizing: negative log loss params, gradParams = model:getParameters() local function feval() -- criterion.output stores the latest output of criterion return criterion.output, gradParams end We need to create an optimState table and initialize it with a configuration of our optimizer like learning rate and momentum: optimState = { learningRate = 0.01, momentum = 0.9, dampening = 0.0, nesterov = true, } Now, an update to the model should do the following: Compute the output of the model using model:forward(). Compute the loss and the gradients at output layer using criterion:forward() and criterion:backward() respectively. Update the gradients of the model parameters using model:backward(). Update the model using optim.sgd. -- Forward pass output = model:forward(input) loss = criterion:forward(output, target) -- Backward pass critGrad = criterion:backward(output, target) model:backward(input, critGrad) -- Updates optim.sgd(feval, params, optimState) Note: The order above should be respected, as backward assumes forward was run just before it. Changing this order might result in gradients not being computed correctly. Putting it all together Let's put it all together and write a function that trains the model for an epoch. We'll create a loop that iterates over the train data in batches and updates the model. 
model = createModel() criterion = nn.ClassNLLCriterion() dataGen = DataGen('datasets/GTSRB/') -- Data generator params, gradParams = model:getParameters() batchSize = 32 optimState = { learningRate = 0.01, momentum = 0.9, dampening = 0.0, nesterov = true, } function train() -- Dropout and BN behave differently during training and testing -- So, switch to training mode model:training() local function feval() return criterion.output, gradParams end for input, target in dataGen:trainGenerator(batchSize) do -- Forward pass local output = model:forward(input) local loss = criterion:forward(output, target) -- Backward pass model:zeroGradParameters() -- clear grads from previous update local critGrad = criterion:backward(output, target) model:backward(input, critGrad) -- Updates optim.sgd(feval, params, optimState) end end The test function is extremely similar, except that we don't need to update the parameters: confusion = optim.ConfusionMatrix(nbClasses) -- to calculate accuracies function test() model:evaluate() -- switch to evaluate mode confusion:zero() -- clear confusion matrix for input, target in dataGen:valGenerator(batchSize) do local output = model:forward(input) confusion:batchAdd(output, target) end confusion:updateValids() local test_acc = confusion.totalValid * 100 print(('Test accuracy: %.2f'):format(test_acc)) end Now that everything is set, you can train your network and print the test accuracies: max_epoch = 20 for i = 1,20 do train() test() end An epoch takes around 30 seconds on a TitanX and gives about 97.7% accuracy after 20 epochs. This is a very basic model and honestly I haven't tried optimizing the parameters much. There are a lot of things that can be done to crank up the accuracies. Try different processing procedures. Experiment with the net structure. Different weight initializations, and learning rate schedules. An Ensemble of different models; for example, train multiple models and take a majority vote. You can have a look at the state of the art on this dataset here. They achieve upwards of 99.5% accuracy using a clever method to boost the geometric variation of CNNs. Conclusion We looked at how to build a basic mlp in Torch. We then moved on to building a Convolutional Neural Network and trained it to solve a real-world problem of traffic sign recognition. For a beginner, Torch/LUA might not be as easy. But once you get a hang of it, you have access to a deep learning framework which is very flexible yet fast. You will be able to easily reproduce latest research or try new stuff unlike in rigid frameworks like keras or nolearn. I encourage you to give it a fair try if you are going anywhere near deep learning. Resources Torch Cheat Sheet Awesome Torch Torch Blog Facebook's Resnet Code Oxford's ML Course Practicals Learn torch from Github repos About the author Preetham Sreenivas is a data scientist at Fractal Analytics. Prior to that, he was a software engineer at Directi.

Unsupervised Learning

Packt
28 Sep 2016
11 min read
In this article by Bastiaan Sjardin, Luca Massaron, and Alberto Boschetti, the authors of the book Large Scale Machine Learning with Python, we will try to create new features and variables at scale in the observation matrix. We will introduce the unsupervised methods and illustrate principal component analysis (PCA)—an effective way to reduce the number of features. (For more resources related to this topic, see here.)

Unsupervised methods

Unsupervised learning is a branch of machine learning whose algorithms reveal inferences from data without an explicit label (unlabeled data). The goal of such techniques is to extract hidden patterns and group similar data. In these algorithms, the unknown parameters of interest of each observation (the group membership and topic composition, for instance) are often modeled as latent variables (or a series of hidden variables), hidden in the system of observed variables that cannot be observed directly, but only deduced from the past and present outputs of the system. Typically, the output of the system contains noise, which makes this operation harder. In common problems, unsupervised methods are used in two main situations:

With labeled datasets, to extract additional features to be processed by the classifier/regressor down the processing chain. Enhanced by additional features, they may perform better.
With labeled or unlabeled datasets, to extract some information about the structure of the data. This class of algorithms is commonly used during the Exploratory Data Analysis (EDA) phase of the modeling.

First of all, before starting with our illustration, let's import the modules that will be necessary throughout this article in our notebook:

In : import matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import pylab
%matplotlib inline
import matplotlib.cm as cm
import copy
import tempfile
import os

Feature decomposition – PCA

PCA is an algorithm commonly used to decompose the dimensions of an input signal and keep just the principal ones. From a mathematical perspective, PCA performs an orthogonal transformation of the observation matrix, outputting a set of linearly uncorrelated variables, named principal components. The output variables form a basis set, where each component is orthonormal to the others. Also, it's possible to rank the output components (in order to use just the principal ones), as the first component is the one containing the largest possible variance of the input dataset, the second is orthogonal to the first (by definition) and contains the largest possible variance of the residual signal, the third is orthogonal to the first two and contains the largest possible variance of the residual signal, and so on.

A generic transformation with PCA can be expressed as a projection to a space. If just the principal components are taken from the transformation basis, the output space will have a smaller dimensionality than the input one. Mathematically, it can be expressed as follows:

Y = T · X

Here, X is a generic point of the training set of dimension N, T is the transformation matrix coming from PCA, and Y is the output vector. Note that the · symbol indicates a dot product in this matrix equation. From a practical perspective, also note that all the features of X must be zero-centered before doing this operation. Let's now start with a practical example (a from-scratch sketch of the projection just described appears below, before the scikit-learn version); later, we will explain the math behind PCA in depth.
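Before the scikit-learn example, here is a minimal from-scratch sketch of the projection just described. This is not code from the book: the toy data and variable names are my own, and it uses a plain eigen-decomposition of the covariance matrix purely to make the formula Y = T · X concrete.

# Minimal from-scratch sketch of the PCA projection Y = T . X (illustrative only;
# the toy data and variable names here are my own, not the book's code).
import numpy as np

rng = np.random.RandomState(101)
X = rng.randn(200, 3) @ np.array([[2.0, 0.0, 0.0],
                                  [0.5, 1.0, 0.0],
                                  [0.0, 0.3, 0.2]])   # correlated toy data

X_centered = X - X.mean(axis=0)           # zero-center every feature first
cov = np.cov(X_centered, rowvar=False)    # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)    # eigen-decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]         # rank components by explained variance
T = eigvecs[:, order].T                   # rows of T are the principal components

Y = X_centered @ T.T                      # the projection Y = T . X for every point
print(Y.shape)                            # (200, 3): same data, decorrelated axes

Each row of T plays the role of one principal component; scikit-learn's PCA stores the equivalent in its components_ attribute.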
In this example, we will create a dummy dataset composed of two blobs of points—one centered in (-5, 0) and the other one in (5, 5). Let's use PCA to transform the dataset and plot the output compared to the input. In this simple example, we will use all the features, that is, we will not perform feature reduction:

In:from sklearn.datasets.samples_generator import make_blobs
from sklearn.decomposition import PCA
X, y = make_blobs(n_samples=1000, random_state=101, centers=[[-5, 0], [5, 5]])
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
pca_comp = pca.components_.T
test_point = np.matrix([5, -2])
test_point_pca = pca.transform(test_point)
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='none')
plt.quiver(0, 0, pca_comp[:,0], pca_comp[:,1], width=0.02, scale=5, color='orange')
plt.plot(test_point[0, 0], test_point[0, 1], 'o')
plt.title('Input dataset')
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, edgecolors='none')
plt.plot(test_point_pca[0, 0], test_point_pca[0, 1], 'o')
plt.title('After "lossless" PCA')
plt.show()

As you can see, the output is more organized than the original feature space and, if the next task is a classification, it would require just one feature of the dataset, saving almost 50% of the space and computation needed. In the image, you can clearly see the core of PCA: it's just a projection of the input dataset onto the transformation basis drawn in orange in the image on the left. Are you unsure about this? Let's test it:

In:print "The blue point is in", test_point[0, :]
print "After the transformation is in", test_point_pca[0, :]
print "Since (X-MEAN) * PCA_MATRIX = ", np.dot(test_point - pca.mean_, pca_comp)
Out:The blue point is in [[ 5 -2]]
After the transformation is in [-2.34969911 -6.2575445 ]
Since (X-MEAN) * PCA_MATRIX = [[-2.34969911 -6.2575445 ]]

Now, let's dig into the core problem: how is it possible to generate T from the training set? It should contain orthonormal vectors, and the vectors should be ranked according to the quantity of variance (that is, the energy or information carried by the observation matrix) that they can explain. Many solutions have been implemented, but the most common implementation is based on Singular Value Decomposition (SVD). SVD is a technique that decomposes any matrix M into three matrices (U, Σ, and W) with special properties and whose multiplication gives back M again:

M = U Σ W^T

Specifically, given M, a matrix of m rows and n columns, the resulting elements of the equivalence are as follows:

U is an m x m matrix (square matrix); it's unitary, and its columns form an orthonormal basis. They're named left singular vectors, or input singular vectors, and they're the eigenvectors of the matrix product M M^T.
Σ is an m x n matrix, which has non-zero elements only on its diagonal. These values are named singular values, are all non-negative, and are the square roots of the eigenvalues of both M^T M and M M^T.
W is a unitary n x n matrix (square matrix); its columns form an orthonormal basis, and they're named right (or output) singular vectors. They are the eigenvectors of the matrix product M^T M.

Why is this needed? The solution is pretty easy: the goal of PCA is to estimate the directions along which the variance of the input dataset is largest. For this, we first need to remove the mean from each feature and then operate on the covariance matrix X^T X (a short numpy check of this follows below).
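Before moving on, here is a quick numpy check, not from the book, that ties the SVD story to the PCA object we just fitted on the blob data; it assumes X and pca from the snippet above are still in scope.

# Quick check that scikit-learn's PCA matches a plain SVD of the zero-centered data.
import numpy as np

Xc = X - X.mean(axis=0)                      # remove the mean from each feature
U, S, Wt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Wt (the right singular vectors) are the principal components found by PCA.
# Signs may be flipped, so compare absolute values.
print(np.allclose(np.abs(Wt), np.abs(pca.components_)))                   # True

# The squared singular values, normalized, give the explained variance ratios.
print(np.allclose(S**2 / np.sum(S**2), pca.explained_variance_ratio_))    # True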
Given that, by decomposing the matrix X with SVD, the columns of the matrix W are the principal components of the covariance (that is, the matrix T we are looking for), the diagonal of Σ contains the singular values, whose squares are proportional to the variance explained by the principal components, and the columns of U (scaled by the singular values) give the projections of the observations onto the principal components. Here's why PCA is always done with SVD. Let's see it now on a real example. Let's test it on the Iris dataset, extracting the first two principal components (that is, passing from a dataset composed of four features to one composed of two):

In:from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
print "Iris dataset contains", X.shape[1], "features"
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print "After PCA, it contains", X_pca.shape[1], "features"
print "The variance is [% of original]:", sum(pca.explained_variance_ratio_)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, edgecolors='none')
plt.title('First 2 principal components of Iris dataset')
plt.show()
Out:Iris dataset contains 4 features
After PCA, it contains 2 features
The variance is [% of original]: 0.977631775025

This is the analysis of the outputs of the process:

The explained variance is almost 98% of the original variance of the input.
The number of features has been halved, but only 2% of the information is not in the output, hopefully just noise.
From a visual inspection, it seems that the different classes composing the Iris dataset are separated from each other. This means that a classifier working on such a reduced set will have comparable performance in terms of accuracy, but will be faster to train and run prediction.

As a proof of the second point, let's now try to train and test two classifiers, one using the original dataset and another using the reduced set, and print their accuracy:

In:from sklearn.linear_model import SGDClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
def test_classification_accuracy(X_in, y_in):
    X_train, X_test, y_train, y_test = train_test_split(X_in, y_in, random_state=101, train_size=0.50)
    clf = SGDClassifier('log', random_state=101)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
print "SGDClassifier accuracy on Iris set:", test_classification_accuracy(X, y)
print "SGDClassifier accuracy on Iris set after PCA (2 components):", test_classification_accuracy(X_pca, y)
Out:SGDClassifier accuracy on Iris set: 0.586666666667
SGDClassifier accuracy on Iris set after PCA (2 components): 0.72

As you can see, this technique not only reduces the complexity and space of the learner down the chain, but also helps achieve generalization (exactly as a Ridge or Lasso regularization does). Now, if you are unsure how many components should be in the output, as a rule of thumb choose the minimum number that is able to explain at least 90% (or 95%) of the input variance (a short sketch of this rule follows below). Empirically, such a choice usually ensures that only the noise is cut off. So far, everything seems perfect: we found a great solution to reduce the number of features, building some with very high predictive power, and we also have a rule of thumb to guess the right number of them. Let's now check how scalable this solution is: we're investigating how it scales when the number of observations and features increases.
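Before looking at scalability, here is a quick sketch of the 90-95% rule of thumb just mentioned. It is my own illustration rather than book code, and it refits a full PCA on the Iris data so that the snippet is self-contained.

# Keep the smallest number of components whose cumulative explained variance ratio
# reaches a chosen threshold. Names and threshold are illustrative, not from the book.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca_full = PCA(n_components=None).fit(X)     # full decomposition, no reduction yet

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1   # first index reaching 95%

print(cumulative)       # roughly [0.92, 0.98, 0.99, 1.00] for Iris
print(n_components)     # 2 components are enough to keep at least 95% of the variance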
The first thing to note is that the SVD algorithm, the core piece of PCA, is not stochastic; therefore, it needs the whole matrix in order to be able to extract its principal components. Now, let's see how scalable PCA is in practice on some synthetic datasets with an increasing number of features and observations. We will perform a full (lossless) decomposition (the augment while instantiating the object PCA is None), as asking for a lower number of features doesn't impact the performance (it's just a matter of slicing the output matrixes of SVD). In the following code, we first create matrices with 10 thousand points and 20, 50, 100, 250, 1,000, and 2,500 features to be processed by PCA. Then, we create matrixes with 100 features and 1, 5, 10, 25, 50, and 100 thousands observations to be processed with PCA: In:import time def check_scalability(test_pca): pylab.rcParams['figure.figsize'] = (10, 4) # FEATURES n_points = 10000 n_features = [20, 50, 100, 250, 500, 1000, 2500] time_results = [] for n_feature in n_features: X, _ = make_blobs(n_points, n_features=n_feature, random_state=101) pca = copy.deepcopy(test_pca) tik = time.time() pca.fit(X) time_results.append(time.time()-tik) plt.subplot(1, 2, 1) plt.plot(n_features, time_results, 'o--') plt.title('Feature scalability') plt.xlabel('Num. of features') plt.ylabel('Training time [s]') # OBSERVATIONS n_features = 100 n_observations = [1000, 5000, 10000, 25000, 50000, 100000] time_results = [] for n_points in n_observations: X, _ = make_blobs(n_points, n_features=n_features, random_state=101) pca = copy.deepcopy(test_pca) tik = time.time() pca.fit(X) time_results.append(time.time()-tik) plt.subplot(1, 2, 2) plt.plot(n_observations, time_results, 'o--') plt.title('Observations scalability') plt.xlabel('Num. of training observations') plt.ylabel('Training time [s]') plt.show() check_scalability(PCA(None)) Out: As you can clearly see, PCA based on SVD is not scalable: if the number of features increases linearly, the time needed to train the algorithm increases exponentially. Also, the time needed to process a matrix with a few hundred observations becomes too high and (not shown in the image) the memory consumption makes the problem unfeasible for a domestic computer (with 16 or less GB of RAM).It seems clear that a PCA based on SVD is not the solution for big data: fortunately, in the recent years, many workarounds have been introduced. Summary In this article, we've introduced a popular unsupervised learner able to scale to cope with big data. PCA is able to reduce the number of features by creating ones containing the majority of variance (that is, the principal ones). You can also refer the following books on the similar topics: R Machine Learning Essentials: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-essentials R Machine Learning By Example: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-example Machine Learning with R - Second Edition: https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition Resources for Article: Further resources on this subject: Machine Learning Tasks [article] Introduction to Clustering and Unsupervised Learning [article] Clustering and Other Unsupervised Learning Methods [article]

Deep Learning and Image generation: Get Started with Generative Adversarial Networks

Mohammad Pezeshki
27 Sep 2016
5 min read
In machine learning, a generative model is one that captures the observable data distribution. The objective of deep neural generative models is to disentangle different factors of variation in data and be able to generate new or similar-looking samples of the data. For example, an ideal generative model on face images disentangles all different factors of variation such as illumination, pose, gender, skin color, and so on, and is also able to generate a new face by the combination of those factors in a very non-linear way. Figure 1 shows a trained generative model that has learned different factors, including pose and the degree of smiling. On the x-axis, as we go to the right, the pose changes, and on the y-axis, as we move upwards, smiles turn to frowns. Usually these factors are orthogonal to one another, meaning that changing one while keeping the others fixed leads to a single change in data space; for example, in the first row of Figure 1, only the pose changes, with no change in the degree of smiling. The figure is adapted from here.

Based on the assumption that these underlying factors of variation have a very simple distribution (unlike the data itself), to generate a new face, we can simply sample a random number from the assumed simple distribution (such as a Gaussian). In other words, if there are k different factors, we randomly sample from a k-dimensional Gaussian distribution (aka noise). In this post, we will take a look at one of the recent models in the area of deep learning and generative models, called the generative adversarial network. This model can be seen as a game between two agents: the Generator and the Discriminator. The generator generates images from noise and the discriminator discriminates between real images and those images which are generated by the generator. The objective is then to train the model such that while the discriminator tries to distinguish generated images from real images, the generator tries to fool the discriminator.

To train the model, we need to define a cost. In the case of the GAN, the errors made by the discriminator are considered as the cost. Consequently, the objective of the discriminator is to minimize the cost, while the objective of the generator is to fool the discriminator by maximizing the cost. A graphical illustration of the model is shown in Figure 2.

Formally, we define the discriminator as a binary classifier D : R^m -> {0, 1} and the generator as the mapping G : R^k -> R^m, in which k is the dimension of the latent space that represents all of the factors of variation. Denoting the data by x and a point in the latent space by z, the model can be trained by playing the following minimax game:

min_G max_D V(D, G) = E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 - D(G(z)))]

Note that the first term encourages the discriminator to discriminate between generated images and real ones, while the second term encourages the generator to come up with images that would fool the discriminator. In practice, the log in the second term can saturate, which would hurt the flow of the gradient. As a result, the generator's cost may be reformulated so that, instead of minimizing log(1 - D(G(z))), the generator maximizes log D(G(z)):

max_G E_{z ~ p_z}[log D(G(z))]

At the time of generation, we can sample from a simple k-dimensional Gaussian distribution with zero mean and unit variance and pass it onto the generator. Among the different models that can be used as the discriminator and generator, we use deep neural networks with parameters θ_D and θ_G for the discriminator and generator, respectively. (A small numeric sketch of this objective follows below.)
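To make the value function concrete, here is a tiny numpy sketch that is not from the original post: it fakes discriminator outputs for one batch and evaluates both the minimax value and the non-saturating generator objective. Everything else (the networks, the sampling of z, and backpropagation) is left out.

# Toy numeric sketch of the two objectives above. D_out_real / D_out_fake stand in for
# discriminator outputs in (0, 1); in a real GAN these would come from networks.
import numpy as np

rng = np.random.RandomState(0)
eps = 1e-8

D_out_real = rng.uniform(0.6, 0.99, size=64)   # D(x) on a batch of real images
D_out_fake = rng.uniform(0.01, 0.4, size=64)   # D(G(z)) on a batch of generated images

# Value of the minimax game V(D, G): the discriminator wants this large,
# the generator wants it small.
V = np.mean(np.log(D_out_real + eps)) + np.mean(np.log(1.0 - D_out_fake + eps))

# Non-saturating generator objective: maximize log D(G(z)) instead of
# minimizing log(1 - D(G(z))).
gen_objective = np.mean(np.log(D_out_fake + eps))

print("V(D, G) = %.3f" % V)
print("non-saturating generator objective = %.3f" % gen_objective)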
Since the training boils down to updating the parameters using the backpropagation algorithm, the update rule amounts to simultaneous gradient ascent on the discriminator's parameters and gradient descent on the generator's parameters, with a learning rate α:

θ_D ← θ_D + α ∇_{θ_D} V(D, G)
θ_G ← θ_G − α ∇_{θ_G} V(D, G)

If we use a convolutional network as the discriminator and another convolutional network with fractionally strided convolution layers as the generator, the model is called DCGAN (Deep Convolutional Generative Adversarial Network). Some samples of bedroom image generation from this model are shown in Figure 3.

The generator can also be a sequential model, meaning that it can generate an image through a sequence of lower-resolution images, adding details at each step. A few examples of the images generated using such a model are shown in Figure 4. The GAN and later variants such as the DCGAN are currently considered to be among the best when it comes to the quality of the generated samples. The images look so realistic that you might assume that the model has simply memorized instances of the training set, but a quick KNN search reveals this not to be the case.

About the author

Mohammad Pezeshki is a master's student in the LISA lab of Universite de Montreal working under the supervision of Yoshua Bengio and Aaron Courville. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014 and then started his master's in September 2014. His research interests lie in the fields of artificial intelligence, machine learning, probabilistic models and specifically deep learning.

Language Modeling with Deep Learning

Mohammad Pezeshki
23 Sep 2016
5 min read
Language modeling is the task of defining a joint probability distribution over a sequence of tokens (words or characters). Consider a sequence of tokens x_1, ..., x_T. A language model defines P(x_1, ..., x_T), which can be used in many areas of natural language processing. For example, a language model can significantly improve the accuracy of a speech recognition system. In the case of two words that have the same sound but different meanings, a language model can fix the problem of recognizing the right word. In Figure 1, the speech recognizer (aka acoustic model) has assigned the same high probabilities to the words "meet" and "meat". It is even possible that the speech recognizer assigns a higher probability to "meet" rather than "meat". However, by conditioning the language model on the first three tokens ("I cooked some"), the next word could be "fish", "pasta", or "meat" with a probability reasonably higher than the probability of "meet". To get the final answer, we can simply multiply the two tables of probabilities and normalize them. Now the word "meat" has a very high relative probability!

One family of deep learning models that is capable of modeling sequential data (such as language) is Recurrent Neural Networks (RNNs). RNNs have recently achieved impressive results on different problems such as language modeling. In this article, we briefly describe RNNs and demonstrate how to code them using the Blocks library on top of Theano.

Consider a sequence of T input elements x_1, ..., x_T. An RNN models the sequence by applying the same operation in a recursive way. Formally,

h_t = f(h_{t-1}, x_t)    (1)
y_t = g(h_t)             (2)

where h_t is the internal hidden representation of the RNN and y_t is the output at the t-th time-step. For the very first time-step, we also have an initial state h_0. f and g are two functions, which are shared across the time axis. In the simplest case, f and g can be a linear transformation followed by a non-linearity. There are more complicated forms of f and g, such as the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). Here we skip the exact formulations of f and g and use the LSTM as a black box. Consequently, suppose we have B sequences, each with a length of T, such that each time-step is represented by a vector of size F. The input can then be seen as a 3D tensor of size T x B x F, the hidden representation as a tensor of size T x B x F', and the output as a tensor of size T x B x F''.

Let's build a character-level language model that models the joint probability P(x_1, ..., x_T) using the chain rule:

P(x_1, ..., x_T) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_T | x_1, ..., x_{T-1})    (3)

We can model P(x_t | x_1, ..., x_{t-1}) using an RNN by predicting x_t given x_1, ..., x_{t-1}. In other words, given a sequence {x_1, ..., x_T}, the input sequence is {x_1, ..., x_{T-1}} and the target sequence is {x_2, ..., x_T}. The input and target are defined with a short piece of Blocks code (the snippet is not reproduced here; the complete code is linked at the end of this post). Now to define the model, we need a linear transformation from the input to the LSTM, and from the LSTM to the output. To train the model, we use the cross entropy between the model output and the true target. Assuming that the data is provided to us via a data stream, we can start training by initializing the model and tuning the parameters. After the model is trained, we can condition the model on an initial sequence and start generating the next token (a sketch of such a sampling loop is shown below).
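As a hedged illustration of the conditioning-and-sampling idea (the original post does this with Blocks and Theano; the function and names below are placeholders I made up, not the post's API), the generation loop looks like this:

# Illustrative sampling loop. `next_char_probs` stands in for a trained model: it must
# return a probability distribution over the vocabulary given the characters so far.
import numpy as np

vocab = list("abcdefghijklmnopqrstuvwxyz .")
rng = np.random.RandomState(0)

def next_char_probs(prefix):
    """Placeholder for P(x_t | x_1..x_{t-1}); a real model would be an RNN/LSTM."""
    logits = rng.randn(len(vocab))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(prefix, n_steps=50):
    out = list(prefix)
    for _ in range(n_steps):
        p = next_char_probs(out)
        idx = rng.choice(len(vocab), p=p)   # sample the next character
        out.append(vocab[idx])              # feed it back in at the next step
    return "".join(out)

print(generate("the model "))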
We can repeatedly feed the predicted token back into the model and get the next token. We can even just start from the initial state and ask the model to hallucinate! Here is a sample of generated text from a model trained on 96 MB of Wikipedia text (figure adapted from here). Here is a visualization of the model's output. The first line is the real data and the next six lines are the candidates with the highest output probability for each character. The more red a cell is, the higher the probability the model assigns to that character. For example, as soon as the model sees "ttp://ww", it is confident that the next character is also a "w" and the next one is a ".". But at this point, there is no more clue about the next character, so the model assigns almost the same probability to all the characters (figure adapted from here).

In this post we learned about language modeling and one of its applications in speech recognition. We also learned how to code a recurrent neural network in order to train such a model. You can find the complete code and experiments on a bunch of datasets, such as Wikipedia, on GitHub. The code is written by my close friend Eloi Zablocki and me.

About the author

Mohammad Pezeshki is a master's student in the LISA lab of Universite de Montreal working under the supervision of Yoshua Bengio and Aaron Courville. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014 and then started his master's in September 2014. His research interests lie in the fields of artificial intelligence, machine learning, probabilistic models and specifically deep learning.

Indexes and Constraints

Packt
06 Sep 2016
14 min read
In this article by Manpreet Kaur and Baji Shaik, authors of the book PostgreSQL Development Essentials, we will discuss indexes and constraints, types of indexes and constraints, their use in the real world, and the best practices on when we need to and how to create them. Your application may have different kinds of data, and you will want to maintain data integrity across the database and, at the same time, you need a performance gain as well. This article helps you understand how best you can choose indexes and constraints for your requirement and improve the performance. It also covers real-time examples that will help you understand better. Of course, not all types of indexes or constraints are the best fit for your requirement; however, you can choose the required ones based on how they work. (For more resources related to this topic, see here.) Introduction to indexes and constraints An index is a pointer to the actual rows in its corresponding table. It is used to find and retrieve particular rows much faster than using the standard method. Indexes help you improve the performance of queries. However, indexes get updated on every Data Manipulation Language (DML)—that is, INSERT, UPDATE, and DELETE—query on the table, which is an overhead, so they should be used carefully. Constraints are basically rules restricting the values allowed in the columns and they define certain properties that data in a database must comply with. The purpose of constraints is to maintain the integrity of data in the database. Primary-key indexes As the name indicates, primary key indexes are the primary way to identify a record (tuple) in a table. Obviously, it cannot be null because a null (unknown) value cannot be used to identify a record. So, all RDBMSs prevent users from assigning a null value to the primary key. The primary key index is used to maintain uniqueness in a column. You can have only one primary key index on a table. It can be declared on multiple columns. In the real world, for example, if you take the empid column of the emp table, you will be able to see a primary key index on that as no two employees can have the same empid value. You can add a primary key index in two ways: One is while creating the table and the other, once the table has been created. This is how you can add a primary key while creating a table: CREATE TABLE emp( empid integer PRIMARY KEY, empname varchar, sal numeric); And this is how you can add a primary key after a table has been created: CREATE TABLE emp( empid integer, empname varchar, sal numeric); ALTER TABLE emp ADD PRIMARY KEY(empid); Irrespective of whether a primary key is created while creating a table or after the table is created, a unique index will be created implicitly. You can check unique index through the following command: postgres=# select * from pg_indexes where tablename='emp'; -[ RECORD 1 ]------------------------------------------------------- schemaname | public tablename | emp indexname | emp_pkey tablespace | indexdef | CREATE UNIQUE INDEX emp_pkey ON emp USING btree (empid) Since it maintains uniqueness in the column, what happens if you try to INSERT a duplicate row? And try to INSERT NULL values? Let's check it out: postgres=# INSERT INTO emp VALUES(100, 'SCOTT', '10000.00'); INSERT 0 1 postgres=# INSERT INTO emp VALUES(100, 'ROBERT', '20000.00'); ERROR: duplicate key value violates unique constraint "emp_pkey" DETAIL: Key (empid)=(100) already exists. 
postgres=# INSERT INTO emp VALUES(null, 'ROBERT', '20000.00'); ERROR: null value in column "empid" violates not-null constraint DETAIL: Failing row contains (null, ROBERT, 20000). So, if empid is a duplicate value, database throws an error as duplicate key violation due to unique constraint and if empid is null, error is violates null constraint due to not-null constraint A primary key is simply a combination of a unique and a not-null constraint. Unique indexes Like a primary key index, a unique index is also used to maintain uniqueness; however, it allows NULL values. The syntax is as follows: CREATE TABLE emp( empid integer UNIQUE, empname varchar, sal numeric); CREATE TABLE emp( empid integer, empname varchar, sal numeric, UNIQUE(empid)); This is what happens if you INSERT NULL values: postgres=# INSERT INTO emp VALUES(100, 'SCOTT', '10000.00'); INSERT 0 1 postgres=# INSERT INTO emp VALUES(null, 'SCOTT', '10000.00'); INSERT 0 1 postgres=# INSERT INTO emp VALUES(null, 'SCOTT', '10000.00'); INSERT 0 1 As you see, it allows NULL values, and they are not even considered as duplicate values. If a unique index is created on a column then there is no need of a standard index on the column. If you do so, it would just be a duplicate of the automatically created index. Currently, only B-Tree indexes can be declared unique. When a primary key is defined, PostgreSQL automatically creates a unique index. You check it out using the following query: postgres=# select * from pg_indexes where tablename ='emp'; -[ RECORD 1 ]----------------------------------------------------- schemaname | public tablename | emp indexname | emp_empid_key tablespace | indexdef | CREATE UNIQUE INDEX emp_empid_key ON emp USING btree (empid) Standard indexes Indexes are primarily used to enhance database performance (though incorrect use can result in slower performance). An index can be created on multiple columns and multiple indexes can be created on one table. The syntax is as follows: CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table_name [ USING method ] ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ) [ WITH ( storage_parameter = value [, ... ] ) ] [ TABLESPACE tablespace_name ] [ WHERE predicate ] PostgreSQL supports B-Tree, hash, Generalized Search Tree (GiST), SP-GiST, and Generalized Inverted Index (GIN), which we will cover later in this article. If you do not specify any index type while creating it creates a B-Tree index as default. Full text indexes Let's talk about a document, for example, a magazine article or an e-mail message. Normally, we use a text field to store a document. Searching for content within the document based on a matching pattern is called full text search. It is based on the @@ matching operator. You can check out http://www.postgresql.org/docs/current/static/textsearch-intro.html#TEXTSEARCH-MATCHING for more details. Indexes on such full text fields are nothing but full text indexes. You can create a GIN index to speed up full text searches. We will cover the GIN index later in this article. The syntax is as follows: CREATE TABLE web_page( title text, heading text, body text); CREATE INDEX web_page_title_body ON web_page USING GIN(to_tsvector('english', body)); The preceding commands create a full text index on the body column of a web_page table. Partial indexes Partial indexes are one of the special features of PostgreSQL. Such indexes are not supported by many other RDBMSs. 
As the name suggests, if an index is created partially, which typically means the subset of a table, then it's a partial index. This subset is defined by a predicate (WHERE clause). Its main purpose is to avoid common values. An index will essentially ignore a value that appears in a large fraction of a table's rows and the search will revert o a full table scan rather than an index scan. Therefore, indexing repeated values just wastes space and incurs the expense of the index updating without getting any performance benefit back at read time. So, common values should be avoided. Let's take a common example, an emp table. Suppose you have a status column in an emp table that shows whether emp exists or not. In any case, you care about the current employees of an organization. In such cases, you can create a partial index in the status column by avoiding former employees. Here is an example: CREATE TABLE emp_table (empid integer, empname varchar, status varchar); INSERT INTO emp_table VALUES(100, 'scott', 'exists'); INSERT INTO emp_table VALUES(100, 'clark', 'exists'); INSERT INTO emp_table VALUES(100, 'david', 'not exists'); INSERT INTO emp_table VALUES(100, 'hans', 'not exists'); To create a partial index that suits our example, we will use the following query CREATE INDEX emp_table_status_idx ON emp_table(status) WHERE status NOT IN('not exists'); Now, let's check the queries that can use the index and those that cannot. A query that uses index is as follows: postgres=# explain analyze select * from emp_table where status='exists'; QUERY PLAN ------------------------------------------------------------------- Index Scan using emp_table_status_idx on emp_table(cost=0.13..6.16 rows=2 width=17) (actual time=0.073..0.075 rows=2 loops=1) Index Cond: ((status)::text = 'exists'::text) A query that will not use the index is as follows: postgres=# explain analyze select * from emp_table where status='not exists'; QUERY PLAN ----------------------------------------------------------------- Seq Scan on emp_table (cost=10000000000.00..10000000001.05 rows=1 width=17) (actual time=0.013..0.014 rows=1 loops=1) Filter: ((status)::text = 'not exists'::text) Rows Removed by Filter: 3 Multicolumn indexes PostgreSQL supports multicolumn indexes. If an index is defined simultaneously in more than one column then it is treated as a multicolumn index. The use case is pretty simple, for example, you have a website where you need a name and date of birth to fetch the required information, and then the query run against the database uses both fields as the predicate(WHERE clause). In such scenarios, you can create an index in both the columns. 
Here is an example: CREATE TABLE get_info(name varchar, dob date, email varchar); INSERT INTO get_info VALUES('scott', '1-1-1971', '[email protected]'); INSERT INTO get_info VALUES('clark', '1-10-1975', '[email protected]'); INSERT INTO get_info VALUES('david', '11-11-1971', '[email protected]'); INSERT INTO get_info VALUES('hans', '12-12-1971', '[email protected]'); To create a multicolumn index, we will use the following command: CREATE INDEX get_info_name_dob_idx ON get_info(name, dob); A query that uses index is as follows: postgres=# explain analyze SELECT * FROM get_info WHERE name='scott' AND dob='1-1-1971'; QUERY PLAN -------------------------------------------------------------------- Index Scan using get_info_name_dob_idx on get_info (cost=0.13..4.15 rows=1 width=68) (actual time=0.029..0.031 rows=1 loops=1) Index Cond: (((name)::text = 'scott'::text) AND (dob = '1971-01-01'::date)) Planning time: 0.124 ms Execution time: 0.096 ms B-Tree indexes Like most of the relational databases, PostgreSQL also supports B-Tree indexes. Most of the RDBMS systems use B-Tree as the default index type, unless something else is specified explicitly. Basically, this index keeps data stored in a tree (self-balancing) structure. It's a default index in PostgreSQL and fits in the most common situations. The B-Tree index can be used by an optimizer whenever the indexed column is used with a comparison operator, such as<, <=, =, >=, >, and LIKE or ~ operator; however, LIKE or ~ will only be used if the pattern is a constant and anchored to the beginning of the string, for example, my_col LIKE 'mystring%' or my_column ~ '^mystring', but not my_column LIKE '%mystring'. Here is an example: CREATE TABLE emp( empid integer, empname varchar, sal numeric); INSERT INTO emp VALUES(100, 'scott', '10000.00'); INSERT INTO emp VALUES(100, 'clark', '20000.00'); INSERT INTO emp VALUES(100, 'david', '30000.00'); INSERT INTO emp VALUES(100, 'hans', '40000.00'); Create a B-Tree index on the empname column: CREATE INDEX emp_empid_idx ON emp(empid); CREATE INDEX emp_name_idx ON emp(empname); Here are the queries that use index: postgres=# explain analyze SELECT * FROM emp WHERE empid=100; QUERY PLAN --------------------------------------------------------------- Index Scan using emp_empid_idx on emp (cost=0.13..4.15 rows=1 width=68) (actual time=1.015..1.304 rows=4 loops=1) Index Cond: (empid = 100) Planning time: 0.496 ms Execution time: 2.096 ms postgres=# explain analyze SELECT * FROM emp WHERE empname LIKE 'da%'; QUERY PLAN ---------------------------------------------------------------- Index Scan using emp_name_idx on emp (cost=0.13..4.15 rows=1 width=68) (actual time=0.956..0.959 rows=1 loops=1) Index Cond: (((empname)::text >= 'david'::text) AND ((empname)::text < 'david'::text)) Filter: ((empname)::text ~~ 'david%'::text) Planning time: 2.285 ms Execution time: 0.998 ms Here is a query that cannot use index as % is used at the beginning of the string: postgres=# explain analyze SELECT * FROM emp WHERE empname LIKE '%david'; QUERY PLAN --------------------------------------------------------------- Seq Scan on emp (cost=10000000000.00..10000000001.05 rows=1 width=68) (actual time=0.014..0.015 rows=1 loops=1) Filter: ((empname)::text ~~ '%david'::text) Rows Removed by Filter: 3 Planning time: 0.099 ms Execution time: 0.044 ms Hash indexes These indexes can only be used with equality comparisons. So, an optimizer will consider using this index whenever an indexed column is used with = operator. 
Here is the syntax: CREATE INDEX index_name ON table USING HASH (column); Hash indexes are faster than B-Tree as they should be used if the column in question is never intended to be scanned comparatively with < or > operators. The Hash indexes are not WAL-logged, so they might need to be rebuilt after a database crash, if there were any unwritten changes. GIN and GiST indexes GIN or GiST indexes are used for full text searches. GIN can only be created on the tsvector datatype columns and GIST on tsvector or tsquery database columns. The syntax is follows: CREATE INDEX index_name ON table_name USING GIN (column_name); CREATE INDEX index_name ON table_name USING GIST (column_name); These indexes are useful when a column is queried for specific substrings on a regular basis; however, these are not mandatory. What is the use case and when do we really need it? Let me give an example to explain. Suppose we have a requirement to implement simple search functionality for an application. Say, for example, we have to search through all users in the database. Also, let's assume that we have more than 10 million users currently stored in the database. This search implementation requirement shows that we should be able to search using partial matches and multiple columns, for example, first_name, last_name. More precisely, if we have customers like Mitchell Johnson and John Smith, an input query of John should return for both customers. We can solve this problem using the GIN or GIST indexes. Here is an example: CREATE TABLE customers (first_name text, last_name text); Create GIN/GiST index: CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX customers_search_idx_gin ON customers USING gin (first_name gin_trgm_ops, last_name gin_trgm_ops); CREATE INDEX customers_search_idx_gist ON customers USING gist (first_name gist_trgm_ops, last_name gist_trgm_ops); So, what is the difference between these two indexes? This is what the PostgreSQL documentation says: GiST is faster to update and build the index and is less accurate than GIN GIN is slower to update and build the index but is more accurate As per the documentation, the GiST index is lossy. It means that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when needed). BRIN indexes Block Range Index (BRIN) indexes are introduced in PostgreSQL 9.5. BRIN indexes are designed to handle very large tables in which certain columns have some natural correlation with their physical location within the table. The syntax is follows: CREATE INDEX index_name ON table_name USING brin(col); Here is a good example: https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.5#BRIN_Indexes Summary In this article, we looked at different types of indexes and constraints that PostgreSQL supports, how they are used in the real world with examples, and what happens if you violate a constraint. It not only helps you identify the right index for your data, but also improve the performance of the database. Every type of index or constraint has its own identification and need. Some indexes or constraints may not be portable to other RDBMSs, some may not be needed, or sometimes you might have chosen the wrong one for your need. Resources for Article: Further resources on this subject: PostgreSQL in Action [article] PostgreSQL – New Features [article] PostgreSQL Cookbook - High Availability and Replication [article]

Preprocessing the Data

Packt
16 Aug 2016
5 min read
In this article by Sampath Kumar Kanthala, the author of the book Practical Data Analysis, we will discuss how to obtain, clean, normalize, and transform raw data into a standard format like CSV or JSON using OpenRefine. In this article we will cover:

Data scrubbing
Statistical methods
Text parsing
Data transformation

(For more resources related to this topic, see here.)

Data scrubbing

Scrubbing data, also called data cleansing, is the process of correcting or removing data in a dataset that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated. The result of the data analysis process depends not only on the algorithms, but also on the quality of the data. That's why the next step after obtaining the data is data scrubbing. In order to avoid dirty data, our dataset should possess the following characteristics:

Correctness
Completeness
Accuracy
Consistency
Uniformity

Dirty data can be detected by applying some simple statistical data validation, by parsing the text, or by deleting duplicate values. Missing or sparse data can lead you to highly misleading results.

Statistical methods

In this method, we need some context about the problem (knowledge domain) to find values that are unexpected and thus erroneous. Even if the data type matches but the values are out of range, the problem can be resolved by setting the values to an average (mean) value. Statistical validations can be used to handle missing values, which can be replaced by one or more probable values using interpolation, or by reducing the dataset using decimation.

Mean: The value calculated by summing up all values and then dividing by the number of values.
Median: The value where 50% of the values in a range fall below it and 50% fall above it.
Range constraints: Numbers or dates should fall within a certain range. That is, they have minimum and/or maximum possible values.
Clustering: Usually, when we obtain data directly from the user, some values include ambiguity or refer to the same value with a typo. For example, "Buchanan Deluxe 750ml 12x01" and "Buchanan Deluxe 750ml 12x01.", which differ only by a ".", or "Microsoft" or "MS" instead of "Microsoft Corporation", which all refer to the same company, and all values are valid. In those cases, grouping can help us to get accurate data and eliminate duplicates, enabling a faster identification of unique values.

Text parsing

We perform parsing to validate whether a string of data is well formatted and to avoid syntax errors. Usually, text fields have to be validated this way with regular expression patterns, for example, dates, e-mails, phone numbers, and IP addresses (Regex is a common abbreviation for "regular expression"). In Python, we will use the re module to implement regular expressions. We can perform text searches and pattern validations. First, we need to import the re module.

import re

In the following examples, we will implement three of the most common validations (e-mail, IP address, and date format).

E-mail validation:

myString = 'From: readers@packt.com (readers email)'
result = re.search('([\w.-]+)@([\w.-]+)', myString)
if result:
    print (result.group(0))
    print (result.group(1))
    print (result.group(2))

Output:

>>> readers@packt.com
>>> readers
>>> packt.com

The function search() scans through a string, searching for any location where the Regex matches. The function group() helps us to return the string matched by the Regex. The pattern \w matches any alphanumeric character and is equivalent to the class [a-zA-Z0-9_]. (A short sketch of applying this pattern to a whole column of data follows below.)
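Before moving on to the other patterns, here is a small pandas sketch, my own addition rather than the book's code, showing how such a validation is typically applied across an entire column to flag dirty rows:

# Apply the same e-mail pattern to a whole column with pandas to flag rows that need
# scrubbing. The DataFrame, column name, and values are made up for illustration.
import re
import pandas as pd

df = pd.DataFrame({'contact': ['readers@packt.com',
                               'no-reply@example.org',
                               'not an address',
                               None]})

pattern = re.compile(r'([\w.-]+)@([\w.-]+)')

def is_valid_email(value):
    return bool(pattern.search(value)) if isinstance(value, str) else False

df['valid_email'] = df['contact'].apply(is_valid_email)
print(df[~df['valid_email']])   # the rows we still have to clean or drop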
IP address validation:

isIP = re.compile('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
myString = " Your IP is: 192.168.1.254 "
result = re.findall(isIP,myString)
print(result)

Output:

>>> ['192.168.1.254']

The function findall() finds all the substrings where the Regex matches and returns them as a list. The pattern \d matches any decimal digit and is equivalent to the class [0-9].

Date format:

myString = "01/04/2001"
isDate = re.match('[0-1][0-9]/[0-3][0-9]/[1-2][0-9]{3}', myString)
if isDate:
    print("valid")
else:
    print("invalid")

Output:

>>> 'valid'

The function match() checks whether the Regex matches at the beginning of the string. The pattern uses the class [0-9] in order to parse the date format.

For more information about regular expressions, see: http://docs.python.org/3.4/howto/regex.html#regex-howto

Data transformation

Data transformation is usually related to databases and data warehouses, where values from a source format are extracted, transformed, and loaded into a destination format. Extract, Transform, and Load (ETL) obtains data from data sources, performs some transformation function depending on our data model, and loads the resulting data into the destination. Data extraction allows us to obtain data from multiple data sources, such as relational databases, data streaming, text files (JSON, CSV, XML), and NoSQL databases. Data transformation allows us to cleanse, convert, aggregate, merge, replace, validate, format, and split data. Data loading allows us to load data into the destination format, like relational databases, text files (JSON, CSV, XML), and NoSQL databases. In statistics, data transformation refers to the application of a mathematical function to the dataset or time series points.

Summary

In this article, we introduced the basic concepts of data scrubbing, such as statistical methods and text parsing, and looked at how data transformation fits into the ETL process.

Resources for Article:

Further resources on this subject:

MicroStrategy 10 [article]
Expanding Your Data Mining Toolbox [article]
Machine Learning Using Spark MLlib [article]

Application Logging

Packt
11 Aug 2016
8 min read
In this article by Travis Marlette, author of Splunk Best Practices, will cover the following topics: (For more resources related to this topic, see here.) Log messengers Logging formats Within the working world of technology, there are hundreds of thousands of different applications, all (usually) logging in different formats. As Splunk experts, our job is make all those logs speak human, which is often the impossible task. With third-party applications that provide support, sometimes log formatting is out of our control. Take for instance, Cisco or Juniper, or any other leading application manufacturer. We won't be discussing these kinds of logs in this article, but instead the logs that we do have some control over. The logs I am referencing belong to proprietary in-house (also known as "home grown") applications that are often part of the middle-ware, and usually they control some of the most mission-critical services an organization can provide. Proprietary applications can be written in any language, however logging is usually left up to the developers for troubleshooting and up until now the process of manually scraping log files to troubleshoot quality assurance issues and system outages has been very specific. I mean that usually, the developer(s) are the only people that truly understand what those log messages mean. That being said, oftentimes developers write their logs in a way that they can understand them, because ultimately it will be them doing the troubleshooting / code fixing when something breaks severely. As an IT community, we haven't really started taking a look at the way we log things, but instead we have tried to limit the confusion to developers, and then have them help other SME's that provide operational support to understand what is actually happening. This method is successful, however it is slow, and the true value of any SME is reducing any system’s MTTR, and increasing uptime. With any system, the more transactions processed means the larger the scale of a system, which means that, after about 20 machines, troubleshooting begins to get more complex and time consuming with a manual process. This is where something like Splunk can be extremely valuable, however Splunk is only as good as the information that is coming into it. I will say this phrase for the people who haven't heard it yet; "garbage in… garbage out". There are some ways to turn proprietary logging into a powerful tool, and I have personally seen the value of these kinds of logs, after formatting them for Splunk, they turn into a huge asset in an organization’s software life cycle. I'm not here to tell you this is easy, but I am here to give you some good practices about how to format proprietary logs. To do that I'll start by helping you appreciate a very silent and critical piece of the application stack. To developers, a logging mechanism is a very important part of the stack, and the log itself is mission critical. What we haven't spent much time thinking about before log analyzers, is how to make log events/messages/exceptions more machine friendly so that we can socialize the information in a system like Splunk, and start to bridge the knowledge gap between development and operations. The nicer we format the logs, the faster Splunk can reveal the information about our systems, saving everyone time and from headaches. Loggers Here I'm giving some very high level information on loggers. 
My intention is not to recommend logging tools, but simply to raise awareness of their existence for those that are not in development, and allow for independent research into what they do. With the right developer, and the right Splunker, the logger turns into something immensely valuable to an organization. There is an array of different loggers in the IT universe, and I'm only going to touch on a couple of them here. Keep in mind that I'm only referencing these due to the ease of development I've seen from personal experience, and experiences do vary. I'm only going to touch on three loggers and then move on to formatting, as there are tons of logging mechanisms and the preference truly depends on the developer. Anatomy of a log I'm going to be taking some very broad strokes with the following explanations in order to familiarize you, the Splunker, with the logger. If you would like to learn more information, please either seek out a developer to help you understand the logic better or acquire some education how to develop and log in independent study. There are some pretty basic components to logging that we need to understand to learn which type of data we are looking at. I'll start with the four most common ones: Log events: This is the entirety of the message we see within a log, often starting with a timestamp. The event itself contains all other aspects of application behavior such as fields, exceptions, messages, and so on… think of this as the "container" if you will, for information. Messages: These are often made by the developer of the application and provide some human insight into what's actually happening within an application. The most common messages we see are things like unauthorized login attempt <user> or Connection Timed out to <ip address>. Message Fields: These are the pieces of information that give us the who, where, and when types of information for the application’s actions. They are handed to the logger by the application itself as it either attempts or completes an activity. For instance, in the log event below, the highlighted pieces are what would be fields, and often those that people look for when troubleshooting: "2/19/2011 6:17:46 AM Using 'xplog70.dll' version '2009.100.1600' to execute extended store procedure 'xp_common_1' operation failed to connect to 'DB_XCUTE_STOR'" Exceptions: These are the uncommon, but very important pieces of the log. They are usually only written when something went wrong, and offer developer insight into the root cause at the application layer. They are usually only printed when an error occurs, and used for debugging. These exceptions can print a huge amount of information into the log depending on the developer and the framework. The format itself is not easy and in some cases not even possible for a developer to manage. Log4* This is an open source logger that is often used in middleware applications. Pantheios This is a logger popularly used for Linux, and popular for its performance and multi-threaded handling of logging. Commonly, Pantheios is used for C/C++ applications, but it works with a multitude of frameworks. Logging – logging facility for Python This is a logger specifically for Python, and since Python is becoming more and more popular, this is a very common package used to log Python scripts and applications. Each one of these loggers has their own way of logging, and the value is determined by the application developer. 
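To make the idea of developer-controlled formatting concrete: the closer a proprietary log gets to consistent key=value pairs, the less work Splunk has to do at search time. Here is a hedged sketch, my own illustration rather than something from the book, using the Python logging facility mentioned above; the component and field names are invented.

# Sketch of a Splunk-friendly, key=value event format using Python's standard logging
# module. The service name and fields below are illustrative, not from the book.
import logging

logging.basicConfig(
    filename="app.log",
    level=logging.INFO,
    format="%(asctime)s level=%(levelname)s component=%(name)s %(message)s",
)
log = logging.getLogger("payment-service")

# Keeping the message itself as key=value pairs lets fields be extracted cleanly.
log.info('action=login user="jsmith" src_ip=10.1.2.3 result=failure reason="bad password"')
log.info('action=charge order_id=78123 amount=42.50 currency=USD result=success')

A line like action=login user="jsmith" result=failure can be searched and aggregated in Splunk without writing custom field extractions, which is exactly the kind of formatting decision that sits with the developer.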
If there is no standardized logging, then one can imagine the confusion this can bring to troubleshooting. Example of a structured log This is an example of a Java exception in a structured log format: When Java prints this exception, it will come in this format and a developer doesn't control what that format is. They can control some aspects about what is included within an exception, though the arrangement of the characters and how it's written is done by the Java framework itself. I mention this last part in order to help operational people understand where the control of a developer sometimes ends. My own personal experience has taught me that attempting to change a format that is handled within the framework itself is an attempt at futility. Pick your battles right? As a Splunker, you can save yourself a headache on this kind of thing. Summary While I say that, I will add an addendum by saying that Splunk, mixed with a Splunk expert and the right development resources, can also make the data I just mentioned extremely valuable. It will likely not happen as fast as they make it out to be at a presentation, and it will take more resources than you may have thought, however at the end of your Splunk journey, you will be happy. This article was to help you understand the importance of logs formatting, and how logs are written. We often don't think about our logs proactively, and I encourage you to do so. Resources for Article: Further resources on this subject: Logging and Monitoring [Article] Logging and Reports [Article] Using Events, Interceptors, and Logging Services [Article]