
How-To Tutorials - Data

1204 Articles

iReport in NetBeans

Packt
19 Mar 2010
3 min read
Creating different types of reports inside the NetBeans IDE

The first step is to download the NetBeans IDE and the iReport plugin for it. The iReport plugin for NetBeans is available as a free download from either of the following locations:

  • https://sourceforge.net/projects/ireport/files
  • http://plugins.netbeans.org/PluginPortal/faces/PluginDetailPage.jsp?pluginid=4425

After downloading the plugin, follow these steps to install it in NetBeans:

  1. Start the NetBeans IDE.
  2. Go to Tools | Plugins.
  3. Select the Downloaded tab.
  4. Press Add Plugins… and select the plugin files. For iReport 3.7.0 the plugins are: iReport-3.7.0.nbm, jasperreports-components-plugin-3.7.0.nbm, jasperreportsextensions-plugin-3.7.0.nbm, and jasperserver-plugin-3.7.0.nbm.
  5. After opening the plugin files, a screen listing the selected plugins appears. Check the Install checkbox of ireport-designer, and press the Install button at the bottom of the window.
  6. In the installer that appears, press Next >, and accept the terms of the License Agreement.
  7. If the Verify Certificate dialog box appears, click Continue.
  8. Press Install, and wait for the installer to complete the installation.
  9. When the installation is done, press Finish and close the Plugins dialog. If the IDE asks for a restart, restart it.

Now the IDE is ready for creating reports.

Creating reports

We have already learned how to create various types of reports: reports without parameters, reports with parameters, reports with variables, subreports, crosstab reports, reports with charts and images, and so on. Now we will quickly walk through how to create these reports in NetBeans with the help of the installed iReport plugins.

Creating a NetBeans database JDBC connection

The first step is to create a database connection, which will be used by the report data sources. Follow these steps:

  1. Select the Services tab on the left side of the project window.
  2. Right-click on Databases, and press New Connection….
  3. In the New Database Connection dialog, enter the following under Basic setting, and check the Remember password checkbox:

     Option       Value
     Driver Name  MySQL (Connector/J Driver)
     Host         localhost
     Port         3306
     Database     inventory
     User Name    root
     Password     packt

  4. Press OK. The connection is created, and you can see it under the Services | Databases section.

Creating a report data source

The NetBeans database JDBC connection created previously will be used by a report data source, which in turn will be used by the report. Follow these steps to create the data source:

  1. From the NetBeans toolbar, press the Report Datasources button to open the Connections / Datasources dialog box.
  2. Press New.
  3. Select NetBeans Database JDBC connection, and press Next >.
  4. Enter inventory in the Name field, and from the Connection drop-down list, select jdbc:mysql://localhost:3306/inventory [root on Default schema].
  5. Press Test, and if the connection is successful, press Save and close the Connections / Datasources dialog box.
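The NetBeans steps above assume the MySQL credentials are valid. As an optional sanity check that is not part of the original recipe, the same parameters can be verified outside the IDE with any MySQL client; the following minimal Python sketch assumes the mysql-connector-python package is installed.

    import mysql.connector  # pip install mysql-connector-python

    # Same parameters as the NetBeans connection above; adjust if your setup differs.
    conn = mysql.connector.connect(
        host="localhost",
        port=3306,
        database="inventory",
        user="root",
        password="packt",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT DATABASE(), VERSION()")
    print(cursor.fetchone())  # e.g. ('inventory', '5.x.x')
    cursor.close()
    conn.close()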


Distributed TensorFlow: Working with multiple GPUs and servers

Kunal Chaudhari
07 May 2018
12 min read
Some neural network models are so large that they cannot fit in the memory of a single device (GPU). Such models need to be split over many devices, with training carried out in parallel across them. This means anyone can now scale out distributed training to hundreds of GPUs using TensorFlow. But that's not the only advantage of distributed TensorFlow: you can also massively reduce experimentation time by running many experiments in parallel on many GPUs and servers. Today, we will discuss distributed TensorFlow and present a number of recipes for working with TensorFlow, GPUs, and multiple servers.

Working with TensorFlow and GPUs

We will learn how to use TensorFlow with GPUs: the operation performed is a simple matrix multiplication, either on the CPU or on the GPU.

Getting ready

The first step is to install a version of TensorFlow that supports GPUs. The official TensorFlow installation instructions are your starting point. Remember that you need an environment that supports GPUs via CUDA and cuDNN.

How to do it...

We proceed with the recipe as follows:

  1. Start by importing a few modules:

     import sys
     import numpy as np
     import tensorflow as tf
     from datetime import datetime

  2. Get from the command line the type of processing unit you want to use (either "gpu" or "cpu") and the matrix size:

     device_name = sys.argv[1]  # Choose device from cmd line. Options: gpu or cpu
     shape = (int(sys.argv[2]), int(sys.argv[2]))
     if device_name == "gpu":
         device_name = "/gpu:0"
     else:
         device_name = "/cpu:0"

  3. Execute the matrix multiplication on either the GPU or the CPU. The key instruction is with tf.device(device_name). It creates a context manager that tells TensorFlow to perform those operations on either the GPU or the CPU:

     with tf.device(device_name):
         random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
         dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
         sum_operation = tf.reduce_sum(dot_operation)

     startTime = datetime.now()
     with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
         result = session.run(sum_operation)
         print(result)

  4. Print some debug timing to see the difference between CPU and GPU:

     print("Shape:", shape, "Device:", device_name)
     print("Time taken:", datetime.now() - startTime)

How it works...

This recipe shows how to assign TensorFlow computations to either CPUs or GPUs. The code is quite simple, and it will be used as a basis for the next recipe.

Playing with Distributed TensorFlow: multiple GPUs and one CPU

We will show an example of data parallelism where data is split across multiple GPUs.

Getting ready

This recipe is inspired by a good blog post written by Neil Tenenholtz and available online: https://clindatsci.com/blog/2017/5/31/distributed-tensorflow

How to do it...

We proceed with the recipe as follows:

  1. Consider this piece of code, which runs a matrix multiplication on a single GPU:

     # single GPU (baseline)
     import tensorflow as tf

     # place the initial data on the cpu
     with tf.device('/cpu:0'):
         input_data = tf.Variable([[1., 2., 3.],
                                   [4., 5., 6.],
                                   [7., 8., 9.],
                                   [10., 11., 12.]])
         b = tf.Variable([[1.], [1.], [2.]])

     # compute the result on the 0th gpu
     with tf.device('/gpu:0'):
         output = tf.matmul(input_data, b)

     # create a session and run
     with tf.Session() as sess:
         sess.run(tf.global_variables_initializer())
         print(sess.run(output))

  2. Partition the code with in-graph replication across two different GPUs, as in the following snippet. Note that the CPU acts as the master node, distributing the graph and collecting the final results.
     # in-graph replication
     import tensorflow as tf

     num_gpus = 2

     # place the initial data on the cpu
     with tf.device('/cpu:0'):
         input_data = tf.Variable([[1., 2., 3.],
                                   [4., 5., 6.],
                                   [7., 8., 9.],
                                   [10., 11., 12.]])
         b = tf.Variable([[1.], [1.], [2.]])

     # split the data into chunks for each gpu
     inputs = tf.split(input_data, num_gpus)
     outputs = []

     # loop over available gpus and pass input data
     for i in range(num_gpus):
         with tf.device('/gpu:' + str(i)):
             outputs.append(tf.matmul(inputs[i], b))

     # merge the results of the devices
     with tf.device('/cpu:0'):
         output = tf.concat(outputs, axis=0)

     # create a session and run
     with tf.Session() as sess:
         sess.run(tf.global_variables_initializer())
         print(sess.run(output))

How it works...

This is a very simple recipe where the graph is split into two parts by the CPU, which acts as master, and distributed to two GPUs, which act as distributed workers. The result of the computation is collected back on the CPU.

Playing with Distributed TensorFlow: multiple servers

We will learn how to distribute a TensorFlow computation across multiple servers. The key assumption is that the code is the same for both the workers and the parameter servers; the role of each computation node is therefore passed in as a command-line argument.

Getting ready

Again, this recipe is inspired by a good blog post written by Neil Tenenholtz and available online: https://clindatsci.com/blog/2017/5/31/distributed-tensorflow

How to do it...

We proceed with the recipe as follows:

  1. Consider this piece of code, where we specify the cluster architecture with one parameter server running on 192.168.1.1:1111 and two workers running on 192.168.1.2:1111 and 192.168.1.3:1111, respectively:

     import sys
     import tensorflow as tf

     # specify the cluster's architecture
     cluster = tf.train.ClusterSpec({'ps': ['192.168.1.1:1111'],
                                     'worker': ['192.168.1.2:1111',
                                                '192.168.1.3:1111']})

  2. Note that the code is replicated on multiple machines, so it is important to know the role of the current execution node. We get this information from the command line. A machine can be either a worker or a parameter server (ps):

     # parse command-line to specify machine
     job_type = sys.argv[1]       # job type: "worker" or "ps"
     task_idx = int(sys.argv[2])  # index of the job in the worker or ps list
                                  # as defined in the ClusterSpec

  3. Run the training server where, given the cluster, we give each computational node a role (either worker or ps) and an id:

     # create TensorFlow Server. This is how the machines communicate.
     server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx)

  4. The computation differs according to the role of the specific computation node. If the role is parameter server, the node simply joins the server. Note that in this case there is no code to execute, because the workers continuously push updates and the only thing the parameter server has to do is wait. Otherwise, the worker code is executed on a specific device within the cluster. This part of the code is similar to the one executed on a single machine, where we first build the model and then train it locally. Note that all the distribution of the work and the collection of the updated results are done transparently by TensorFlow, which provides a convenient tf.train.replica_device_setter that automatically assigns operations to devices.

     # parameter server is updated by remote clients.
     # will not proceed beyond this if statement.
     if job_type == 'ps':
         server.join()
     else:
         # workers only
         with tf.device(tf.train.replica_device_setter(
                 worker_device='/job:worker/task:' + str(task_idx),
                 cluster=cluster)):
             # build your model here as if you were only using a single machine
             with tf.Session(server.target):
                 # train your model here
                 pass

How it works...

We have seen how to create a cluster with multiple computation nodes. A node can play either the role of a parameter server or the role of a worker. In both cases the code executed is the same, but the execution differs according to the parameters collected from the command line. The parameter server only needs to wait until the workers send updates. Note that tf.train.replica_device_setter(..) takes the role of automatically assigning operations to available devices, while tf.train.ClusterSpec(..) is used for the cluster setup.

There is more...

An example of distributed training for MNIST is available online. In addition to that, you can decide to have more than one parameter server for efficiency reasons: using multiple parameter servers can provide better network utilization and allows models to scale across more parallel machines.

Training a Distributed TensorFlow MNIST classifier

In this recipe, we train a full MNIST classifier in a distributed way. This recipe is inspired by the blog post at http://ischlag.github.io/2016/06/12/async-distributed-tensorflow/, and the code, running on TensorFlow 1.2, is available at https://github.com/ischlag/distributed-tensorflow-example

Getting ready

This recipe is based on the previous one, so it might be convenient to read them in order.

How to do it...

We proceed with the recipe as follows:

  1. Import a few standard modules and define the TensorFlow cluster where the computation is run. Then start a server for a specific task:

     import tensorflow as tf
     import sys
     import time

     # cluster specification
     parameter_servers = ["pc-01:2222"]
     workers = ["pc-02:2222", "pc-03:2222", "pc-04:2222"]
     cluster = tf.train.ClusterSpec({"ps": parameter_servers,
                                     "worker": workers})

     # input flags
     tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
     tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
     FLAGS = tf.app.flags.FLAGS

     # start a server for a specific task
     server = tf.train.Server(cluster,
                              job_name=FLAGS.job_name,
                              task_index=FLAGS.task_index)

  2. Read the MNIST data and define the hyperparameters used for training:

     # config
     batch_size = 100
     learning_rate = 0.0005
     training_epochs = 20
     logs_path = "/tmp/mnist/1"

     # load mnist data set
     from tensorflow.examples.tutorials.mnist import input_data
     mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

  3. Check whether your role is parameter server or worker. If worker, define a simple dense neural network, define an optimizer, and define the metric used for evaluating the classifier (for example, accuracy).
     if FLAGS.job_name == "ps":
         server.join()
     elif FLAGS.job_name == "worker":
         # Between-graph replication
         with tf.device(tf.train.replica_device_setter(
                 worker_device="/job:worker/task:%d" % FLAGS.task_index,
                 cluster=cluster)):

             # count the number of updates
             global_step = tf.get_variable('global_step', [],
                                           initializer=tf.constant_initializer(0),
                                           trainable=False)

             # input images
             with tf.name_scope('input'):
                 # None -> batch size can be any size, 784 -> flattened mnist image
                 x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input")
                 # target 10 output classes
                 y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input")

             # model parameters will change during training so we use tf.Variable
             tf.set_random_seed(1)
             with tf.name_scope("weights"):
                 W1 = tf.Variable(tf.random_normal([784, 100]))
                 W2 = tf.Variable(tf.random_normal([100, 10]))

             # bias
             with tf.name_scope("biases"):
                 b1 = tf.Variable(tf.zeros([100]))
                 b2 = tf.Variable(tf.zeros([10]))

             # implement model
             with tf.name_scope("softmax"):
                 # y is our prediction
                 z2 = tf.add(tf.matmul(x, W1), b1)
                 a2 = tf.nn.sigmoid(z2)
                 z3 = tf.add(tf.matmul(a2, W2), b2)
                 y = tf.nn.softmax(z3)

             # specify cost function
             with tf.name_scope('cross_entropy'):
                 # this is our cost
                 cross_entropy = tf.reduce_mean(
                     -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

             # specify optimizer
             with tf.name_scope('train'):
                 # optimizer is an "operation" which we can execute in a session
                 grad_op = tf.train.GradientDescentOptimizer(learning_rate)
                 train_op = grad_op.minimize(cross_entropy, global_step=global_step)

             with tf.name_scope('Accuracy'):
                 # accuracy
                 correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
                 accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

             # create a summary for our cost and accuracy
             tf.summary.scalar("cost", cross_entropy)
             tf.summary.scalar("accuracy", accuracy)

             # merge all summaries into a single "operation" which we can execute in a session
             summary_op = tf.summary.merge_all()
             init_op = tf.global_variables_initializer()
             print("Variables initialized ...")

  4. Start a supervisor, which acts as the chief machine for the distributed setting. The chief is the worker machine that manages the rest of the cluster. The session is maintained by the chief, and the key instruction is sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0)). Also, with prepare_or_wait_for_session(server.target), the supervisor waits for the model to be ready for use. Note that each worker works on different batches of the training data, and the final model is then made available to the chief.
     sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0))
     begin_time = time.time()
     frequency = 100

     with sv.prepare_or_wait_for_session(server.target) as sess:
         # create log writer object (this will log on every machine)
         writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())

         # perform training cycles
         start_time = time.time()
         for epoch in range(training_epochs):
             # number of batches in one epoch
             batch_count = int(mnist.train.num_examples / batch_size)
             count = 0
             for i in range(batch_count):
                 batch_x, batch_y = mnist.train.next_batch(batch_size)
                 # perform the operations we defined earlier on batch
                 _, cost, summary, step = sess.run(
                     [train_op, cross_entropy, summary_op, global_step],
                     feed_dict={x: batch_x, y_: batch_y})
                 writer.add_summary(summary, step)
                 count += 1
                 if count % frequency == 0 or i + 1 == batch_count:
                     elapsed_time = time.time() - start_time
                     start_time = time.time()
                     print("Step: %d," % (step + 1),
                           " Epoch: %2d," % (epoch + 1),
                           " Batch: %3d of %3d," % (i + 1, batch_count),
                           " Cost: %.4f," % cost,
                           "AvgTime: %3.2fms" % float(elapsed_time * 1000 / frequency))
                     count = 0
         print("Test-Accuracy: %2.2f" % sess.run(accuracy,
               feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
         print("Total Time: %3.2fs" % float(time.time() - begin_time))
         print("Final Cost: %.4f" % cost)
     sv.stop()
     print("done")

How it works...

This recipe describes an example of a distributed MNIST classifier. In this example, TensorFlow lets us define a cluster of four machines: one acts as the parameter server, and the other three are workers operating on separate batches of the training data.

If you enjoyed this excerpt, check out the book TensorFlow 1.x Deep Learning Cookbook, to become an expert in implementing deep learning techniques in real-world applications.
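As a quick aside that is not part of the original recipes, the device-placement recipe above can be exercised even on a machine without a GPU by enabling soft placement, which falls back to the CPU when the requested device is missing. The following minimal TensorFlow 1.x sketch is illustrative; the matrix shape is arbitrary.

    import tensorflow as tf

    # allow_soft_placement: fall back to the CPU if "/gpu:0" is unavailable
    # log_device_placement: print the device actually chosen for each op
    config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)

    with tf.device("/gpu:0"):
        m = tf.random_uniform((500, 500))
        total = tf.reduce_sum(tf.matmul(m, tf.transpose(m)))

    with tf.Session(config=config) as sess:
        print(sess.run(total))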


2 ways to customize your deep learning models with Keras

Amey Varangaonkar
22 Dec 2017
8 min read
[Note: The following extract is taken from the book Deep Learning with Keras, co-authored by Antonio Gulli and Sujit Pal.]

Keras has a lot of built-in functionality for building all of your deep learning models without much need for customization. In this article, the authors explain how your Keras models can be customized for better and more efficient deep learning.

As you will recall, Keras is a high-level API that delegates to either a TensorFlow or Theano backend for the computational heavy lifting. Any code you build for your customization will call out to one of these backends. In order to keep your code portable across the two backends, your custom code should use the Keras backend API (https://keras.io/backend/), which provides a set of functions that act like a facade over your chosen backend. Depending on the backend selected, a call to the backend facade translates to the appropriate TensorFlow or Theano call. The full list of available functions and their detailed descriptions can be found on the Keras backend page.

In addition to portability, using the backend API also results in more maintainable code, since Keras code is generally more high-level and compact than equivalent TensorFlow or Theano code. In the unlikely case that you do need to switch to using the backend directly, your Keras components can be used directly inside TensorFlow (not Theano, though) code, as described in this Keras blog post (https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html).

Customizing Keras typically means writing your own custom layer or custom distance function. In this section, we will demonstrate how to build some simple Keras layers. You will see more examples of using the backend functions to build other custom Keras components, such as objectives (loss functions), in subsequent sections.

Keras example - using the lambda layer

Keras provides a Lambda layer; it can wrap a function of your choosing. For example, if you wanted to build a layer that squares its input tensor element-wise, you can simply say:

    model.add(Lambda(lambda x: x ** 2))

You can also wrap your own functions within a Lambda layer. For example, if you want to build a custom layer that computes the element-wise Euclidean distance between two input tensors, you would define the function to compute the value itself, as well as one that returns the output shape of this function, like so:

    def euclidean_distance(vecs):
        x, y = vecs
        return K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True))

    def euclidean_distance_output_shape(shapes):
        shape1, shape2 = shapes
        return (shape1[0], 1)

You can then call these functions using the Lambda layer as follows (a complete sketch appears at the end of this article):

    lhs_input = Input(shape=(VECTOR_SIZE,))
    lhs = Dense(1024, kernel_initializer="glorot_uniform", activation="relu")(lhs_input)
    rhs_input = Input(shape=(VECTOR_SIZE,))
    rhs = Dense(1024, kernel_initializer="glorot_uniform", activation="relu")(rhs_input)
    sim = Lambda(euclidean_distance,
                 output_shape=euclidean_distance_output_shape)([lhs, rhs])

Keras example - building a custom normalization layer

While the Lambda layer can be very useful, sometimes you need more control. As an example, we will look at the code for a normalization layer that implements a technique called local response normalization.
This technique normalizes the input over local input regions, but it has since fallen out of favor because it turned out not to be as effective as other regularization methods, such as dropout and batch normalization, and better initialization methods.

Building custom layers typically involves working with the backend functions, so it involves thinking about the code in terms of tensors. As you will recall, working with tensors is a two-step process. First, you define the tensors and arrange them in a computation graph, and then you run the graph with actual data. So working at this level is harder than working in the rest of Keras. The Keras documentation has some guidelines for building custom layers (https://keras.io/layers/writing-your-own-keras-layers/), which you should definitely read.

One of the ways to make it easier to develop code against the backend API is to have a small test harness that you can run to verify that your code is doing what you want it to do. Here is a small harness I adapted from the Keras source to run your layer against some input and return a result:

    import numpy as np
    from keras.models import Sequential

    def test_layer(layer, x):
        layer_config = layer.get_config()
        layer_config["input_shape"] = x.shape
        layer = layer.__class__.from_config(layer_config)
        model = Sequential()
        model.add(layer)
        model.compile("rmsprop", "mse")
        x_ = np.expand_dims(x, axis=0)
        return model.predict(x_)[0]

And here are some tests with layer objects provided by Keras to make sure that the harness runs okay:

    from keras.layers.core import Dropout, Reshape
    from keras.layers.convolutional import ZeroPadding2D
    import numpy as np

    x = np.random.randn(10, 10)
    layer = Dropout(0.5)
    y = test_layer(layer, x)
    assert(x.shape == y.shape)

    x = np.random.randn(10, 10, 3)
    layer = ZeroPadding2D(padding=(1, 1))
    y = test_layer(layer, x)
    assert(x.shape[0] + 2 == y.shape[0])
    assert(x.shape[1] + 2 == y.shape[1])

    x = np.random.randn(10, 10)
    layer = Reshape((5, 20))
    y = test_layer(layer, x)
    assert(y.shape == (5, 20))

Before we begin building our local response normalization layer, we need to take a moment to understand what it really does. This technique was originally used with Caffe, and the Caffe documentation (http://caffe.berkeleyvision.org/tutorial/layers/lrn.html) describes it as a kind of lateral inhibition that works by normalizing over local input regions. In ACROSS_CHANNEL mode, the local regions extend across nearby channels but have no spatial extent. In WITHIN_CHANNEL mode, the local regions extend spatially, but are in separate channels. We will implement the WITHIN_CHANNEL mode, in which each activation is divided by (k + alpha times the mean of the squared activations over its local n x n spatial neighborhood), raised to the power beta.

The code for the custom layer follows the standard structure. The __init__ method is used to set the application-specific parameters, that is, the hyperparameters associated with the layer. Since our layer only does a forward computation and doesn't have any learnable weights, all we do in the build method is set the input shape and delegate to the superclass's build method, which takes care of any necessary book-keeping. In layers where learnable weights are involved, this method is where you would set their initial values. The call method does the actual computation. Notice that we need to account for dimension ordering. Another thing to note is that the batch size is usually unknown at design time, so you need to write your operations so that the batch size is not explicitly referenced.
The computation itself is fairly straightforward and follows the formula closely. The sum in the denominator can also be thought of as average pooling over the row and column dimensions with a pool size of (n, n), "same" padding, and a stride of (1, 1). Because the pooled data is already averaged, we no longer need to divide the sum by the window size ourselves. The last part of the class is the get_output_shape_for method. Since the layer normalizes each element of the input tensor, the output size is identical to the input size:

    from keras import backend as K
    from keras.engine.topology import Layer, InputSpec

    class LocalResponseNormalization(Layer):

        def __init__(self, n=5, alpha=0.0005, beta=0.75, k=2, **kwargs):
            self.n = n
            self.alpha = alpha
            self.beta = beta
            self.k = k
            super(LocalResponseNormalization, self).__init__(**kwargs)

        def build(self, input_shape):
            self.shape = input_shape
            super(LocalResponseNormalization, self).build(input_shape)

        def call(self, x, mask=None):
            if K.image_dim_ordering() == "th":
                _, f, r, c = self.shape
            else:
                _, r, c, f = self.shape
            squared = K.square(x)
            pooled = K.pool2d(squared, (self.n, self.n), strides=(1, 1),
                              padding="same", pool_mode="avg")
            if K.image_dim_ordering() == "th":
                summed = K.sum(pooled, axis=1, keepdims=True)
                averaged = self.alpha * K.repeat_elements(summed, f, axis=1)
            else:
                summed = K.sum(pooled, axis=3, keepdims=True)
                averaged = self.alpha * K.repeat_elements(summed, f, axis=3)
            denom = K.pow(self.k + averaged, self.beta)
            return x / denom

        def get_output_shape_for(self, input_shape):
            return input_shape

You can test this layer during development using the test harness we described here. It is easier to run this instead of trying to build a whole network to put this into, or worse, waiting till you have fully specified the layer before running it:

    x = np.random.randn(225, 225, 3)
    layer = LocalResponseNormalization()
    y = test_layer(layer, x)
    assert(x.shape == y.shape)

Now that you have a good idea of how to build a custom Keras layer, you might find it instructive to look at Keunwoo Choi's mel-spectrogram layer (https://keunwoochoi.wordpress.com/2016/11/18/for-beginners-writing-a-custom-keras-layer/).

Though building custom Keras layers is fairly commonplace for experienced Keras developers, such layers may not be widely useful in a general context. Custom layers are usually built to serve a specific, narrow purpose, depending on the use case in question, and Keras gives you enough flexibility to do so with ease.

If you found our post useful, make sure to check out our best selling title Deep Learning with Keras, for other intriguing deep learning concepts and their implementation using Keras.
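As promised above, here is a minimal, self-contained sketch that wires the Lambda-based Euclidean distance layer into a small functional-API model and runs it on random data. It assumes Keras 2.x with the TensorFlow backend; VECTOR_SIZE, the Dense layer width, and the random inputs are illustrative choices, not values from the book.

    import numpy as np
    from keras import backend as K
    from keras.layers import Input, Dense, Lambda
    from keras.models import Model

    VECTOR_SIZE = 32  # illustrative

    def euclidean_distance(vecs):
        x, y = vecs
        return K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True))

    def euclidean_distance_output_shape(shapes):
        shape1, shape2 = shapes
        return (shape1[0], 1)

    lhs_input = Input(shape=(VECTOR_SIZE,))
    rhs_input = Input(shape=(VECTOR_SIZE,))
    lhs = Dense(64, activation="relu")(lhs_input)
    rhs = Dense(64, activation="relu")(rhs_input)
    sim = Lambda(euclidean_distance,
                 output_shape=euclidean_distance_output_shape)([lhs, rhs])

    model = Model(inputs=[lhs_input, rhs_input], outputs=sim)
    model.compile(optimizer="rmsprop", loss="mse")

    # one distance value per pair of input vectors
    out = model.predict([np.random.rand(4, VECTOR_SIZE),
                         np.random.rand(4, VECTOR_SIZE)])
    print(out.shape)  # (4, 1)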


Three.js - Materials and Texture

Packt
06 Feb 2015
11 min read
In this article by Jos Dirksen, author of the book Three.js Cookbook, we will learn how Three.js offers a large number of different materials and supports many different types of textures. These textures provide a great way to create interesting effects and graphics. In this article, we'll show you recipes that allow you to get the most out of these components provided by Three.js.

Using HTML canvas as a texture

Most often when you use textures, you use static images. With Three.js, however, it is also possible to create interactive textures. In this recipe, we will show you how you can use an HTML5 canvas element as an input for your texture. Any change to this canvas is automatically reflected, after you inform Three.js about the change, in the texture used on the geometry.

Getting ready

For this recipe, we need an HTML5 canvas element that can be displayed as a texture. We could create one ourselves and add some output, but for this recipe we've chosen something else: a simple JavaScript library that draws a clock onto a canvas element (see the 04.03-use-html-canvas-as-texture.html example). The JavaScript used to render the clock was based on the code from this site: http://saturnboy.com/2013/10/html5-canvas-clock/. To include the code that renders the clock in our page, we need to add the following to the head element:

    <script src="../libs/clock.js"></script>

How to do it...

To use a canvas as a texture, we need to perform a couple of steps:

  1. The first thing we need to do is create the canvas element:

     var canvas = document.createElement('canvas');
     canvas.width = 512;
     canvas.height = 512;

     Here, we create an HTML canvas element programmatically and give it a fixed width and height.

  2. Now that we've got a canvas, we need to render the clock we use as the input for this recipe onto it. The library is very easy to use; all you have to do is pass in the canvas element we just created:

     clock(canvas);

  3. At this point, we've got a canvas that renders and updates an image of a clock. What we need to do now is create a geometry and a material and use this canvas element as a texture for this material:

     var cubeGeometry = new THREE.BoxGeometry(10, 10, 10);
     var cubeMaterial = new THREE.MeshLambertMaterial();
     cubeMaterial.map = new THREE.Texture(canvas);
     var cube = new THREE.Mesh(cubeGeometry, cubeMaterial);

     To create a texture from a canvas element, all we need to do is create a new instance of THREE.Texture and pass in the canvas element we created in step 1. We assign this texture to the cubeMaterial.map property, and that's it.

  4. If you run the recipe at this step, you will see the clock rendered on the sides of the cube. However, the clock won't update itself. We need to tell Three.js that the canvas element has changed. We do this by adding the following to the rendering loop:

     cubeMaterial.map.needsUpdate = true;

     This informs Three.js that our canvas texture has changed and needs to be updated the next time the scene is rendered.

With these four simple steps, you can easily create interactive textures and use everything you can draw on a canvas element as a texture in Three.js.

How it works...

How this works is actually pretty simple. Three.js uses WebGL to render scenes and apply textures. WebGL has native support for using HTML canvas elements as textures, so Three.js just passes the provided canvas element on to WebGL, and it is processed like any other texture.
Making part of an object transparent

You can create a lot of interesting visualizations using the various materials available with Three.js. In this recipe, we'll look at how you can use them to make part of an object transparent. This will allow you to create complex-looking geometries with relative ease.

Getting ready

Before we dive into the required steps in Three.js, we first need the texture that we will use to make an object partially transparent. For this recipe, we use a texture that was created in Photoshop. You don't have to use Photoshop; the only thing you need to keep in mind is that you use an image with a transparent background. Using this texture, in this recipe we'll show you how to render a sphere where only part of the surface is visible, so you can look through the sphere and see its far side (see the 04.08-make-part-of-object-transparent.html example).

How to do it...

Let's look at the steps you need to take to accomplish this:

  1. The first thing we do is create the geometry. For this recipe, we use THREE.SphereGeometry:

     var sphereGeometry = new THREE.SphereGeometry(6, 20, 20);

     Just like in all the other recipes, you can use whatever geometry you want.

  2. In the second step, we create the material:

     var mat = new THREE.MeshPhongMaterial();
     mat.map = new THREE.ImageUtils.loadTexture(
         "../assets/textures/partial-transparency.png");
     mat.transparent = true;
     mat.side = THREE.DoubleSide;
     mat.depthWrite = false;
     mat.color = new THREE.Color(0xff0000);

     As you can see in this fragment, we create THREE.MeshPhongMaterial and load the texture we saw in the Getting ready section of this recipe. To render this correctly, we also need to set the side property to THREE.DoubleSide so that the inside of the sphere is also rendered, and we need to set the depthWrite property to false. This tells WebGL that we still want to test our vertices against the WebGL depth buffer, but we don't write to it. Often, you need to set this to false when working with more complex transparent objects or particles.

  3. Finally, add the sphere to the scene:

     var sphere = new THREE.Mesh(sphereGeometry, mat);
     scene.add(sphere);

With these simple steps, you can create really interesting effects just by experimenting with textures and geometries.

There's more...

With Three.js, it is possible to repeat textures (refer to the Setup repeating textures recipe). You can use this to create interesting-looking objects. The code required to set a texture to repeat is the following:

    var mat = new THREE.MeshPhongMaterial();
    mat.map = new THREE.ImageUtils.loadTexture(
        "../assets/textures/partial-transparency.png");
    mat.transparent = true;
    mat.map.wrapS = mat.map.wrapT = THREE.RepeatWrapping;
    mat.map.repeat.set(4, 4);
    mat.depthWrite = false;
    mat.color = new THREE.Color(0x00ff00);

By changing the mat.map.repeat.set values, you define how often the texture is repeated.

Using a cubemap to create reflective materials

With the approach Three.js uses to render scenes in real time, it is difficult and very computationally intensive to create reflective materials. Three.js, however, provides a way to cheat and approximate reflectivity. For this, Three.js uses cubemaps. In this recipe, we'll explain how to create cubemaps and use them to create reflective materials.

Getting ready

A cubemap is a set of six images that can be mapped to the inside of a cube.
They can be created from a panorama picture. In Three.js, we map such a cubemap onto the inside of a cube or sphere and use that information to calculate reflections (see example 04.10-use-reflections.html): the objects in the center of the scene reflect the environment they are in. This is something often called a skybox.

To get ready, the first thing we need to do is get a cubemap. If you search the Internet, you can find some ready-to-use cubemaps, but it is also very easy to create one yourself. For this, go to http://gonchar.me/panorama/. On this page, you can upload a panoramic picture and it will be converted to a set of pictures you can use as a cubemap. Perform the following steps:

  1. First, get a 360-degree panoramic picture. Once you have one, upload it to the http://gonchar.me/panorama/ website by clicking on the large OPEN button.
  2. Once uploaded, the tool converts the panorama picture to a cubemap.
  3. When the conversion is done, you can download the various cubemap sides. The recipe in this book uses the naming convention provided by the Cube map sides option, so download those. You'll end up with six images with names such as right.png, left.png, top.png, bottom.png, front.png, and back.png.

Once you've got the sides of the cubemap, you're ready to perform the steps in the recipe.

How to do it...

To use the cubemap we created in the previous section and create a reflecting material, we need to perform a fair number of steps, but it isn't that complex:

  1. The first thing you need to do is create an array from the cubemap images you downloaded:

     var urls = [
         '../assets/cubemap/flowers/right.png',
         '../assets/cubemap/flowers/left.png',
         '../assets/cubemap/flowers/top.png',
         '../assets/cubemap/flowers/bottom.png',
         '../assets/cubemap/flowers/front.png',
         '../assets/cubemap/flowers/back.png'
     ];

  2. With this array, we can create a cubemap texture like this:

     var cubemap = THREE.ImageUtils.loadTextureCube(urls);
     cubemap.format = THREE.RGBFormat;

  3. From this cubemap, we can use THREE.BoxGeometry and a custom THREE.ShaderMaterial object to create a skybox (the environment surrounding our meshes):

     var shader = THREE.ShaderLib["cube"];
     shader.uniforms["tCube"].value = cubemap;
     var material = new THREE.ShaderMaterial({
         fragmentShader: shader.fragmentShader,
         vertexShader: shader.vertexShader,
         uniforms: shader.uniforms,
         depthWrite: false,
         side: THREE.DoubleSide
     });

     // create the skybox
     var skybox = new THREE.Mesh(new THREE.BoxGeometry(10000, 10000, 10000), material);
     scene.add(skybox);

     Three.js provides a custom shader (a piece of WebGL code) that we can use for this. As you can see in the code snippet, to use this WebGL code, we need to define a THREE.ShaderMaterial object. With this material, we create a giant THREE.BoxGeometry object that we add to the scene.

  4. Now that we've created the skybox, we can define the reflecting objects:

     var sphereGeometry = new THREE.SphereGeometry(4, 15, 15);
     var envMaterial = new THREE.MeshBasicMaterial({envMap: cubemap});
     var sphere = new THREE.Mesh(sphereGeometry, envMaterial);

     As you can see, we also pass in the cubemap we created as a property (envMap) of the material. This informs Three.js that this object is positioned inside a skybox, defined by the images that make up the cubemap.
  5. The last step is to add the object to the scene, and that's it:

     scene.add(sphere);

In the example at the beginning of this recipe, you saw three geometries. You can use this approach with all different types of geometries; Three.js will determine how to render the reflective area.

How it works...

Three.js itself doesn't really do that much to render the cubemap object. It relies on standard functionality provided by WebGL. In WebGL, there is a construct called samplerCube. With samplerCube, you can sample, based on a specific direction, which color matches the cubemap object. Three.js uses this to determine the color value for each part of the geometry. The result is that on each mesh, you can see a reflection of the surrounding cubemap, using the WebGL textureCube function. In Three.js, this results in the following call (taken from the WebGL shader in GLSL):

    vec4 cubeColor = textureCube(tCube, vec3(-vReflect.x, vReflect.yz));

A more in-depth explanation of how this works can be found at http://codeflow.org/entries/2011/apr/18/advanced-webgl-part-3-irradiance-environment-map/#cubemap-lookup.

There's more...

In this recipe, we created the cubemap object by providing six separate images. There is, however, an alternative way to create the cubemap object. If you've got a 360-degree panoramic image, you can use the following code to create a cubemap object directly from that image:

    var texture = THREE.ImageUtils.loadTexture('360-degrees.png', new THREE.UVMapping());

Normally when you create a cubemap object, you use the code shown in this recipe to map it to a skybox. This usually gives the best results but requires some extra code. You can also use THREE.SphereGeometry to create a skybox like this:

    var mesh = new THREE.Mesh(
        new THREE.SphereGeometry(500, 60, 40),
        new THREE.MeshBasicMaterial({ map: texture }));
    mesh.scale.x = -1;

This applies the texture to a sphere and, with mesh.scale, turns the sphere inside out. Besides reflection, you can also use a cubemap object for refraction (think of light bending through water drops or glass objects). All you have to do to make a refractive material is load the cubemap object like this:

    var cubemap = THREE.ImageUtils.loadTextureCube(urls, new THREE.CubeRefractionMapping());

And define the material in the following way:

    var envMaterial = new THREE.MeshBasicMaterial({envMap: cubemap});
    envMaterial.refractionRatio = 0.95;

Summary

In this article, we learned about the different textures and materials supported by Three.js.


4 common challenges in Web Scraping and how to handle them

Amarabha Banerjee
08 Mar 2018
13 min read
[Note: Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step-by-step tutorials on how to leverage Python programming techniques for ethical web scraping.]

In this article, we will explore the primary challenges of web scraping and how to handle them easily. Developing a reliable scraper is never easy; there are so many "what ifs" that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover all what ifs, we will discuss some common traps, challenges, and workarounds.

Note that several of the recipes require access to a website that I have provided as a Docker container. They require more logic than the simple, static site we used in earlier chapters. Therefore, you will need to pull and run a Docker container using the following Docker commands:

    docker pull mheydt/pywebscrapecookbook
    docker run -p 5001:5001 pywebscrapecookbook

Retrying failed page downloads

Failed page requests can be easily handled by Scrapy using retry middleware. When installed, Scrapy will attempt retries when receiving the following HTTP error codes: [500, 502, 503, 504, 408]. The process can be further configured using the following parameters:

  • RETRY_ENABLED (True/False - default is True)
  • RETRY_TIMES (# of times to retry on any errors - default is 2)
  • RETRY_HTTP_CODES (a list of HTTP error codes which should be retried - default is [500, 502, 503, 504, 408])

How to do it

The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy:

    process = CrawlerProcess({
        'LOG_LEVEL': 'DEBUG',
        'DOWNLOADER_MIDDLEWARES': {
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500
        },
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 3
    })
    process.crawl(Spider)
    process.start()

How it works

Scrapy picks up the configuration for retries as specified when the spider is run. When encountering errors, Scrapy will retry up to three times before giving up.

Supporting page redirects

Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the following parameters:

  • REDIRECT_ENABLED (True/False - default is True)
  • REDIRECT_MAX_TIMES (the maximum number of redirections to follow for any single request - default is 20)

How to do it

The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. It configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content. The sitemap contains a large number of redirects, many of which are redirects from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are a few lines demonstrating it:

    Parsing: <200 https://www.nasa.gov/content/earth-expeditions-above/>
    ['http://www.nasa.gov/content/earth-expeditions-above', 'https://www.nasa.gov/content/earth-expeditions-above']

This particular URL was processed after one redirection, from an HTTP to an HTTPS version of the URL. The list defines all of the URLs that were involved in the redirection. You will also be able to see in the output where redirection exceeded the specified level (2).
The following is one example:

    2017-10-22 17:55:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://www.nasa.gov/topics/journeytomars/news/index.html>: max redirections reached

How it works

The spider is defined as follows:

    class Spider(scrapy.spiders.SitemapSpider):
        name = 'spider'
        sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

        def parse(self, response):
            print("Parsing: ", response)
            print(response.request.meta.get('redirect_urls'))

This is identical to our previous NASA sitemap-based crawler, with the addition of one line printing the redirect_urls. In any call to parse, this metadata will contain all redirects that occurred to get to this page. The crawling process is configured with the following code:

    process = CrawlerProcess({
        'LOG_LEVEL': 'DEBUG',
        'DOWNLOADER_MIDDLEWARES': {
            "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 500
        },
        'REDIRECT_ENABLED': True,
        'REDIRECT_MAX_TIMES': 2
    })

Redirects are enabled by default, but this sets the maximum number of redirects to 2 instead of the default of 20.

Waiting for content to be available in Selenium

A common problem with dynamic web pages is that even after the whole page has loaded, and hence the get() method in Selenium has returned, there may still be content that we need to access later, because there are outstanding Ajax requests from the page that are still pending completion. An example of this is needing to click a button, but the button not being enabled until all data has been loaded asynchronously after the page load.

Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button. When pressing the button, we are presented with a progress bar for five seconds, and when this is completed, we are presented with Hello World! Now suppose we want to scrape this page to get the content that is exposed only after the button is pressed and after the wait. How do we do this?

How to do it

We can do this using Selenium. We will use two features of Selenium: the ability to click on page elements, and the ability to wait until an element with a specific ID is available on the page. First, we get the button and click it. The button's HTML is the following:

    <div id='start'>
        <button>Start</button>
    </div>

When the button is pressed and the load completes, the following HTML is added to the document:

    <div id='finish'>
        <h4>Hello World!</h4>
    </div>

We will use the Selenium driver to find the Start button, click it, and then wait until a div with an ID of 'finish' is available. Then we get that element and return the text in the enclosed <h4> tag. You can try this by running 06/03_press_and_wait.py. Its output will be the following:

    clicked
    Hello World!

Now let's see how it worked.

How it works

Let us break down the explanation:

  1. We start by importing the required items from Selenium:

     from selenium import webdriver
     from selenium.webdriver.support import ui

  2. Now we load the driver and the page:

     driver = webdriver.PhantomJS()
     driver.get("http://the-internet.herokuapp.com/dynamic_loading/2")

  3. With the page loaded, we can retrieve the button:

     button = driver.find_element_by_xpath("//*/div[@id='start']/button")

  4. And then we can click the button:

     button.click()
     print("clicked")

  5. Next we create a WebDriverWait object:

     wait = ui.WebDriverWait(driver, 10)

     With this object, we can ask Selenium to wait for certain events. This also sets a maximum wait of 10 seconds.
  6. Now, using this object, we can wait until we meet a criterion: that an element is identifiable using the following XPath:

     wait.until(lambda driver: driver.find_element_by_xpath("//*/div[@id='finish']"))

  7. When this completes, we can retrieve the h4 element and get its enclosed text:

     finish_element = driver.find_element_by_xpath("//*/div[@id='finish']/h4")
     print(finish_element.text)

Limiting crawling to a single domain

We can inform Scrapy to limit the crawl to only pages within a specified set of domains. This is an important task, as links can point to anywhere on the web, and we often want to control where crawls end up going. Scrapy makes this very easy to do. All that needs to be done is setting the allowed_domains field of your scraper class.

How to do it

The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov.

How it works

The code is the same as the previous NASA site crawlers, except that we include allowed_domains=['nasa.gov']:

    class Spider(scrapy.spiders.SitemapSpider):
        name = 'spider'
        sitemap_urls = ['https://www.nasa.gov/sitemap.xml']
        allowed_domains = ['nasa.gov']

        def parse(self, response):
            print("Parsing: ", response)

The NASA site is fairly consistent about staying within its root domain, but there are occasional links to other sites, such as content on boeing.com. This code will prevent moving to those external sites.

Processing infinitely scrolling pages

Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart. While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the web page's Ajax requests and use those for crawling instead of the actual page. Let's look at spidyquotes.herokuapp.com/scroll as an example.

Getting ready

Open http://spidyquotes.herokuapp.com/scroll in your browser. This page will load additional content when you scroll to the bottom of the page. Once the page is open, go into your developer tools and select the network panel. Then, scroll to the bottom of the page; you will see new content appear in the network panel. When we click on one of the links, we can see the following JSON:

    {
        "has_next": true,
        "page": 2,
        "quotes": [{
            "author": {
                "goodreads_link": "/author/show/82952.Marilyn_Monroe",
                "name": "Marilyn Monroe",
                "slug": "Marilyn-Monroe"
            },
            "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"],
            "text": "\u201cThis life is what you make it...."
        }, {
            "author": {
                "goodreads_link": "/author/show/1077326.J_K_Rowling",
                "name": "J.K. Rowling",
                "slug": "J-K-Rowling"
            },
            "tags": ["courage", "friends"],
            "text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d"
        },

This is great, because all we need to do is continually generate requests to /api/quotes?page=x, increasing x as long as the has_next tag exists in the reply document. If there are no more pages, this tag will not be in the document.

How to do it

The 06/05_scrapy_continuous.py file contains a Scrapy agent, which crawls this set of pages.
Run it with your Python interpreter and you will see output similar to the following (these are multiple excerpts from the output):

    <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
    2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
    {'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'Sisters']}
    2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
    {'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
    2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
    {'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'Understand']}

When this gets to page 10, it will stop, as it will see that there is no next-page flag set in the content.

How it works

Let's walk through the spider to see how this works. The spider starts with the following definition of the start URL:

    class Spider(scrapy.Spider):
        name = 'spidyquotes'
        quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes'
        start_urls = [quotes_base_url]
        download_delay = 1.5

The parse method then prints the response and also parses the JSON into the data variable:

    def parse(self, response):
        print(response)
        data = json.loads(response.body)

Then it loops through all the items in the quotes element of the JSON object. For each item, it yields a new Scrapy item back to the Scrapy engine:

        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }

It then checks to see whether the data JSON variable has a 'has_next' property, and if so it gets the next page number and yields a new request back to Scrapy to parse that page:

        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url + "?page=%s" % next_page)

There's more...

It is also possible to process infinitely scrolling pages using Selenium.
The following code is in 06/06_scrape_continuous_twitter.py:

    from selenium import webdriver
    import time

    driver = webdriver.PhantomJS()
    print("Starting")
    driver.get("https://twitter.com")

    scroll_pause_time = 1.5

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        print(last_height)
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(scroll_pause_time)
        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        print(new_height, last_height)
        if new_height == last_height:
            break
        last_height = new_height

The output would be similar to the following:

    Starting
    4882
    8139 4882
    8139
    11630 8139
    11630
    15055 11630
    15055
    15055 15055

    Process finished with exit code 0

This code starts by loading the page from Twitter. The call to .get() returns when the page is fully loaded. The scrollHeight is then retrieved, and the program scrolls to that height and waits a moment for new content to load. The scrollHeight of the browser is retrieved again; if it differs from last_height, the loop continues processing. If it is the same as last_height, no new content has loaded, and you can then continue on and retrieve the HTML for the completed page.

We have discussed the common challenges faced when performing web scraping with Python and their workarounds. If you liked this post, be sure to check out Web Scraping with Python, which consists of useful recipes to work with Python and perform efficient web scraping.
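Pulling the Scrapy settings from the recipes above into one place, here is a minimal combined sketch that enables retries, caps redirects, and limits the crawl to a single domain. The spider name is illustrative, and the downloader-middleware entries are left at Scrapy's defaults, since both the retry and redirect middleware are enabled out of the box.

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class NasaSpider(scrapy.spiders.SitemapSpider):
        name = 'nasa'
        sitemap_urls = ['https://www.nasa.gov/sitemap.xml']
        allowed_domains = ['nasa.gov']  # keep the crawl inside nasa.gov

        def parse(self, response):
            # redirect_urls lists any intermediate URLs followed to reach this page
            print("Parsing:", response, response.request.meta.get('redirect_urls'))

    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 3,          # retry failed downloads up to three times
        'REDIRECT_ENABLED': True,
        'REDIRECT_MAX_TIMES': 2,   # follow at most two redirects per request
    })
    process.crawl(NasaSpider)
    process.start()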

3 different types of generative adversarial networks (GANs) and how they work

Packt Editorial Staff
08 Jan 2020
6 min read
Generative adversarial networks (GANs) have been greeted with real excitement since their creation back in 2014 by Ian Goodfellow and his research team. Yann LeCun, Facebook's Director of AI Research went as far as describing GANs as "the most interesting idea in the last 10 years in ML." With all this excitement, however, it can be easy to miss the subtle diversity of GANs; there are a number of different types of generative adversarial networks, each one working in slightly different ways and helping engineers to achieve slightly different results. To give you a deeper insight on GANs, in this article we'll look at three different generative adversarial networks: SRGANs, CycleGANs, and InfoGANs. We'll explore how these different GANs work and how they can be used. This should give you a solid foundation to explore GANs in more depth and begin to apply them in your own experiments and projects. This article is an excerpt from the book, Deep Learning with TensorFlow 2 and Keras, Second Edition by Antonio Gulli, Amita Kapoor, and Sujit Pal.  SRGAN - Super Resolution GANs Remember seeing a crime-thriller where our hero asks the computer guy to magnify the faded image of the crime scene? With the zoom we are able to see the criminal’s face in detail, including the weapon used and anything engraved upon it! Well, SRGAN can perform similar magic. Here a GAN is trained in such a way that it can generate a photorealistic high-resolution image when given a low-resolution image. The SRGAN architecture consists of three neural networks: a very deep generator network, a discriminator network, and a pretrained VGG-16 network. How do SRGANs work? SRGANs use the perceptual loss function (developed by Johnson et al, Perceptual Losses for Real-Time Style Transfer and Super-Resolution). The difference in the feature map activations in high layers of a VGG network between the network output part and the high-resolution part comprises the perceptual loss function. Besides perceptual loss, the authors further added content loss and an adversarial loss so that images generated look more natural and the finer details more artistic. The perceptual loss is defined as the weighted sum of content loss and adversarial loss: lSR = lSR X+ 10−3×lSRGen The first term on the right-hand side is the content loss, obtained using the feature maps generated by pretrained VGG 19. Mathematically it is the Euclidean distance between the feature map of the reconstructed image (that is the one generated by the generator) and the original high-resolution reference image. The second term on the right-hand side is the adversarial loss. It is the standard generative loss term, designed to ensure that images generated by the generator are able to fool the discriminator. You can see in the following figure taken from the original paper that the image generated by SRGAN is much closer to the original high-resolution image: [caption id="attachment_31006" align="aligncenter" width="907"] image via https://arxiv.org/pdf/1609.04802.pdf[/caption] CycleGAN Another noteworthy architecture is CycleGAN; proposed in 2017, it can perform the task of image translation. Once trained you can translate an image from one domain to another domain. For example, when trained on horse and zebra data set, if you give it an image with horses in the ground, the CycleGAN can convert the horses to zebra with the same background. How does CycleGAN work? Have you ever imagined how a scenery would look if Van Gogh or Manet had painted it? 
We have many sceneries, and many landscapes painted by Gogh/Manet, but we do not have any collection of input-output pairs. CycleGAN performs the image translation, that is, transfers an image given in one domain (scenery for example) to another domain (Van Gogh painting of the same scene, for instance) in the absence of training examples. CycleGAN’s ability to perform image translation in the absence of training pairs is what makes it unique. To achieve image translation the authors of CycleGAN used a very simple and yet effective procedure. They made use of two GANs, the generator of each GAN performing the image translation from one domain to another. To elaborate, let us say the input is X, then the generator of the first GAN performs a mapping G: X → Y, thus its output would be Y = G(X). The generator of the second GAN performs an inverse mapping F: Y → X, resulting in X = F(Y). Each discriminator is trained to distinguish between real images and synthesized images. The idea is shown as follows: To train the combined GANs, the authors added beside the conventional GAN adversarial loss a forward cycle consistency loss (left figure) and a backward cycle consistency loss (right figure). This ensures that if an image X is given as input, then after the two translations F(G(X)) ~ X the obtained image is the same X (similarly the backward cycle consistency loss ensures the G(F(Y)) ~ Y). Following are some of the successful image translations by CycleGAN: Following are few more examples, you can see the translation of seasons (summer → winter), photo → painting and vice versa, horses → zebra: InfoGAN The GAN architectures that we have considered up to now provide us with little or no control over the generated images. InfoGAN changes this; it provides control over various attributes of the images generated. The InfoGAN uses concepts from information theory such that the noise term is transformed into latent codes which provide predictable and systematic control over the output. How does InfoGAN work? The generator in InfoGAN takes two inputs the latent space Z and a latent code c, thus the output of generator is G(Z,c). The GAN is trained such that it maximizes the mutual information between the latent code c and the generated image G(Z,c). The following figure shows the architecture of InfoGAN:   The concatenated vector (Z,c) is fed to the Generator. Q(c|X) is also a neural network, combined with the generator it works to form a mapping between random noise Z and its latent code c_hat, it aims to estimate c given X. This is achieved by adding a regularization term to the objective function of conventional GAN: minDmaxG VI(D,G) = VG(D,G) −λI(c;G(Z,c)) The term VG(D,G) is the loss function of conventional GAN, and the second term is the regularization term, where λ is a constant. Its value was set to 1 in the paper, and I(c;G(Z,c)) is the mutual information between the latent code c and the Generator generated image G(Z,c). Below is the results of InfoGAN on the MNIST dataset: That concludes our brief look at three different types of generative adversarial networks. You can find the book from which this article was taken on the Packt store or you can read the first chapter for free on the Packt subscription platform.
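A note on the two formulas quoted above, which render poorly inline. Reconstructed from the surrounding descriptions (the symbols follow the excerpt's own notation, so treat the exact lettering as the book's rather than the original papers'):

% SRGAN perceptual loss: content loss plus a down-weighted adversarial loss
l^{SR} = l^{SR}_{X} + 10^{-3} \times l^{SR}_{Gen}

% InfoGAN objective: conventional GAN value function minus a mutual-information term
\min_{G} \max_{D} \; V_{I}(D, G) = V_{G}(D, G) - \lambda \, I\bigl(c;\, G(Z, c)\bigr)

Here l^{SR}_X is the VGG-based content loss, l^{SR}_{Gen} is the adversarial loss, λ is a constant (set to 1 in the paper), and I(c; G(Z, c)) is the mutual information between the latent code c and the generated image.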

Implementing face detection using the Haar Cascades and AdaBoost algorithm

Sugandha Lahoti
20 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book serves as an effective guide to using ensemble techniques to enhance machine learning models.[/box] In today’s tutorial, we will learn how to apply the AdaBoost classifier in face detection using Haar cascades. Face detection using Haar cascades Object detection using Haar feature-based cascade classifiers is an effective object detection method proposed by Paul Viola and Michael Jones in their paper Rapid Object Detection using a Boosted Cascade of Simple Features in 2001. It is a machine-learning-based approach where a cascade function is trained from a lot of positive and negative images. It is then used to detect objects in other images. Here, we will work with face detection. Initially, the algorithm needs a lot of positive images (images of faces) and negative images (images without faces) to train the classifier. Then we need to extract features from it. Features are nothing but numerical information extracted from the images that can be used to distinguish one image from another; for example, a histogram (distribution of intensity values) is one of the features that can be used to define several characteristics of an image even without looking at the image, such as dark or bright image, the intensity range of the image, contrast, and so on. We will use Haar features to detect faces in an image. Here is a figure showing different Haar features: These features are just like the convolution kernel; to know about convolution, you need to wait for the following chapters. For a basic understanding, convolutions can be described as in the following figure: So we can summarize convolution with these steps: Pick a pixel location from the image. Now crop a sub-image with the selected pixel as the center from the source image with the same size as the convolution kernel. Calculate an element-wise product between the values of the kernel and sub- image. Add the result of the product. Put the resultant value into the new image at the same place where you picked up the pixel location. Now we are going to do a similar kind of procedure, but with a slight difference for our images. Each feature of ours is a single value obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle. Now, all possible sizes and locations of each kernel are used to calculate plenty of features. (Just imagine how much computation it needs. Even a 24x24 window results in over 160,000 features.) For each feature calculation, we need to find the sum of the pixels under the white and black rectangles. To solve this, we will use the concept of integral image; we will discuss this concept very briefly here, as it's not a part of our context. Integral image Integral images are those images in which the pixel value at any (x,y) location is the sum of the all pixel values present before the current pixel. Its use can be understood by the following example: Image on the left and the integral image on the right. Let's see how this concept can help reduce computation time; let us assume a matrix A of size 5x5 representing an image, as shown here: Now, let's say we want to calculate the average intensity over the area highlighted: Region for addition Normally, you'd do the following: 9 + 1 + 2 + 6 + 0 + 5 + 3 + 6 + 5 = 37 37 / 9 = 4.11 This requires a total of 9 operations. 
Doing the same for 100 such operations would require: 100 * 9 = 900 operations. Now, let us first make a integral image of the preceding image: Making this image requires a total of 56 operations. Again, focus on the highlighted portion: To calculate the avg intensity, all you have to do is: (76 - 20) - (24 - 5) = 37 37 / 9 = 4.11 This required a total of 4 operations. To do this for 100 such operations, we would require: 56 + 100 * 4 = 456 operations. For just a hundred operations over a 5x5 matrix, using an integral image requires about 50% less computations. Imagine the difference it makes for large images and other such operations. Creation of an integral image changes other sum difference operations by almost O(1) time complexity, thereby decreasing the number of calculations. It simplifies the calculation of the sum of pixels—no matter how large the number of pixels—to an operation involving just four pixels. Nice, isn't it? It makes things superfast. However, among all of these features we calculated, most of them are irrelevant. For example, consider the following image. The top row shows two good features. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks. The second feature selected relies on the property that the eyes are darker than the bridge of the nose. But the same windows applying on cheeks or any other part is irrelevant. So how do we select the best features out of 160000+ features? It is achieved by AdaBoost. To do this, we apply each and every feature on all the training images. For each feature, it finds the best threshold that will classify the faces as positive and negative. Obviously, there will be errors or misclassifications. We select the features with the minimum error rate, which means they are the features that best classify the face and non-face images. Note: The process is not as simple as this. Each image is given an equal weight in the       beginning. After each classification, the weights of misclassified images are increased. Again, the same process is done. New error rates are calculated among the new weights. This process continues until the required accuracy or error rate is achieved or the required number of features is found. The final classifier is a weighted sum of these weak classifiers. It is called weak because it alone can't classify the image, but together with others, it forms a strong classifier. The paper says that even 200 features provide detection with 95% accuracy. Their final setup had around 6,000 features. (Imagine a reduction from 160,000+ to 6000 features. That is a big gain.) Face detection framework using the Haar cascade and AdaBoost algorithm So now, you take an image take each 24x24 window, apply 6,000 features to it, and check if it is a face or not. Wow! Wow! Isn't this a little inefficient and time consuming? Yes, it is. The authors of the algorithm have a good solution for that. In an image, most of the image region is non-face. So it is a better idea to have a simple method to verify that a window is not a face region. If it is not, discard it in a single shot. Don’t process it again. Instead, focus on the region where there can be a face. This way, we can find more time to check a possible face region. For this, they introduced the concept of a cascade of classifiers. 
Instead of applying all the 6,000 features to a window, we group the features into different stages of classifiers and apply one by one (normally first few stages will contain very few features). If a window fails in the first stage, discard it. We don’t consider the remaining features in it. If it passes, apply the second stage of features and continue the process. The window that passes all stages is a face region. How cool is the plan!!! The authors' detector had 6,000+ features with 38 stages, with 1, 10, 25, 25, and 50 features in the first five stages (two features in the preceding image were actually obtained as the best two features from AdaBoost). According to the authors, on average, 10 features out of 6,000+ are evaluated per subwindow. So this is a simple, intuitive explanation of how Viola-Jones face detection works. Read the paper for more details. If you found this post useful, do check out the book Ensemble Machine Learning to learn different machine learning aspects such as bagging, boosting, and stacking.    
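The excerpt explains the Viola-Jones pipeline but stops before showing any code. As a hedged illustration of a trained Haar cascade in action -- this is a sketch of mine rather than the book's code; it assumes opencv-python is installed, uses its bundled haarcascade_frontalface_default.xml, and expects an image file named photo.jpg to exist:

import cv2

# Load OpenCV's pretrained frontal-face Haar cascade (a staged, AdaBoost-trained
# classifier of exactly the kind described above).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

# Haar features work on intensity values, so convert the image to grayscale.
img = cv2.imread("photo.jpg")                  # hypothetical input file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Scan windows at multiple scales; a window is reported as a face only if it
# passes every stage of the cascade.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a rectangle around each detection and save the result.
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces_detected.jpg", img)
print("Detected %d face(s)" % len(faces))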

How to handle categorical data for machine learning algorithms

Packt Editorial Staff
20 Sep 2019
9 min read
The quality of data and the amount of useful information are key factors that determine how well a machine learning algorithm can learn. Therefore, it is absolutely critical that we make sure to encode categorical variables correctly, before we feed data into a machine learning algorithm. In this article, with simple yet effective examples we will explain how to deal with categorical data in computing machine learning algorithms and how we to map ordinal and nominal feature values to integer representations. The article is an excerpt from the book Python Machine Learning - Third Edition by Sebastian Raschka and Vahid Mirjalili. This book is a comprehensive guide to machine learning and deep learning with Python. It acts as both a clear step-by-step tutorial, and a reference you’ll keep coming back to as you build your machine learning systems.  It is not uncommon that real-world datasets contain one or more categorical feature columns. When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue. Categorical data encoding with pandas Before we explore different techniques to handle such categorical data, let's create a new DataFrame to illustrate the problem: >>> import pandas as pd >>> df = pd.DataFrame([ ...            ['green', 'M', 10.1, 'class1'], ...            ['red', 'L', 13.5, 'class2'], ...            ['blue', 'XL', 15.3, 'class1']]) >>> df.columns = ['color', 'size', 'price', 'classlabel'] >>> df color  size price  classlabel 0   green     M 10.1     class1 1     red   L 13.5      class2 2    blue   XL 15.3      class1 As we can see in the preceding output, the newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column. Mapping ordinal features To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between features, for example, XL = L + 1 = M + 2: >>> size_mapping = { ...                 'XL': 3, ...                 'L': 2, ...                 'M': 1} >>> df['size'] = df['size'].map(size_mapping) >>> df color  size price  classlabel 0   green     1 10.1     class1 1     red   2 13.5      class2 2    blue     3 15.3     class1 If we want to transform the integer values back to the original string representation at a later stage, we can simply define a reverse-mapping dictionary inv_size_mapping = {v: k for k, v in size_mapping.items()} that can then be used via the pandas map method on the transformed feature column, similar to the size_mapping dictionary that we used previously. 
We can use it as follows: >>> inv_size_mapping = {v: k for k, v in size_mapping.items()} >>> df['size'].map(inv_size_mapping) 0   M 1   L 2   XL Name: size, dtype: object Encoding class labels Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches. To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0: >>> import numpy as np >>> class_mapping = {label:idx for idx,label in ...                  enumerate(np.unique(df['classlabel']))} >>> class_mapping {'class1': 0, 'class2': 1} Next, we can use the mapping dictionary to transform the class labels into integers: >>> df['classlabel'] = df['classlabel'].map(class_mapping) >>> df     color  size price  classlabel 0   green     1 10.1         0 1     red   2 13.5           1 2    blue     3 15.3           0 We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels back to the original string representation: >>> inv_class_mapping = {v: k for k, v in class_mapping.items()} >>> df['classlabel'] = df['classlabel'].map(inv_class_mapping) >>> df     color  size price  classlabel 0   green     1 10.1     class1 1     red   2 13.5      class2 2    blue     3 15.3     class1 Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this: >>> from sklearn.preprocessing import LabelEncoder >>> class_le = LabelEncoder() >>> y = class_le.fit_transform(df['classlabel'].values) >>> y array([0, 1, 0]) Note that the fit_transform method is just a shortcut for calling fit and transform separately, and we can use the inverse_transform method to transform the integer class labels back into their original string representation: >>> class_le.inverse_transform(y) array(['class1', 'class2', 'class1'], dtype=object) Performing a technique ‘one-hot encoding’ on nominal features In the Mapping ordinal features section, we used a simple dictionary-mapping approach to convert the ordinal size feature into integers. Since scikit-learn's estimators for classification treat class labels as categorical data that does not imply any order (nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows: >>> X = df[['color', 'size', 'price']].values >>> color_le = LabelEncoder() >>> X[:, 0] = color_le.fit_transform(X[:, 0]) >>> X array([[1, 1, 10.1],        [2, 2, 13.5],        [0, 3, 15.3]], dtype=object) After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are encoded as follows: blue = 0 green = 1 red = 2 If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. 
Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal. A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of an example; for example, a blue example can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in scikit-learn's preprocessing module: >>> from sklearn.preprocessing import OneHotEncoder >>> X = df[['color', 'size', 'price']].values >>> color_ohe = OneHotEncoder() >>> color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()  array([[0., 1., 0.],            [0., 0., 1.],            [1., 0., 0.]]) Note that we applied the OneHotEncoder to a single column (X[:, 0].reshape(-1, 1))) only, to avoid modifying the other two columns in the array as well. If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer that accepts a list of (name, transformer, column(s)) tuples as follows: >>> from sklearn.compose import ColumnTransformer >>> X = df[['color', 'size', 'price']].values >>> c_transf = ColumnTransformer([ ...     ('onehot', OneHotEncoder(), [0]), ...     ('nothing', 'passthrough', [1, 2]) ... ]) >>> c_transf.fit_transform(X) .astype(float)     array([[0.0, 1.0, 0.0, 1, 10.1],            [0.0, 0.0, 1.0, 2, 13.5],            [1.0, 0.0, 0.0, 3, 15.3]]) In the preceding code example, we specified that we only want to modify the first column and leave the other two columns untouched via the 'passthrough' argument. An even more convenient way to create those dummy features via one-hot encoding is to use the get_dummies method implemented in pandas. Applied to a DataFrame, the get_dummies method will only convert string columns and leave all other columns unchanged: >>> pd.get_dummies(df[['price', 'color', 'size']])     price  size color_blue  color_green color_red 0    10.1     1   0 1          0 1    13.5     2   0 0          1 2    15.3     3   1 0          0 When we are using one-hot encoding datasets, we have to keep in mind that it introduces multicollinearity, which can be an issue for certain methods (for instance, methods that require matrix inversion). If features are highly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column, though; for example, if we remove the column color_blue, the feature information is still preserved since if we observe color_green=0 and color_red=0, it implies that the observation must be blue. If we use the get_dummies function, we can drop the first column by passing a True argument to the drop_first parameter, as shown in the following code example: >>> pd.get_dummies(df[['price', 'color', 'size']], ...                
drop_first=True)     price  size color_green  color_red 0    10.1     1     1 0 1    13.5     2     0 1 2    15.3     3     0 0 In order to drop a redundant column via the OneHotEncoder , we need to set drop='first' and set categories='auto' as follows: >>> color_ohe = OneHotEncoder(categories='auto', drop='first') >>> c_transf = ColumnTransformer([  ...            ('onehot', color_ohe, [0]), ...            ('nothing', 'passthrough', [1, 2]) ... ]) >>> c_transf.fit_transform(X).astype(float) array([[  1. , 0. ,  1. , 10.1],        [  0. ,  1. , 2. ,  13.5],        [  0. ,  0. , 3. ,  15.3]]) In this article, we have gone through some of the methods to deal with categorical data in datasets. We distinguished between nominal and ordinal features, and with examples we explained how they can be handled. To harness the power of the latest Python open source libraries in machine learning check out this book Python Machine Learning - Third Edition, written by Sebastian Raschka and Vahid Mirjalili. Other interesting read in data! The best business intelligence tools 2019: when to use them and how much they cost Introducing Microsoft’s AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data & Society report
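One further option worth noting alongside the mapping-dictionary and get_dummies recipes above (it is not covered in the excerpt, so treat it as a supplementary sketch): pandas can also represent an ordinal column with its ordered Categorical dtype, which stores the category order on the column itself:

import pandas as pd

df = pd.DataFrame({'size': ['M', 'L', 'XL']})

# Declare the column as an ordered categorical; the order lives on the dtype,
# so comparisons such as df['size'] > 'M' become meaningful.
df['size'] = pd.Categorical(df['size'],
                            categories=['M', 'L', 'XL'],
                            ordered=True)

# Integer codes follow the declared order (note they start at 0 here,
# unlike the size_mapping used earlier, which started at 1).
print(df['size'].cat.codes.tolist())   # [0, 1, 2]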

Building a Recommendation Engine with Spark

Packt
24 Feb 2016
44 min read
In this article, we will explore individual machine learning models in detail, starting with recommendation engines. (For more resources related to this topic, see here.) Recommendation engines are probably among the best types of machine learning model known to the general public. Even if people do not know exactly what a recommendation engine is, they have most likely experienced one through the use of popular websites such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. Recommendations are a core part of all these businesses, and in some cases, they drive significant percentages of their revenue. The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid in the discovery process (in this way, it is similar and, in fact, often complementary to search engines, which also play a role in discovery). However, unlike search engines, recommendation engines try to present people with relevant content that they did not necessarily search for or that they might not even have heard of. Typically, a recommendation engine tries to model the connections between users and some type of item. If we can do a good job of showing our users movies related to a given movie, we could aid in discovery and navigation on our site, again improving our users' experience, engagement, and the relevance of our content to them. However, recommendation engines are not limited to movies, books, or products. The techniques we will explore in this article can be applied to just about any user-to-item relationship as well as user-to-user connections, such as those found on social networks, allowing us to make recommendations such as people you may know or who to follow. Recommendation engines are most effective in two general scenarios (which are not mutually exclusive). They are explained here: Large number of available options for users: When there are a very large number of available items, it becomes increasingly difficult for the user to find something they want. Searching can help when the user knows what they are looking for, but often, the right item might be something previously unknown to them. In this case, being recommended relevant items, that the user may not already know about, can help them discover new items. A significant degree of personal taste involved: When personal taste plays a large role in selection, recommendation models, which often utilize a wisdom of the crowd approach, can be helpful in discovering items based on the behavior of others that have similar taste profiles. In this article, we will: Introduce the various types of recommendation engines Build a recommendation model using data about user preferences Use the trained model to compute recommendations for a given user as well compute similar items for a given item (that is, related items) Apply standard evaluation metrics to the model that we created to measure how well it performs in terms of predictive capability Types of recommendation models Recommender systems are widely studied, and there are many approaches used, but there are two that are probably most prevalent: content-based filtering and collaborative filtering. Recently, other approaches such as ranking models have also gained in popularity. In practice, many approaches are hybrids, incorporating elements of many different methods into a model or combination of models. 
Content-based filtering Content-based methods try to use the content or attributes of an item, together with some notion of similarity between two pieces of content, to generate items similar to a given item. These attributes are often textual content (such as titles, names, tags, and other metadata attached to an item), or in the case of media, they could include other features of the item, such as attributes extracted from audio and video content. In a similar manner, user recommendations can be generated based on attributes of users or user profiles, which are then matched to item attributes using the same measure of similarity. For example, a user can be represented by the combined attributes of the items they have interacted with. This becomes their user profile, which is then compared to item attributes to find items that match the user profile. Collaborative filtering Collaborative filtering is a form of wisdom of the crowd approach where the set of preferences of many users with respect to items is used to generate estimated preferences of users for items with which they have not yet interacted. The idea behind this is the notion of similarity. In a user-based approach, if two users have exhibited similar preferences (that is, patterns of interacting with the same items in broadly the same way), then we would assume that they are similar to each other in terms of taste. To generate recommendations for unknown items for a given user, we can use the known preferences of other users that exhibit similar behavior. We can do this by selecting a set of similar users and computing some form of combined score based on the items they have shown a preference for. The overall logic is that if others have tastes similar to a set of items, these items would tend to be good candidates for recommendation. We can also take an item-based approach that computes some measure of similarity between items. This is usually based on the existing user-item preferences or ratings. Items that tend to be rated the same by similar users will be classed as similar under this approach. Once we have these similarities, we can represent a user in terms of the items they have interacted with and find items that are similar to these known items, which we can then recommend to the user. Again, a set of items similar to the known items is used to generate a combined score to estimate for an unknown item. The user- and item-based approaches are usually referred to as nearest-neighbor models, since the estimated scores are computed based on the set of most similar users or items (that is, their neighbors). Finally, there are many model-based methods that attempt to model the user-item preferences themselves so that new preferences can be estimated directly by applying the model to unknown user-item combinations. Matrix factorization Since Spark's recommendation models currently only include an implementation of matrix factorization, we will focus our attention on this class of models. This focus is with good reason; however, these types of models have consistently been shown to perform extremely well in collaborative filtering and were among the best models in well-known competitions such as the Netflix prize. For more information on and a brief overview of the performance of the best algorithms for the Netflix prize, see http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html. 
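Before moving on to factorization, it helps to make the nearest-neighbour "combined score" mentioned above concrete. One common (though by no means the only) choice for user-based collaborative filtering is a similarity-weighted average over the neighbours N(u) of user u who have rated item i:

\hat{r}_{ui} = \frac{\sum_{v \in N(u)} s(u, v)\, r_{vi}}{\sum_{v \in N(u)} \lvert s(u, v) \rvert}

where s(u, v) is whatever user-user similarity measure has been chosen; the item-based variant applies the same idea to item-item similarities.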
Explicit matrix factorization When we deal with data that consists of preferences of users that are provided by the users themselves, we refer to explicit preference data. This includes, for example, ratings, thumbs up, likes, and so on that are given by users to items. We can take these ratings and form a two-dimensional matrix with users as rows and items as columns. Each entry represents a rating given by a user to a certain item. Since in most cases, each user has only interacted with a relatively small set of items, this matrix has only a few non-zero entries (that is, it is very sparse). As a simple example, let's assume that we have the following user ratings for a set of movies: Tom, Star Wars, 5 Jane, Titanic, 4 Bill, Batman, 3 Jane, Star Wars, 2 Bill, Titanic, 3 We will form the following ratings matrix: A simple movie-rating matrix Matrix factorization (or matrix completion) attempts to directly model this user-item matrix by representing it as a product of two smaller matrices of lower dimension. Thus, it is a dimensionality-reduction technique. If we have U users and I items, then our user-item matrix is of dimension U x I and might look something like the one shown in the following diagram: A sparse ratings matrix If we want to find a lower dimension (low-rank) approximation to our user-item matrix with the dimension k, we would end up with two matrices: one for users of size U x k and one for items of size I x k. These are known as factor matrices. If we multiply these two factor matrices, we would reconstruct an approximate version of the original ratings matrix. Note that while the original ratings matrix is typically very sparse, each factor matrix is dense, as shown in the following diagram: The user- and item-factor matrices These models are often also called latent feature models, as we are trying to discover some form of hidden features (which are represented by the factor matrices) that account for the structure of behavior inherent in the user-item rating matrix. While the latent features or factors are not directly interpretable, they might, perhaps, represent things such as the tendency of a user to like movies from a certain director, genre, style, or group of actors, for example. As we are directly modeling the user-item matrix, the prediction in these models is relatively straightforward: to compute a predicted rating for a user and item, we compute the vector dot product between the relevant row of the user-factor matrix (that is, the user's factor vector) and the relevant row of the item-factor matrix (that is, the item's factor vector). This is illustrated with the highlighted vectors in the following diagram: Computing recommendations from user- and item-factor vectors To find out the similarity between two items, we can use the same measures of similarity as we would use in the nearest-neighbor models, except that we can use the factor vectors directly by computing the similarity between two item-factor vectors, as illustrated in the following diagram: Computing similarity with item-factor vectors The benefit of factorization models is the relative ease of computing recommendations once the model is created. However, for very large user and itemsets, this can become a challenge as it requires storage and computation across potentially many millions of user- and item-factor vectors. Another advantage, as mentioned earlier, is that they tend to offer very good performance. 
Projects such as Oryx (https://github.com/OryxProject/oryx) and Prediction.io (https://github.com/PredictionIO/PredictionIO) focus on model serving for large-scale models, including recommenders based on matrix factorization. On the down side, factorization models are relatively more complex to understand and interpret compared to nearest-neighbor models and are often more computationally intensive during the model's training phase. Implicit matrix factorization So far, we have dealt with explicit preferences such as ratings. However, much of the preference data that we might be able to collect is implicit feedback, where the preferences between a user and item are not given to us, but are, instead, implied from the interactions they might have with an item. Examples include binary data (such as whether a user viewed a movie, whether they purchased a product, and so on) as well as count data (such as the number of times a user watched a movie). There are many different approaches to deal with implicit data. MLlib implements a particular approach that treats the input rating matrix as two matrices: a binary preference matrix, P, and a matrix of confidence weights, C. For example, let's assume that the user-movie ratings we saw previously were, in fact, the number of times each user had viewed that movie. The two matrices would look something like ones shown in the following screenshot. Here, the matrix P informs us that a movie was viewed by a user, and the matrix C represents the confidence weighting, in the form of the view counts—generally, the more a user has watched a movie, the higher the confidence that they actually like it. Representation of an implicit preference and confidence matrix The implicit model still creates a user- and item-factor matrix. In this case, however, the matrix that the model is attempting to approximate is not the overall ratings matrix but the preference matrix P. If we compute a recommendation by calculating the dot product of a user- and item-factor vector, the score will not be an estimate of a rating directly. It will rather be an estimate of the preference of a user for an item (though not strictly between 0 and 1, these scores will generally be fairly close to a scale of 0 to 1). Alternating least squares Alternating Least Squares (ALS) is an optimization technique to solve matrix factorization problems; this technique is powerful, achieves good performance, and has proven to be relatively easy to implement in a parallel fashion. Hence, it is well suited for platforms such as Spark. At the time of writing this, it is the only recommendation model implemented in MLlib. ALS works by iteratively solving a series of least squares regression problems. In each iteration, one of the user- or item-factor matrices is treated as fixed, while the other one is updated using the fixed factor and the rating data. Then, the factor matrix that was solved for is, in turn, treated as fixed, while the other one is updated. This process continues until the model has converged (or for a fixed number of iterations). Spark's documentation for collaborative filtering contains references to the papers that underlie the ALS algorithms implemented each component of explicit and implicit data. You can view the documentation at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html. 
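To make the alternation concrete, here is the usual explicit-ratings objective that ALS minimises (a standard textbook formulation, offered as a sketch rather than a quote from Spark's documentation), where K is the set of observed (user, item) ratings, x_u and y_i are the user- and item-factor vectors, and λ is the regularization parameter:

\min_{X, Y} \sum_{(u,i) \in K} \bigl( r_{ui} - \mathbf{x}_u^{\top} \mathbf{y}_i \bigr)^2 + \lambda \Bigl( \sum_{u} \lVert \mathbf{x}_u \rVert^2 + \sum_{i} \lVert \mathbf{y}_i \rVert^2 \Bigr)

With Y held fixed, the objective is quadratic in each x_u and can be solved in closed form as a least squares problem; the roles are then swapped, which is exactly the alternation described above.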
Extracting the right features from your data In this section, we will use explicit rating data, without additional user or item metadata or other information related to the user-item interactions. Hence, the features that we need as inputs are simply the user IDs, movie IDs, and the ratings assigned to each user and movie pair. Extracting features from the MovieLens 100k dataset Start the Spark shell in the Spark base directory, ensuring that you provide enough memory via the –driver-memory option: >./bin/spark-shell –driver-memory 4g In this example, we will use the same MovieLens dataset. Use the directory in which you placed the MovieLens 100k dataset as the input path in the following code. First, let's inspect the raw ratings dataset: val rawData = sc.textFile("/PATH/ml-100k/u.data") rawData.first() You will see output similar to these lines of code: 14/03/30 11:42:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/03/30 11:42:41 WARN LoadSnappy: Snappy native library not loaded 14/03/30 11:42:41 INFO FileInputFormat: Total input paths to process : 1 14/03/30 11:42:41 INFO SparkContext: Starting job: first at <console>:15 14/03/30 11:42:41 INFO DAGScheduler: Got job 0 (first at <console>:15) with 1 output partitions (allowLocal=true) 14/03/30 11:42:41 INFO DAGScheduler: Final stage: Stage 0 (first at <console>:15) 14/03/30 11:42:41 INFO DAGScheduler: Parents of final stage: List() 14/03/30 11:42:41 INFO DAGScheduler: Missing parents: List() 14/03/30 11:42:41 INFO DAGScheduler: Computing the requested partition locally 14/03/30 11:42:41 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 11:42:41 INFO SparkContext: Job finished: first at <console>:15, took 0.030533 s res0: String = 196  242  3  881250949 Recall that this dataset consisted of the user id, movie id, rating, timestamp fields separated by a tab ("t") character. We don't need the time when the rating was made to train our model, so let's simply extract the first three fields: val rawRatings = rawData.map(_.split("t").take(3)) We will first split each record on the "t" character, which gives us an Array[String] array. We will then use Scala's take function to keep only the first 3 elements of the array, which correspond to user id, movie id, and rating, respectively. We can inspect the first record of our new RDD by calling rawRatings.first(), which collects just the first record of the RDD back to the driver program. This will result in the following output: 14/03/30 12:24:00 INFO SparkContext: Starting job: first at <console>:21 14/03/30 12:24:00 INFO DAGScheduler: Got job 1 (first at <console>:21) with 1 output partitions (allowLocal=true) 14/03/30 12:24:00 INFO DAGScheduler: Final stage: Stage 1 (first at <console>:21) 14/03/30 12:24:00 INFO DAGScheduler: Parents of final stage: List() 14/03/30 12:24:00 INFO DAGScheduler: Missing parents: List() 14/03/30 12:24:00 INFO DAGScheduler: Computing the requested partition locally 14/03/30 12:24:00 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 12:24:00 INFO SparkContext: Job finished: first at <console>:21, took 0.00391 s res6: Array[String] = Array(196, 242, 3) We will use Spark's MLlib library to train our model. Let's take a look at what methods are available for us to use and what input is required. 
First, import the ALS model from MLlib: import org.apache.spark.mllib.recommendation.ALS On the console, we can inspect the available methods on the ALS object using tab completion. Type in ALS. (note the dot) and then press the Tab key. You should see the autocompletion of the methods: ALS. asInstanceOf    isInstanceOf    main            toString        train           trainImplicit The method we want to use is train. If we type ALS.train and hit Enter, we will get an error. However, this error will tell us what the method signature looks like: ALS.train <console>:12: error: ambiguous reference to overloaded definition, both method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int)org.apache.spark.mllib.recommendation.MatrixFactorizationModel and  method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int, lambda: Double)org.apache.spark.mllib.recommendation.MatrixFactorizationModel match expected type ?               ALS.train                   ^ So, we can see that at a minimum, we need to provide the input arguments, ratings, rank, and iterations. The second method also requires an argument called lambda. We'll cover these three shortly, but let's take a look at the ratings argument. First, let's import the Rating class that it references and use a similar approach to find out what an instance of Rating requires, by typing in Rating() and hitting Enter: import org.apache.spark.mllib.recommendation.Rating Rating() <console>:13: error: not enough arguments for method apply: (user: Int, product: Int, rating: Double)org.apache.spark.mllib.recommendation.Rating in object Rating. Unspecified value parameters user, product, rating.               Rating()                     ^ As we can see from the preceding output, we need to provide the ALS model with an RDD that consists of Rating records. A Rating class, in turn, is just a wrapper around user id, movie id (called product here), and the actual rating arguments. We'll create our rating dataset using the map method and transforming the array of IDs and ratings into a Rating object: val ratings = rawRatings.map { case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) } Notice that we need to use toInt or toDouble to convert the raw rating data (which was extracted as Strings from the text file) to Int or Double numeric inputs. Also, note the use of a case statement that allows us to extract the relevant variable names and use them directly (this saves us from having to use something like val user = ratings(0)). For more on Scala case statements and pattern matching as used here, take a look at http://docs.scala-lang.org/tutorials/tour/pattern-matching.html. 
We now have an RDD[Rating] that we can verify by calling: ratings.first() 14/03/30 12:32:48 INFO SparkContext: Starting job: first at <console>:24 14/03/30 12:32:48 INFO DAGScheduler: Got job 2 (first at <console>:24) with 1 output partitions (allowLocal=true) 14/03/30 12:32:48 INFO DAGScheduler: Final stage: Stage 2 (first at <console>:24) 14/03/30 12:32:48 INFO DAGScheduler: Parents of final stage: List() 14/03/30 12:32:48 INFO DAGScheduler: Missing parents: List() 14/03/30 12:32:48 INFO DAGScheduler: Computing the requested partition locally 14/03/30 12:32:48 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 12:32:48 INFO SparkContext: Job finished: first at <console>:24, took 0.003752 s res8: org.apache.spark.mllib.recommendation.Rating = Rating(196,242,3.0) Training the recommendation model Once we have extracted these simple features from our raw data, we are ready to proceed with model training; MLlib takes care of this for us. All we have to do is provide the correctly-parsed input RDD we just created as well as our chosen model parameters. Training a model on the MovieLens 100k dataset We're now ready to train our model! The other inputs required for our model are as follows: rank: This refers to the number of factors in our ALS model, that is, the number of hidden features in our low-rank approximation matrices. Generally, the greater the number of factors, the better, but this has a direct impact on memory usage, both for computation and to store models for serving, particularly for large number of users or items. Hence, this is often a trade-off in real-world use cases. A rank in the range of 10 to 200 is usually reasonable. iterations: This refers to the number of iterations to run. While each iteration in ALS is guaranteed to decrease the reconstruction error of the ratings matrix, ALS models will converge to a reasonably good solution after relatively few iterations. So, we don't need to run for too many iterations in most cases (around 10 is often a good default). lambda: This parameter controls the regularization of our model. Thus, lambda controls over fitting. The higher the value of lambda, the more is the regularization applied. What constitutes a sensible value is very dependent on the size, nature, and sparsity of the underlying data, and as with almost all machine learning models, the regularization parameter is something that should be tuned using out-of-sample test data and cross-validation approaches. We'll use rank of 50, 10 iterations, and a lambda parameter of 0.01 to illustrate how to train our model: val model = ALS.train(ratings, 50, 10, 0.01) This returns a MatrixFactorizationModel object, which contains the user and item factors in the form of an RDD of (id, factor) pairs. These are called userFeatures and productFeatures, respectively. For example: model.userFeatures res14: org.apache.spark.rdd.RDD[(Int, Array[Double])] = FlatMappedRDD[659] at flatMap at ALS.scala:231 We can see that the factors are in the form of an Array[Double]. Note that the operations used in MLlib's ALS implementation are lazy transformations, so the actual computation will only be performed once we call some sort of action on the resulting RDDs of the user and item factors. 
We can force the computation using a Spark action such as count: model.userFeatures.count This will trigger the computation, and we will see a quite a bit of output text similar to the following lines of code: 14/03/30 13:10:40 INFO SparkContext: Starting job: count at <console>:26 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 665 (map at ALS.scala:147) 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 664 (map at ALS.scala:146) 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 674 (mapPartitionsWithIndex at ALS.scala:164) ... 14/03/30 13:10:45 INFO SparkContext: Job finished: count at <console>:26, took 5.068255 s res16: Long = 943 If we call count for the movie factors, we will see the following output: model.productFeatures.count 14/03/30 13:15:21 INFO SparkContext: Starting job: count at <console>:26 14/03/30 13:15:21 INFO DAGScheduler: Got job 10 (count at <console>:26) with 1 output partitions (allowLocal=false) 14/03/30 13:15:21 INFO DAGScheduler: Final stage: Stage 165 (count at <console>:26) 14/03/30 13:15:21 INFO DAGScheduler: Parents of final stage: List(Stage 169, Stage 166) 14/03/30 13:15:21 INFO DAGScheduler: Missing parents: List() 14/03/30 13:15:21 INFO DAGScheduler: Submitting Stage 165 (FlatMappedRDD[883] at flatMap at ALS.scala:231), which has no missing parents 14/03/30 13:15:21 INFO DAGScheduler: Submitting 1 missing tasks from Stage 165 (FlatMappedRDD[883] at flatMap at ALS.scala:231) ... 14/03/30 13:15:21 INFO SparkContext: Job finished: count at <console>:26, took 0.030044 s res21: Long = 1682 As expected, we have a factor array for each user (943 factors) and movie (1682 factors). Training a model using implicit feedback data The standard matrix factorization approach in MLlib deals with explicit ratings. To work with implicit data, you can use the trainImplicit method. It is called in a manner similar to the standard train method. There is an additional parameter, alpha, that can be set (and in the same way, the regularization parameter, lambda, should be selected via testing and cross-validation methods). The alpha parameter controls the baseline level of confidence weighting applied. A higher level of alpha tends to make the model more confident about the fact that missing data equates to no preference for the relevant user-item pair. As an exercise, try to take the existing MovieLens dataset and convert it into an implicit dataset. One possible approach is to convert it to binary feedback (0s and 1s) by applying a threshold on the ratings at some level. Another approach could be to convert the ratings' values into confidence weights (for example, perhaps, low ratings could imply zero weights, or even negative weights, which are supported by MLlib's implementation). Train a model on this dataset and compare the results of the following section with those generated by your implicit model. Using the recommendation model Now that we have our trained model, we're ready to use it to make predictions. These predictions typically take one of two forms: recommendations for a given user and related or similar items for a given item. User recommendations In this case, we would like to generate recommended items for a given user. This usually takes the form of a top-K list, that is, the K items that our model predicts will have the highest probability of the user liking them. This is done by computing the predicted score for each item and ranking the list based on this score. The exact method to perform this computation depends on the model involved. 
For example, in user-based approaches, the ratings of similar users on items are used to compute the recommendations for a user, while in an item-based approach, the computation is based on the similarity of items the user has rated to the candidate items. In matrix factorization, because we are modeling the ratings matrix directly, the predicted score can be computed as the vector dot product between a user-factor vector and an item-factor vector. Generating movie recommendations from the MovieLens 100k dataset As MLlib's recommendation model is based on matrix factorization, we can use the factor matrices computed by our model to compute predicted scores (or ratings) for a user. We will focus on the explicit rating case using MovieLens data; however, the approach is the same when using the implicit model. The MatrixFactorizationModel class has a convenient predict method that will compute a predicted score for a given user and item combination: val predictedRating = model.predict(789, 123) 14/03/30 16:10:10 INFO SparkContext: Starting job: lookup at MatrixFactorizationModel.scala:45 14/03/30 16:10:10 INFO DAGScheduler: Got job 30 (lookup at MatrixFactorizationModel.scala:45) with 1 output partitions (allowLocal=false) ... 14/03/30 16:10:10 INFO SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.023077 s predictedRating: Double = 3.128545693368485 As we can see, this model predicts a rating of 3.12 for user 789 and movie 123. Note that you might see different results than those shown in this section because the ALS model is initialized randomly. So, different runs of the model will lead to different solutions.  The predict method can also take an RDD of (user, item) IDs as the input and will generate predictions for each of these. We can use this method to make predictions for many users and items at the same time. To generate the top-K recommended items for a user, MatrixFactorizationModel provides a convenience method called recommendProducts. This takes two arguments: user and num, where user is the user ID, and num is the number of items to recommend. It returns the top num items ranked in the order of the predicted score. Here, the scores are computed as the dot product between the user-factor vector and each item-factor vector. Let's generate the top 10 recommended items for user 789: val userId = 789 val K = 10 val topKRecs = model.recommendProducts(userId, K) We now have a set of predicted ratings for each movie for user 789. If we print this out, we could inspect the top 10 recommendations for this user: println(topKRecs.mkString("n")) You should see the following output on your console: Rating(789,715,5.931851273771102) Rating(789,12,5.582301095666215) Rating(789,959,5.516272981542168) Rating(789,42,5.458065302395629) Rating(789,584,5.449949837103569) Rating(789,750,5.348768847643657) Rating(789,663,5.30832117499004) Rating(789,134,5.278933936827717) Rating(789,156,5.250959077906759) Rating(789,432,5.169863417126231) Inspecting the recommendations We can give these recommendations a sense check by taking a quick look at the titles of the movies a user has rated and the recommended movies. First, we need to load the movie data. 
We'll collect this data as a Map[Int, String] that maps the movie ID to the title:

val movies = sc.textFile("/PATH/ml-100k/u.item")
val titles = movies.map(line => line.split("\\|").take(2)).map(array => (array(0).toInt, array(1))).collectAsMap()
titles(123)

res68: String = Frighteners, The (1996)

For our user 789, we can find out what movies they have rated, take the 10 movies with the highest rating, and then check the titles. We will do this now by first using the keyBy Spark function to create an RDD of key-value pairs from our ratings RDD, where the key will be the user ID. We will then use the lookup function to return just the ratings for this key (that is, that particular user ID) to the driver:

val moviesForUser = ratings.keyBy(_.user).lookup(789)

Let's see how many movies this user has rated. This will be the size of the moviesForUser collection:

println(moviesForUser.size)

We will see that this user has rated 33 movies.

Next, we will take the 10 movies with the highest ratings by sorting the moviesForUser collection using the rating field of the Rating object. We will then extract the movie title for the relevant product ID attached to the Rating class from our mapping of movie titles and print out the top 10 titles with their ratings:

moviesForUser.sortBy(-_.rating).take(10).map(rating => (titles(rating.product), rating.rating)).foreach(println)

You will see the following output displayed:

(Godfather, The (1972),5.0)
(Trainspotting (1996),5.0)
(Dead Man Walking (1995),5.0)
(Star Wars (1977),5.0)
(Swingers (1996),5.0)
(Leaving Las Vegas (1995),5.0)
(Bound (1996),5.0)
(Fargo (1996),5.0)
(Last Supper, The (1995),5.0)
(Private Parts (1997),4.0)

Now, let's take a look at the top 10 recommendations for this user and see what the titles are, using the same approach as the one we used earlier (note that the recommendations are already sorted):

topKRecs.map(rating => (titles(rating.product), rating.rating)).foreach(println)

(To Die For (1995),5.931851273771102)
(Usual Suspects, The (1995),5.582301095666215)
(Dazed and Confused (1993),5.516272981542168)
(Clerks (1994),5.458065302395629)
(Secret Garden, The (1993),5.449949837103569)
(Amistad (1997),5.348768847643657)
(Being There (1979),5.30832117499004)
(Citizen Kane (1941),5.278933936827717)
(Reservoir Dogs (1992),5.250959077906759)
(Fantasia (1940),5.169863417126231)

We leave it to you to decide whether these recommendations make sense.

Item recommendations

Item recommendations are about answering the following question: for a certain item, what are the items most similar to it? Here, the precise definition of similarity is dependent on the model involved. In most cases, similarity is computed by comparing the vector representation of two items using some similarity measure. Common similarity measures include Pearson correlation and cosine similarity for real-valued vectors, and Jaccard similarity for binary vectors.

Generating similar movies for the MovieLens 100K dataset

The current MatrixFactorizationModel API does not directly support item-to-item similarity computations. Therefore, we will need to write our own code to do this. We will use the cosine similarity metric, and we will use the jblas linear algebra library (a dependency of MLlib) to compute the required vector dot products. This is similar to how the existing predict and recommendProducts methods work, except that we will use cosine similarity as opposed to just the dot product.
We would like to compare the factor vector of our chosen item with each of the other items, using our similarity metric. In order to perform linear algebra computations, we will first need to create a vector object out of the factor vectors, which are in the form of an Array[Double]. The jblas class DoubleMatrix takes an Array[Double] as the constructor argument, as follows:

import org.jblas.DoubleMatrix
val aMatrix = new DoubleMatrix(Array(1.0, 2.0, 3.0))

aMatrix: org.jblas.DoubleMatrix = [1.000000; 2.000000; 3.000000]

Note that using jblas, vectors are represented as a one-dimensional DoubleMatrix class, while matrices are a two-dimensional DoubleMatrix class.

We will need a method to compute the cosine similarity between two vectors. Cosine similarity is a measure of the angle between two vectors in an n-dimensional space. It is computed by first calculating the dot product between the vectors and then dividing the result by a denominator, which is the norm (or length) of each vector multiplied together (specifically, the L2-norm is used in cosine similarity). In this way, cosine similarity is a normalized dot product.

The cosine similarity measure takes on values between -1 and 1. A value of 1 implies complete similarity, while a value of 0 implies independence (that is, no similarity). This measure is useful because it also captures negative similarity, that is, a value of -1 implies that not only are the vectors not similar, but they are also completely dissimilar.

Let's create our cosineSimilarity function here:

def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {
  vec1.dot(vec2) / (vec1.norm2() * vec2.norm2())
}

Note that we defined a return type of Double for this function. We are not required to do this, since Scala features type inference. However, it can often be useful to document return types for Scala functions.

Let's try it out on one of our item factors for item 567. We will need to collect an item factor from our model; we will do this using the lookup method, in a similar way to how we collected the ratings for a specific user earlier. In the following lines of code, we also use the head function, since lookup returns an array of values and we only need the first value (in fact, there will only be one value, which is the factor vector for this item). Since this will be an Array[Double], we will then need to create a DoubleMatrix object from it and compute the cosine similarity with itself:

val itemId = 567
val itemFactor = model.productFeatures.lookup(itemId).head
val itemVector = new DoubleMatrix(itemFactor)
cosineSimilarity(itemVector, itemVector)

A similarity metric should measure how close, in some sense, two vectors are to each other.
Here, we can see that our cosine similarity metric tells us that this item vector is identical to itself, which is what we would expect:

res113: Double = 1.0

Now, we are ready to apply our similarity metric to each item:

val sims = model.productFeatures.map{ case (id, factor) =>
  val factorVector = new DoubleMatrix(factor)
  val sim = cosineSimilarity(factorVector, itemVector)
  (id, sim)
}

Next, we can compute the 10 most similar items by sorting on the similarity score for each item:

// recall we defined K = 10 earlier
val sortedSims = sims.top(K)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity })

In the preceding code snippet, we used Spark's top function, which is an efficient way to compute top-K results in a distributed fashion, instead of using collect to return all the data to the driver and sorting it locally (remember that we could be dealing with millions of users and items in the case of recommendation models).

We need to tell Spark how to sort the (item id, similarity score) pairs in the sims RDD. To do this, we will pass an extra argument to top, which is a Scala Ordering object that tells Spark that it should sort by the value in the key-value pair (that is, sort by similarity).

Finally, we can print the 10 items with the highest computed similarity metric to our given item:

println(sortedSims.take(10).mkString("\n"))

You will see output like the following:

(567,1.0000000000000002)
(1471,0.6932331537649621)
(670,0.6898690594544726)
(201,0.6897964975027041)
(343,0.6891221044611473)
(563,0.6864214133620066)
(294,0.6812075443259535)
(413,0.6754663844488256)
(184,0.6702643811753909)
(109,0.6594872765176396)

Not surprisingly, we can see that the top-ranked similar item is our item itself. The rest are the other items in our set of items, ranked in order of our similarity metric.

Inspecting the similar items

Let's see what the title of our chosen movie is:

println(titles(itemId))

Wes Craven's New Nightmare (1994)

As we did for user recommendations, we can sense check our item-to-item similarity computations and take a look at the titles of the most similar movies. This time, we will take the top 11 so that we can exclude our given movie. So, we will take the numbers 1 to 11 in the list:

val sortedSims2 = sims.top(K + 1)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity })
sortedSims2.slice(1, 11).map{ case (id, sim) => (titles(id), sim) }.mkString("\n")

You will see the movie titles and scores displayed similar to this output:

(Hideaway (1995),0.6932331537649621)
(Body Snatchers (1993),0.6898690594544726)
(Evil Dead II (1987),0.6897964975027041)
(Alien: Resurrection (1997),0.6891221044611473)
(Stephen King's The Langoliers (1995),0.6864214133620066)
(Liar Liar (1997),0.6812075443259535)
(Tales from the Crypt Presents: Bordello of Blood (1996),0.6754663844488256)
(Army of Darkness (1993),0.6702643811753909)
(Mystery Science Theater 3000: The Movie (1996),0.6594872765176396)
(Scream (1996),0.6538249646863378)

Once again, note that you might see quite different results due to random model initialization.

Now that you have computed similar items using cosine similarity, see if you can do the same with the user-factor vectors to compute similar users for a given user.

Evaluating the performance of recommendation models

How do we know whether the model we have trained is a good model? We need to be able to evaluate its predictive performance in some way. Evaluation metrics are measures of a model's predictive capability or accuracy.
Some are direct measures of how well a model predicts its target variable (such as Mean Squared Error), while others are concerned with how well the model performs at predicting things that might not be directly optimized in the model but are often closer to what we care about in the real world (such as Mean Average Precision).

Evaluation metrics provide a standardized way of comparing the performance of the same model with different parameter settings and of comparing performance across different models. Using these metrics, we can perform model selection to choose the best-performing model from the set of models we wish to evaluate.

Here, we will show you how to calculate two common evaluation metrics used in recommender systems and collaborative filtering models: Mean Squared Error and Mean Average Precision at K.

Mean Squared Error

The Mean Squared Error (MSE) is a direct measure of the reconstruction error of the user-item rating matrix. It is also the objective function being minimized in certain models, specifically many matrix-factorization techniques, including ALS. As such, it is commonly used in explicit ratings settings. It is defined as the sum of the squared errors divided by the number of observations. The squared error, in turn, is the square of the difference between the predicted rating for a given user-item pair and the actual rating.

We will use our user 789 as an example. Let's take the first rating for this user from the moviesForUser set of Ratings that we previously computed:

val actualRating = moviesForUser.take(1)(0)

actualRating: org.apache.spark.mllib.recommendation.Rating = Rating(789,1012,4.0)

We will see that the rating for this user-item combination is 4. Next, we will compute the model's predicted rating:

val predictedRating = model.predict(789, actualRating.product)

...
14/04/13 13:01:15 INFO SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.025404 s
predictedRating: Double = 4.001005374200248

We will see that the predicted rating is about 4, very close to the actual rating. Finally, we will compute the squared error between the actual rating and the predicted rating:

val squaredError = math.pow(predictedRating - actualRating.rating, 2.0)

squaredError: Double = 1.010777282523947E-6

So, in order to compute the overall MSE for the dataset, we need to compute this squared error for each (user, movie, actual rating, predicted rating) entry, sum them up, and divide by the number of ratings. We will do this in the following code snippet. Note that the following code is adapted from the Apache Spark programming guide for ALS at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html.

First, we will extract the user and product IDs from the ratings RDD and make predictions for each user-item pair using model.predict. We will use the user-item pair as the key and the predicted rating as the value:

val usersProducts = ratings.map{ case Rating(user, product, rating) => (user, product)}
val predictions = model.predict(usersProducts).map{
    case Rating(user, product, rating) => ((user, product), rating)
}

Next, we extract the actual ratings and also map the ratings RDD so that the user-item pair is the key and the actual rating is the value.
Now that we have two RDDs with the same form of key, we can join them together to create a new RDD with the actual and predicted ratings for each user-item combination:

val ratingsAndPredictions = ratings.map{
  case Rating(user, product, rating) => ((user, product), rating)
}.join(predictions)

Finally, we will compute the MSE by summing up the squared errors using reduce and dividing by the number of records obtained from the count method:

val MSE = ratingsAndPredictions.map{
    case ((user, product), (actual, predicted)) => math.pow((actual - predicted), 2)
}.reduce(_ + _) / ratingsAndPredictions.count
println("Mean Squared Error = " + MSE)

Mean Squared Error = 0.08231947642632852

It is common to use the Root Mean Squared Error (RMSE), which is just the square root of the MSE metric. This is somewhat more interpretable, as it is in the same units as the underlying data (that is, the ratings in this case). It is equivalent to the standard deviation of the differences between the predicted and actual ratings. We can compute it simply as follows:

val RMSE = math.sqrt(MSE)
println("Root Mean Squared Error = " + RMSE)

Root Mean Squared Error = 0.2869137090247319

Mean Average Precision at K

Mean Average Precision at K (MAPK) is the mean of the Average Precision at K (APK) metric across all instances in the dataset. APK is a metric commonly used in information retrieval, where it measures the average relevance scores of a set of the top-K documents presented in response to a query. For each query instance, we compare the set of top-K results with the set of actual relevant documents (that is, a ground truth set of relevant documents for the query).

In the APK metric, the order of the result set matters: the APK score is higher when the result documents are both relevant and presented higher in the results. It is, thus, a good metric for recommender systems, in that we would typically compute the top-K recommended items for each user and present these to the user. Of course, we prefer models where the items with the highest predicted scores (which are presented at the top of the list of recommendations) are, in fact, the most relevant items for the user. APK and other ranking-based metrics are also more appropriate evaluation measures for implicit datasets; here, MSE makes less sense.

In order to evaluate our model, we can use APK, where each user is the equivalent of a query, and the set of top-K recommended items is the document result set. The relevant documents (that is, the ground truth) in this case are the set of items that a user interacted with. Hence, APK attempts to measure how good our model is at predicting items that a user will find relevant and choose to interact with.

The code for the following average precision computation is based on https://github.com/benhamner/Metrics. More information on MAPK can be found at https://www.kaggle.com/wiki/MeanAveragePrecision.
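To make the metric concrete before looking at the code, here is a small worked example with made-up item IDs (these IDs are purely illustrative and do not come from the dataset). Suppose the ground truth items for a user are {10, 20, 30} and the model's top-5 recommendations are [20, 99, 30, 98, 97], with K = 5. The relevant items appear at positions 1 and 3, so the accumulated precision is 1/1 + 2/3, which is roughly 1.67; dividing by min(3, 5) = 3 gives an APK of about 0.56. If the same two relevant items had instead appeared at positions 4 and 5, the accumulated precision would be 1/4 + 2/5 = 0.65 and the APK only about 0.22, which illustrates how the metric rewards placing relevant items near the top of the list.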
Our function to compute the APK is shown here:

def avgPrecisionK(actual: Seq[Int], predicted: Seq[Int], k: Int): Double = {
  val predK = predicted.take(k)
  var score = 0.0
  var numHits = 0.0
  for ((p, i) <- predK.zipWithIndex) {
    if (actual.contains(p)) {
      numHits += 1.0
      score += numHits / (i.toDouble + 1.0)
    }
  }
  if (actual.isEmpty) {
    1.0
  } else {
    score / scala.math.min(actual.size, k).toDouble
  }
}

As you can see, this takes as input a list of the actual item IDs that are associated with the user and another list of the item IDs our model predicts to be relevant for that user.

We can compute the APK metric for our example user 789 as follows. First, we will extract the actual movie IDs for the user:

val actualMovies = moviesForUser.map(_.product)

actualMovies: Seq[Int] = ArrayBuffer(1012, 127, 475, 93, 1161, 286, 293, 9, 50, 294, 181, 1, 1008, 508, 284, 1017, 137, 111, 742, 248, 249, 1007, 591, 150, 276, 151, 129, 100, 741, 288, 762, 628, 124)

We will then use the movie recommendations we made previously to compute the APK score using K = 10:

val predictedMovies = topKRecs.map(_.product)

predictedMovies: Array[Int] = Array(27, 497, 633, 827, 602, 849, 401, 584, 1035, 1014)

val apk10 = avgPrecisionK(actualMovies, predictedMovies, 10)

apk10: Double = 0.0

In this case, we can see that our model is not doing a very good job of predicting relevant movies for this user, as the APK score is 0.

In order to compute the APK for each user and average them to compute the overall MAPK, we will need to generate the list of recommendations for each user in our dataset. While this can be fairly intensive on a large scale, we can distribute the computation using our Spark functionality. However, one limitation is that each worker must have the full item-factor matrix available so that it can compute the dot product between the relevant user vector and all item vectors. This can be a problem when the number of items is extremely high, as the item matrix must fit in the memory of one machine.

There is actually no easy way around this limitation. One possible approach is to only compute recommendations for a subset of items from the total item set, using approximate techniques such as Locality Sensitive Hashing (http://en.wikipedia.org/wiki/Locality-sensitive_hashing).

We will now see how to go about this. First, we will collect the item factors and form a DoubleMatrix object from them:

val itemFactors = model.productFeatures.map { case (id, factor) => factor }.collect()
val itemMatrix = new DoubleMatrix(itemFactors)
println(itemMatrix.rows, itemMatrix.columns)

(1682,50)

This gives us a matrix with 1682 rows and 50 columns, as we would expect from 1682 movies with a factor dimension of 50.

Next, we will distribute the item matrix as a broadcast variable so that it is available on each worker node:

val imBroadcast = sc.broadcast(itemMatrix)

14/04/13 21:02:01 INFO MemoryStore: ensureFreeSpace(672960) called with curMem=4006896, maxMem=311387750
14/04/13 21:02:01 INFO MemoryStore: Block broadcast_21 stored as values to memory (estimated size 657.2 KB, free 292.5 MB)
imBroadcast: org.apache.spark.broadcast.Broadcast[org.jblas.DoubleMatrix] = Broadcast(21)

Now we are ready to compute the recommendations for each user. We will do this by applying a map function to each user factor, within which we will perform a matrix multiplication between the user-factor vector and the movie-factor matrix.
The result is a vector (of length 1682, that is, the number of movies we have) with the predicted rating for each movie. We will then sort these predictions by the predicted rating:

val allRecs = model.userFeatures.map{ case (userId, array) =>
  val userVector = new DoubleMatrix(array)
  val scores = imBroadcast.value.mmul(userVector)
  val sortedWithId = scores.data.zipWithIndex.sortBy(-_._1)
  val recommendedIds = sortedWithId.map(_._2 + 1).toSeq
  (userId, recommendedIds)
}

allRecs: org.apache.spark.rdd.RDD[(Int, Seq[Int])] = MappedRDD[269] at map at <console>:29

As we can see, we now have an RDD that contains a list of movie IDs for each user ID. These movie IDs are sorted in order of the estimated rating. Note that we needed to add 1 to the returned movie IDs (the _._2 + 1 in the preceding code snippet), as the item-factor matrix is 0-indexed, while our movie IDs start at 1.

We also need the list of movie IDs for each user to pass into our APK function as the actual argument. We already have the ratings RDD ready, so we can extract just the user and movie IDs from it. If we use Spark's groupBy operator, we will get an RDD that contains a list of (userid, movieid) pairs for each user ID (as the user ID is the key on which we perform the groupBy operation):

val userMovies = ratings.map{ case Rating(user, product, rating) => (user, product) }.groupBy(_._1)

userMovies: org.apache.spark.rdd.RDD[(Int, Seq[(Int, Int)])] = MapPartitionsRDD[277] at groupBy at <console>:21

Finally, we can use Spark's join operator to join these two RDDs together on the user ID key. Then, for each user, we have the list of actual and predicted movie IDs that we can pass to our APK function. In a manner similar to how we computed MSE, we will sum each of these APK scores using a reduce action and divide by the number of users (that is, the count of the allRecs RDD):

val K = 10
val MAPK = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>
  val actual = actualWithIds.map(_._2).toSeq
  avgPrecisionK(actual, predicted, K)
}.reduce(_ + _) / allRecs.count
println("Mean Average Precision at K = " + MAPK)

Mean Average Precision at K = 0.030486963254725705

Our model achieves a fairly low MAPK. However, note that typical values for recommendation tasks are usually relatively low, especially if the item set is extremely large.

Try out a few parameter settings for lambda and rank (and alpha, if you are using the implicit version of ALS) and see whether you can find a model that performs better based on the RMSE and MAPK evaluation metrics.

Using MLlib's built-in evaluation functions

While we have computed MSE, RMSE, and MAPK from scratch, and it is a useful learning exercise to do so, MLlib provides convenience functions to do this for us in the RegressionMetrics and RankingMetrics classes.

RMSE and MSE

First, we will compute the MSE and RMSE metrics using RegressionMetrics. We will instantiate a RegressionMetrics instance by passing in an RDD of key-value pairs that represent the predicted and true values for each data point, as shown in the following code snippet. Here, we will again use the ratingsAndPredictions RDD we computed in our earlier example (recall that its values hold the actual rating first and the predicted rating second):

import org.apache.spark.mllib.evaluation.RegressionMetrics
val predictedAndTrue = ratingsAndPredictions.map { case ((user, product), (actual, predicted)) => (predicted, actual) }
val regressionMetrics = new RegressionMetrics(predictedAndTrue)

We can then access various metrics, including MSE and RMSE.
We will print out these metrics here:

println("Mean Squared Error = " + regressionMetrics.meanSquaredError)
println("Root Mean Squared Error = " + regressionMetrics.rootMeanSquaredError)

You will see that the output for MSE and RMSE is exactly the same as the metrics we computed earlier:

Mean Squared Error = 0.08231947642632852
Root Mean Squared Error = 0.2869137090247319

MAP

As we did for MSE and RMSE, we can compute ranking-based evaluation metrics using MLlib's RankingMetrics class. Similarly to our own average precision function, we need to pass in an RDD of key-value pairs, where the key is an array of predicted item IDs for a user, while the value is an array of actual item IDs.

The implementation of the average precision at K function in RankingMetrics is slightly different from ours, so we will get different results. However, the computation of the overall mean average precision (MAP, which does not use a threshold at K) is the same as our function if we select K to be very high (say, at least as high as the number of items in our item set).

First, we will calculate MAP using RankingMetrics:

import org.apache.spark.mllib.evaluation.RankingMetrics
val predictedAndTrueForRanking = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>
  val actual = actualWithIds.map(_._2)
  (predicted.toArray, actual.toArray)
}
val rankingMetrics = new RankingMetrics(predictedAndTrueForRanking)
println("Mean Average Precision = " + rankingMetrics.meanAveragePrecision)

You will see the following output:

Mean Average Precision = 0.07171412913757183

Next, we will use our function to compute the MAP in exactly the same way as we did previously, except that we set K to a very high value, say 2000:

val MAPK2000 = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>
  val actual = actualWithIds.map(_._2).toSeq
  avgPrecisionK(actual, predicted, 2000)
}.reduce(_ + _) / allRecs.count
println("Mean Average Precision = " + MAPK2000)

You will see that the MAP from our own function is the same as the one computed using RankingMetrics:

Mean Average Precision = 0.07171412913757186

We will not cover cross-validation in this article. However, note that the same cross-validation techniques can be used to evaluate recommendation models, using performance metrics such as MSE, RMSE, and MAP, which we covered in this section.

Summary

In this article, we used Spark's MLlib library to train a collaborative filtering recommendation model, and you learned how to use this model to make predictions for the items that a given user might have a preference for. We also used our model to find items that are similar or related to a given item. Finally, we explored common metrics to evaluate the predictive capability of our recommendation model.

To learn more about Spark, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Fast Data Processing with Spark - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/fast-data-processing-spark-second-edition)
Spark Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook)

Resources for Article:

Further resources on this subject:

Reactive Programming And The Flux Architecture [article]
Spark - Architecture And First Program [article]
The Design Patterns Out There And Setting Up Your Environment [article]

Installing and Configuring X-pack on Elasticsearch and Kibana

Pravin Dhandre
20 Feb 2018
6 min read
This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book provides detailed coverage of the fundamentals of Elastic Stack, making it easy to search, analyze, and visualize data across different sources in real time.

In this short tutorial, we will show the step-by-step installation and configuration of the X-Pack components in Elastic Stack to extend the functionalities of Elasticsearch and Kibana. As X-Pack is an extension of Elastic Stack, prior to installing X-Pack, you need to have both Elasticsearch and Kibana installed. You must run the version of X-Pack that matches the version of Elasticsearch and Kibana.

Installing X-Pack on Elasticsearch

X-Pack is installed just like any plugin to extend Elasticsearch. These are the steps to install X-Pack in Elasticsearch:

1. Navigate to the ES_HOME folder.

2. Install X-Pack using the following command:

$ ES_HOME> bin/elasticsearch-plugin install x-pack

During installation, it will ask you to grant extra permissions to X-Pack, which are required by Watcher to send email alerts and also to enable Elasticsearch to launch the machine learning analytical engine. Specify y to continue the installation or N to abort the installation. You should get the following logs/prompts during installation:

-> Downloading x-pack from elastic
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.io.FilePermission \\.\pipe\* read,write
* java.lang.RuntimePermission accessClassInPackage.com.sun.activation.registries
* java.lang.RuntimePermission getClassLoader
* java.lang.RuntimePermission setContextClassLoader
* java.lang.RuntimePermission setFactory
* java.net.SocketPermission * connect,accept,resolve
* java.security.SecurityPermission createPolicy.JavaPolicy
* java.security.SecurityPermission getPolicy
* java.security.SecurityPermission putProviderProperty.BC
* java.security.SecurityPermission setPolicy
* java.util.PropertyPermission * read,write
* java.util.PropertyPermission sun.nio.ch.bugLevel write
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@        WARNING: plugin forks a native controller        @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
This plugin launches a native controller that is not subject to the Java
security manager nor to system call filters.

Continue with installation? [y/N]y
Elasticsearch keystore is required by plugin [x-pack], creating...
-> Installed x-pack

3. Restart Elasticsearch:

$ ES_HOME> bin/elasticsearch

4. Generate the passwords for the default/reserved users (elastic, kibana, and logstash_system) by executing this command:

$ ES_HOME> bin/x-pack/setup-passwords interactive

You should get the following logs/prompts to enter the password for the reserved/default users:

Initiating the setup of reserved user elastic,kibana,logstash_system passwords.
You will be prompted to enter passwords as the process progresses.
Please confirm that you would like to continue [y/N]y

Enter password for [elastic]: elastic
Reenter password for [elastic]: elastic
Enter password for [kibana]: kibana
Reenter password for [kibana]: kibana
Enter password for [logstash_system]: logstash
Reenter password for [logstash_system]: logstash
Changed password for user [kibana]
Changed password for user [logstash_system]
Changed password for user [elastic]

Please make a note of the passwords set for the reserved/default users. You can choose any password of your liking. We have chosen the passwords elastic, kibana, and logstash for the elastic, kibana, and logstash_system users, respectively, and we will be using them throughout this chapter.

To verify the X-Pack installation and the enforcement of security, point your web browser to http://localhost:9200/ to open Elasticsearch. You should be prompted to log in to Elasticsearch. To log in, you can use the built-in elastic user and the password elastic. Upon a successful login, you should see a response like the following:

{
  name: "fwDdHSI",
  cluster_name: "elasticsearch",
  cluster_uuid: "08wSPsjSQCmeRaxF4iHizw",
  version: {
    number: "6.0.0",
    build_hash: "8f0685b",
    build_date: "2017-11-10T18:41:22.859Z",
    build_snapshot: false,
    lucene_version: "7.0.1",
    minimum_wire_compatibility_version: "5.6.0",
    minimum_index_compatibility_version: "5.0.0"
  },
  tagline: "You Know, for Search"
}

A typical cluster in Elasticsearch is made up of multiple nodes, and X-Pack needs to be installed on each node belonging to the cluster. To skip the install prompt, use the --batch parameter during installation:

$ES_HOME> bin/elasticsearch-plugin install x-pack --batch

Your installation of X-Pack will have created folders named x-pack in the bin, config, and plugins directories found under ES_HOME. We shall explore these in later sections of the chapter.

Installing X-Pack on Kibana

X-Pack is installed just like any plugin to extend Kibana. The following are the steps to install X-Pack in Kibana:

1. Navigate to the KIBANA_HOME folder.

2. Install X-Pack using the following command:

$KIBANA_HOME> bin/kibana-plugin install x-pack

You should get the following logs/prompts during installation:

Attempting to transfer from x-pack
Attempting to transfer from https://artifacts.elastic.co/downloads/kibana-plugins/x-pack/x-pack-6.0.0.zip
Transferring 120307264 bytes....................
Transfer complete
Retrieving metadata from plugin archive
Extracting plugin archive
Extraction complete
Optimizing and caching browser bundles...
Plugin installation complete

3. Add the following credentials in the kibana.yml file found under $KIBANA_HOME/config and save it:

elasticsearch.username: "kibana"
elasticsearch.password: "kibana"

If you have chosen a different password for the kibana user during password setup, use that value for the elasticsearch.password property.

4. Start Kibana:

$KIBANA_HOME> bin/kibana

To verify the X-Pack installation, go to http://localhost:5601/ to open Kibana. You should be prompted to log in to Kibana. To log in, you can use the built-in elastic user and the password elastic.

Your installation of X-Pack will have created a folder named x-pack in the plugins folder found under KIBANA_HOME. You can also optionally install X-Pack on Logstash. However, X-Pack currently supports only monitoring of Logstash.

Uninstalling X-Pack

To uninstall X-Pack:

1. Stop Elasticsearch.

2. Remove X-Pack from Elasticsearch:

$ES_HOME> bin/elasticsearch-plugin remove x-pack

3. Restart Elasticsearch and stop Kibana.
4. Remove X-Pack from Kibana:

$KIBANA_HOME> bin/kibana-plugin remove x-pack

5. Restart Kibana.

Configuring X-Pack

X-Pack comes bundled with security, alerting, monitoring, reporting, machine learning, and graph capabilities. By default, all of these features are enabled. However, one might not be interested in all the features it provides. One can selectively enable and disable the features of interest from the elasticsearch.yml and kibana.yml configuration files.

Elasticsearch supports a set of feature flags and settings in the elasticsearch.yml file, and Kibana supports a corresponding set in the kibana.yml file (the tables listing these settings in the original article are not reproduced here; an illustrative sketch is given at the end of this article). If X-Pack is installed on Logstash, you can disable monitoring by setting the xpack.monitoring.enabled property to false in the logstash.yml configuration file.

With this, we have explored how to install and configure the X-Pack components in order to bundle the different capabilities of X-Pack into one package with Elasticsearch and Kibana. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, metrics, and more.
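As referenced in the Configuring X-Pack section, here is an illustrative sketch of the kind of per-feature flags that can be set. The exact list of supported settings depends on the X-Pack version, so treat these as examples rather than an exhaustive reference:

# elasticsearch.yml (example flags; each X-Pack feature can be toggled individually)
xpack.security.enabled: true
xpack.monitoring.enabled: true
xpack.watcher.enabled: false
xpack.ml.enabled: false
xpack.graph.enabled: false

# kibana.yml (example flags)
xpack.security.enabled: true
xpack.monitoring.enabled: true
xpack.reporting.enabled: false
xpack.ml.enabled: false
xpack.graph.enabled: false
xpack.grokdebugger.enabled: false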
Use TensorFlow and NLP to detect duplicate Quora questions [Tutorial]

Sunith Shetty
21 Jun 2018
32 min read
This tutorial shows how to build an NLP project with TensorFlow that determines the semantic similarity between pairs of sentences using the Quora dataset. It is based on the work of Abhishek Thakur, who originally developed a solution with the Keras package. This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects.

Presenting the dataset

The data, made available for non-commercial purposes (https://www.quora.com/about/tos) in a Kaggle competition (https://www.kaggle.com/c/quora-question-pairs) and on Quora's blog (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs), consists of 404,351 question pairs with 255,045 negative samples (non-duplicates) and 149,306 positive samples (duplicates). There are approximately 40% positive samples, a slight imbalance that won't need particular corrections. Actually, as reported on the Quora blog, given their original sampling strategy, the number of duplicated examples in the dataset was much higher than the non-duplicated ones. In order to set up a more balanced dataset, the negative examples were upsampled by using pairs of related questions, that is, questions about the same topic that are actually not similar.

Before starting work on this project, you can simply download the data, which is about 55 MB, from its Amazon S3 repository at this link into our working directory. After loading it, we can start diving directly into the data by picking some example rows and examining them. The following diagram shows an actual snapshot of the first few rows from the dataset.

Exploring further into the data, we can find some examples of question pairs that mean the same thing, that is, duplicates, as follows:

How does Quora quickly mark questions as needing improvement?
Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…

Why did Trump win the Presidency?
How did Donald Trump win the 2016 Presidential Election?

What practical applications might evolve from the discovery of the Higgs Boson?
What are some practical benefits of the discovery of the Higgs Boson?

At first sight, duplicated questions have quite a few words in common, but they can be very different in length. On the other hand, examples of non-duplicate questions are as follows:

Who should I address my cover letter to if I'm applying to a big company like Mozilla?
Which car is better from a safety persepctive? swift or grand i10. My first priority is safety?

Mr. Robot (TV series): Is Mr. Robot a good representation of real-life hacking and hacking culture? Is the depiction of hacker societies realistic?
What mistakes are made when depicting hacking in Mr. Robot compared to real-life cyber security breaches or just a regular use of technologies?

How can I start an online shopping (e-commerce) website?
Which web technology is best suited for building a big e-commerce website?

Some questions from these examples are clearly not duplicated and have few words in common, but some others are more difficult to detect as unrelated. For instance, the second pair among the duplicates might look doubtful to some and leave even a human judge uncertain: the two questions might mean different things (why versus how), or they could be intended as the same from a superficial examination.
Looking deeper, we may even find more doubtful examples and even some clear data mistakes; we surely have some anomalies in the dataset (as the Quora post on the dataset warned), but given that the data is derived from a real-world problem, we can't do anything but deal with this kind of imperfection and strive to find a robust solution that works.

At this point, our exploration becomes more quantitative than qualitative, and some statistics on the question pairs are provided here:

Average number of characters in question1: 59.57
Minimum number of characters in question1: 1
Maximum number of characters in question1: 623
Average number of characters in question2: 60.14
Minimum number of characters in question2: 1
Maximum number of characters in question2: 1169

Question 1 and question 2 have roughly the same average number of characters, though we have more extremes in question 2. There also must be some trash in the data, since we cannot figure out a question made up of a single character.

We can even get a completely different vision of our data by plotting it as a word cloud and highlighting the most common words present in the dataset.

Figure 1: A word cloud made up of the most frequent words to be found in the Quora dataset

The presence of word sequences such as Hillary Clinton and Donald Trump reminds us that the data was gathered at a certain historical moment and that many questions we can find inside it are clearly ephemeral, reasonable only at the very time the dataset was collected. Other topics, such as programming language, World War, or earn money, could be longer lasting, both in terms of interest and in the validity of the answers provided.

After exploring the data a bit, it is now time to decide what target metric we will strive to optimize in our project. Throughout the article, we will be using accuracy as a metric to evaluate the performance of our models. Accuracy as a measure is simply focused on the effectiveness of the prediction, and it may miss some important differences between alternative models, such as discrimination power (is the model more able to detect duplicates or not?) or the exactness of probability scores (how much margin is there between being a duplicate and not being one?).

We chose accuracy based on the fact that this metric was the one decided on by Quora's engineering team to create a benchmark for this dataset (as stated in this blog post of theirs: https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning). Using accuracy as the metric makes it easier for us to evaluate and compare our models with the one from Quora's engineering team, and also with several other research papers. In addition, in a real-world application, our work may simply be evaluated on the basis of how many times it is just right or wrong, regardless of other considerations.

We can now proceed further in our project with some very basic feature engineering to start with.
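As an aside before moving on: a word cloud like the one described in Figure 1 can be generated with the third-party wordcloud package. The following is a minimal, hypothetical sketch; wordcloud and matplotlib are not among the project's required packages, and the data loading shown here simply anticipates the next section:

# Hypothetical sketch: draw a word cloud of the question text (not the book's code).
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

data = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')
# Concatenate all questions into one large string
text = ' '.join(data.question1.fillna('').astype(str)) + ' ' \
       + ' '.join(data.question2.fillna('').astype(str))

wordcloud = WordCloud(width=1024, height=512).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()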
To install numpy and pandas: pip install numpy pip install pandas The dataset can be loaded into memory easily by using pandas and a specialized data structure, the pandas dataframe (we expect the dataset to be in the same directory as your script or Jupyter notebook): import pandas as pd import numpy as np data = pd.read_csv('quora_duplicate_questions.tsv', sep='t') data = data.drop(['id', 'qid1', 'qid2'], axis=1) We will be using the pandas dataframe denoted by data , and also when we work with our TensorFlow model and provide input to it. We can now start by creating some very basic features. These basic features include length-based features and string-based features: Length of question1 Length of question2 Difference between the two lengths Character length of question1 without spaces Character length of question2 without spaces Number of words in question1 Number of words in question2 Number of common words in question1 and question2 These features are dealt with one-liners transforming the original input using the pandas package in Python and its method apply: # length based features data['len_q1'] = data.question1.apply(lambda x: len(str(x))) data['len_q2'] = data.question2.apply(lambda x: len(str(x))) # difference in lengths of two questions data['diff_len'] = data.len_q1 - data.len_q2 # character length based features data['len_char_q1'] = data.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', ''))))) data['len_char_q2'] = data.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', ''))))) # word length based features data['len_word_q1'] = data.question1.apply(lambda x: len(str(x).split())) data['len_word_q2'] = data.question2.apply(lambda x: len(str(x).split())) # common words in the two questions data['common_words'] = data.apply(lambda x: len(set(str(x['question1']) .lower().split()) .intersection(set(str(x['question2']) .lower().split()))), axis=1) For future reference, we will mark this set of features as feature set-1 or fs_1: fs_1 = ['len_q1', 'len_q2', 'diff_len', 'len_char_q1', 'len_char_q2', 'len_word_q1', 'len_word_q2', 'common_words'] This simple approach will help you to easily recall and combine a different set of features in the machine learning models we are going to build, turning comparing different models run by different feature sets into a piece of cake. Creating fuzzy features The next set of features are based on fuzzy string matching. Fuzzy string matching is also known as approximate string matching and is the process of finding strings that approximately match a given pattern. The closeness of a match is defined by the number of primitive operations necessary to convert the string into an exact match. These primitive operations include insertion (to insert a character at a given position), deletion (to delete a particular character), and substitution (to replace a character with a new one). Fuzzy string matching is typically used for spell checking, plagiarism detection, DNA sequence matching, spam filtering, and so on and it is part of the larger family of edit distances, distances based on the idea that a string can be transformed into another one. It is frequently used in natural language processing and other applications in order to ascertain the grade of difference between two strings of characters. It is also known as Levenshtein distance, from the name of the Russian scientist, Vladimir Levenshtein, who introduced it in 1965. 
These features were created using the fuzzywuzzy package available for Python (https://pypi.python.org/pypi/fuzzywuzzy). This package uses Levenshtein distance to calculate the differences between two sequences, which in our case are the pair of questions.

The fuzzywuzzy package can be installed using pip:

pip install fuzzywuzzy

As an important dependency, fuzzywuzzy requires the Python-Levenshtein package (https://github.com/ztane/python-Levenshtein/), which is a blazingly fast implementation of this classic algorithm, powered by compiled C code. To make the calculations much faster using fuzzywuzzy, we also need to install the Python-Levenshtein package:

pip install python-Levenshtein

The fuzzywuzzy package offers many different types of ratio, but we will be using only the following:

QRatio
WRatio
Partial ratio
Partial token set ratio
Partial token sort ratio
Token set ratio
Token sort ratio

Examples of fuzzywuzzy features on Quora data:

from fuzzywuzzy import fuzz
fuzz.QRatio("Why did Trump win the Presidency?", "How did Donald Trump win the 2016 Presidential Election")

This code snippet will result in the value of 67 being returned:

fuzz.QRatio("How can I start an online shopping (e-commerce) website?", "Which web technology is best suitable for building a big E-Commerce website?")

In this comparison, the returned value will be 60. Given these examples, we notice that although the values of QRatio are close to each other, the value for the similar question pair from the dataset is higher than that for the pair with no similarity. Let's take a look at another feature from fuzzywuzzy for these same pairs of questions:

fuzz.partial_ratio("Why did Trump win the Presidency?", "How did Donald Trump win the 2016 Presidential Election")

In this case, the returned value is 73:

fuzz.partial_ratio("How can I start an online shopping (e-commerce) website?", "Which web technology is best suitable for building a big E-Commerce website?")

Now the returned value is 57. Using the partial_ratio method, we can observe how the difference in scores for these two pairs of questions increases notably, allowing easier discrimination between a duplicate pair and a non-duplicate one. We assume that these features might add value to our models.

By using pandas and the fuzzywuzzy package in Python, we can again apply these features as simple one-liners:

data['fuzz_qratio'] = data.apply(lambda x: fuzz.QRatio(
    str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_WRatio'] = data.apply(lambda x: fuzz.WRatio(
    str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_ratio'] = data.apply(lambda x:
    fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_set_ratio'] = data.apply(lambda x:
    fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_sort_ratio'] = data.apply(lambda x:
    fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_set_ratio'] = data.apply(lambda x:
    fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_sort_ratio'] = data.apply(lambda x:
    fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)

This set of features is henceforth denoted as feature set-2 or fs_2:

fs_2 = ['fuzz_qratio', 'fuzz_WRatio', 'fuzz_partial_ratio',
        'fuzz_partial_token_set_ratio', 'fuzz_partial_token_sort_ratio',
        'fuzz_token_set_ratio', 'fuzz_token_sort_ratio']

Again, we will store our work and save it for later use when modeling.
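For instance, a minimal way to persist the features computed so far is shown below; the filename is arbitrary and not prescribed by the original article:

# Persist the engineered features so they can be reloaded in a later session
data.to_csv('quora_features.csv', index=False)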
Resorting to TF-IDF and SVD features

The next few sets of features are based on TF-IDF and SVD. Term Frequency-Inverse Document Frequency (TF-IDF) is one of the algorithms at the foundation of information retrieval. In its most common form, the algorithm can be written as follows:

TF(t) = C(t) / N
IDF(t) = log(ND / NDt)
TF-IDF(t) = TF(t) * IDF(t)

You can understand the formula using this notation: C(t) is the number of times a term t appears in a document and N is the total number of terms in the document; this results in the Term Frequency (TF). ND is the total number of documents and NDt is the number of documents containing the term t; this provides the Inverse Document Frequency (IDF). TF-IDF for a term t is the multiplication of the Term Frequency and the Inverse Document Frequency for the given term t.

Without any prior knowledge, other than about the documents themselves, such a score will highlight all the terms that can easily discriminate a document from the others, down-weighting the common words that won't tell you much, such as the common parts of speech (articles, for instance).

If you need a more hands-on explanation of TFIDF, this great online tutorial will help you try coding the algorithm yourself and testing it on some text data: https://stevenloria.com/tf-idf/

For convenience and speed of execution, we resorted to the scikit-learn implementation of TFIDF. If you don't already have scikit-learn installed, you can install it using pip:

pip install -U scikit-learn

We create TFIDF features for both question1 and question2 separately (in order to type less, we just deep copy the question1 TfidfVectorizer):

from sklearn.feature_extraction.text import TfidfVectorizer
from copy import deepcopy

tfv_q1 = TfidfVectorizer(min_df=3,
                         max_features=None,
                         strip_accents='unicode',
                         analyzer='word',
                         token_pattern=r'\w{1,}',
                         ngram_range=(1, 2),
                         use_idf=1,
                         smooth_idf=1,
                         sublinear_tf=1,
                         stop_words='english')
tfv_q2 = deepcopy(tfv_q1)

It must be noted that the parameters shown here have been selected after quite a lot of experiments. These parameters generally work pretty well with all other problems concerning natural language processing, specifically text classification. One might need to change the stop word list to the language in question.

We can now obtain the TFIDF matrices for question1 and question2 separately:

q1_tfidf = tfv_q1.fit_transform(data.question1.fillna(""))
q2_tfidf = tfv_q2.fit_transform(data.question2.fillna(""))

In our TFIDF processing, we computed the TFIDF matrices based on all the data available (we used the fit_transform method). This is quite a common approach in Kaggle competitions because it helps to score higher on the leaderboard. However, if you are working in a real setting, you may want to exclude a part of the data as a training or validation set in order to be sure that your TFIDF processing helps your model to generalize to a new, unseen dataset.

After we have the TFIDF features, we move to SVD features. SVD is a feature decomposition method, and it stands for singular value decomposition. It is largely used in NLP because of a technique called Latent Semantic Analysis (LSA).

A detailed discussion of SVD and LSA is beyond the scope of this article, but you can get an idea of their workings by trying these two approachable and clear online tutorials: https://alyssaq.github.io/2015/singular-value-decomposition-visualisation/ and https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/

To create the SVD features, we again use the scikit-learn implementation.
This implementation is a variation of traditional SVD and is known as TruncatedSVD. A TruncatedSVD is an approximate SVD method that can provide you with a reliable yet computationally fast SVD matrix decomposition. You can find more hints about how this technique works and how it can be applied by consulting this web page: http://langvillea.people.cofc.edu/DISSECTION-LAB/Emmie'sLSI-SVDModule/p5module.html

from sklearn.decomposition import TruncatedSVD
svd_q1 = TruncatedSVD(n_components=180)
svd_q2 = TruncatedSVD(n_components=180)

We chose 180 components for the SVD decomposition, and these features are calculated on a TF-IDF matrix:

question1_vectors = svd_q1.fit_transform(q1_tfidf)
question2_vectors = svd_q2.fit_transform(q2_tfidf)

Feature set-3 is derived from a combination of these TF-IDF and SVD features. For example, we can have only the TF-IDF features for the two questions separately going into the model, or we can have the TF-IDF of the two questions combined with an SVD on top of them, and then the model kicks in, and so on. These features are explained as follows.

Feature set-3(1), or fs3_1, is created using two different TF-IDFs for the two questions, which are then stacked together horizontally and passed to a machine learning model. This can be coded as:

from scipy import sparse

# obtain features by stacking the sparse matrices together
fs3_1 = sparse.hstack((q1_tfidf, q2_tfidf))

Feature set-3(2), or fs3_2, is created by combining the two questions and using a single TF-IDF:

tfv = TfidfVectorizer(min_df=3,
                      max_features=None,
                      strip_accents='unicode',
                      analyzer='word',
                      token_pattern=r'\w{1,}',
                      ngram_range=(1, 2),
                      use_idf=1,
                      smooth_idf=1,
                      sublinear_tf=1,
                      stop_words='english')

# combine questions and calculate tf-idf
q1q2 = data.question1.fillna("")
q1q2 += " " + data.question2.fillna("")
fs3_2 = tfv.fit_transform(q1q2)

The next subset of features in this feature set, feature set-3(3) or fs3_3, consists of separate TF-IDFs and SVDs for both questions. This can be coded as follows:

# obtain features by stacking the matrices together
fs3_3 = np.hstack((question1_vectors, question2_vectors))

We can similarly create a couple more combinations using TF-IDF and SVD and call them fs3-4 and fs3-5, respectively. In the original book, these are depicted in two diagrams (not reproduced here), and the code is left as an exercise for the reader; one possible interpretation is sketched just before the next section.

After the basic feature set and some TF-IDF and SVD features, we can now move to more complicated features before diving into the machine learning and deep learning models.
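Since the diagrams for fs3-4 and fs3-5 are not reproduced here, the following is only one plausible reading of those two combinations, not necessarily the exact ones used in the original book:

# Hypothetical sketch of fs3-4 and fs3-5 (one plausible interpretation)
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# fs3-4: separate TF-IDFs for the two questions, stacked, with a single SVD on top
svd_stacked = TruncatedSVD(n_components=180)
fs3_4 = svd_stacked.fit_transform(sparse.hstack((q1_tfidf, q2_tfidf)).tocsr())

# fs3-5: the single TF-IDF computed on the combined questions (fs3_2), followed by an SVD
svd_combined = TruncatedSVD(n_components=180)
fs3_5 = svd_combined.fit_transform(fs3_2)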
Mapping with Word2vec embeddings

Very broadly, Word2vec models are two-layer neural networks that take a text corpus as input and output a vector for every word in that corpus. After fitting, words with similar meanings have their vectors close to each other, that is, the distance between them is small compared to the distance between the vectors of words that have very different meanings. Nowadays, Word2vec has become a standard in natural language processing problems, and it often provides very useful insights into information retrieval tasks. For this particular problem, we will be using the Google news vectors. This is a pretrained Word2vec model trained on the Google News corpus.

Every word, when represented by its Word2vec vector, gets a position in space, as depicted in the diagram in the original book (not reproduced here). All the words in that example, such as Germany, Berlin, France, and Paris, can be represented by a 300-dimensional vector if we are using the pretrained vectors from the Google News corpus. When we use Word2vec representations for these words and we subtract the vector of Germany from the vector of Berlin and add the vector of France to it, we will get a vector that is very similar to the vector of Paris. The Word2vec model thus carries the meaning of words in the vectors. The information carried by these vectors constitutes a very useful feature for our task.

For a user-friendly, yet more in-depth, explanation and description of possible applications of Word2vec, we suggest reading https://www.distilled.net/resources/a-beginners-guide-to-Word2vec-aka-whats-the-opposite-of-canada/, or if you need a more mathematically defined explanation, we recommend reading this paper: http://www.1-4-5.net/~dmm/ml/how_does_Word2vec_work.pdf

To load the Word2vec features, we will be using Gensim. If you don't have Gensim, you can install it easily using pip. At this time, it is suggested you also install the pyemd package, which will be used by the WMD distance function, a function that will help us to relate two Word2vec vectors:

pip install gensim
pip install pyemd

To load the Word2vec model, we download the GoogleNews-vectors-negative300.bin.gz binary and use Gensim's load_word2vec_format function to load it into memory. You can easily download the binary from an Amazon AWS repository using the wget command from a shell:

wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

After downloading and decompressing the file, you can use it with the Gensim KeyedVectors functions:

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

Now, we can easily get the vector of a word by calling model[word]. However, a problem arises when we are dealing with sentences instead of individual words. In our case, we need vectors for all of question1 and question2 in order to come up with some kind of comparison. For this, we can use the following code snippet. The snippet basically adds the vectors for all the words in a sentence that are available in the Google news vectors and gives a normalized vector at the end. We can call this sentence to vector, or Sent2Vec.

Make sure that you have the Natural Language Toolkit (NLTK) installed before running the following function:

$ pip install nltk

It is also suggested that you download the punkt and stopwords packages, as they are part of NLTK:

import nltk
nltk.download('punkt')
nltk.download('stopwords')

If NLTK is now available, you just have to run the following snippet and define the sent2vec function:

from nltk.corpus import stopwords
from nltk import word_tokenize

stop_words = set(stopwords.words('english'))

def sent2vec(s, model):
    M = []
    words = word_tokenize(str(s).lower())
    for word in words:
        # It shouldn't be a stopword
        if word not in stop_words:
            # nor contain numbers
            if word.isalpha():
                # and be part of word2vec
                if word in model:
                    M.append(model[word])
    M = np.array(M)
    if len(M) > 0:
        v = M.sum(axis=0)
        return v / np.sqrt((v ** 2).sum())
    else:
        return np.zeros(300)

When the phrase is null, we arbitrarily decide to give back a standard vector of zero values.
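As a quick, illustrative sanity check (the question text below is made up and not taken from the dataset), we can verify that the function returns a single 300-dimensional vector, and that the loaded model supports the word-analogy behavior described earlier:

# Illustrative only: check the output of sent2vec and the Berlin/Paris analogy
example_vector = sent2vec("How can I improve my English speaking skills?", model)
print(example_vector.shape)   # (300,) -- one L2-normalized vector per question

# Berlin - Germany + France should land close to Paris
print(model.most_similar(positive=['Berlin', 'France'], negative=['Germany'], topn=1))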
To calculate the similarity between the questions, another feature that we created was word mover's distance. Word mover's distance uses Word2vec embeddings and works on a principle similar to that of earth mover's distance to give a distance between two text documents. Simply put, word mover's distance provides the minimum distance needed to move all the words from one document to another document. The WMD has been introduced by this paper: KUSNER, Matt, et al. From word embeddings to document distances. In: International Conference on Machine Learning. 2015. p. 957-966 which can be found at http://proceedings.mlr.press/v37/kusnerb15.pdf. For a hands-on tutorial on the distance, you can also refer to this tutorial based on the Gensim implementation of the distance: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html Final Word2vec (w2v) features also include other distances, more usual ones such as the Euclidean or cosine distance. We complete the sequence of features with some measurement of the distribution of the two document vectors: Word mover distance Normalized word mover distance Cosine distance between vectors of question1 and question2 Manhattan distance between vectors of question1 and question2 Jaccard similarity between vectors of question1 and question2 Canberra distance between vectors of question1 and question2 Euclidean distance between vectors of question1 and question2 Minkowski distance between vectors of question1 and question2 Braycurtis distance between vectors of question1 and question2 The skew of the vector for question1 The skew of the vector for question2 The kurtosis of the vector for question1 The kurtosis of the vector for question2 All the Word2vec features are denoted by fs4. A separate set of w2v features consists in the matrices of Word2vec vectors themselves: Word2vec vector for question1 Word2vec vector for question2 These will be represented by fs5: w2v_q1 = np.array([sent2vec(q, model) for q in data.question1]) w2v_q2 = np.array([sent2vec(q, model) for q in data.question2]) In order to easily implement all the different distance measures between the vectors of the Word2vec embeddings of the Quora questions, we use the implementations found in the scipy.spatial.distance module: from scipy.spatial.distance import cosine, cityblock, jaccard, canberra, euclidean, minkowski, braycurtis data['cosine_distance'] = [cosine(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['cityblock_distance'] = [cityblock(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['jaccard_distance'] = [jaccard(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['canberra_distance'] = [canberra(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['euclidean_distance'] = [euclidean(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] data['minkowski_distance'] = [minkowski(x,y,3) for (x,y) in zip(w2v_q1, w2v_q2)] data['braycurtis_distance'] = [braycurtis(x,y) for (x,y) in zip(w2v_q1, w2v_q2)] All the features names related to distances are gathered under the list fs4_1: fs4_1 = ['cosine_distance', 'cityblock_distance', 'jaccard_distance', 'canberra_distance', 'euclidean_distance', 'minkowski_distance', 'braycurtis_distance'] The Word2vec matrices for the two questions are instead horizontally stacked and stored away in the w2v variable for later usage: w2v = np.hstack((w2v_q1, w2v_q2)) The Word Mover's Distance is implemented using a function that returns the distance between two questions, after having transformed them into lowercase and after removing any stopwords. 
Moreover, we also calculate a normalized version of the distance, after transforming all the Word2vec vectors into L2-normalized vectors (each vector is transformed to the unit norm: if we square each element in the vector and sum them all, the result equals one) using the init_sims method:

def wmd(s1, s2, model):
    s1 = str(s1).lower().split()
    s2 = str(s2).lower().split()
    stop_words = stopwords.words('english')
    s1 = [w for w in s1 if w not in stop_words]
    s2 = [w for w in s2 if w not in stop_words]
    return model.wmdistance(s1, s2)

data['wmd'] = data.apply(lambda x: wmd(x['question1'], x['question2'], model), axis=1)
model.init_sims(replace=True)
data['norm_wmd'] = data.apply(lambda x: wmd(x['question1'], x['question2'], model), axis=1)
fs4_2 = ['wmd', 'norm_wmd']

After these last computations, we now have most of the important features needed to create some basic machine learning models, which will serve as a benchmark for our deep learning models. Let's train some machine learning models on these and other Word2vec based features.

Testing machine learning models

Before proceeding, depending on your system, you may need to clean up memory and free space for the machine learning models. This is done by deleting any past variables that are no longer required, calling gc.collect(), and then checking the available memory with the report returned by the psutil.virtual_memory() function:

import gc
import psutil
del([tfv_q1, tfv_q2, tfv, q1q2, question1_vectors, question2_vectors, svd_q1, svd_q2, q1_tfidf, q2_tfidf])
del([w2v_q1, w2v_q2])
del([model])
gc.collect()
psutil.virtual_memory()

At this point, we simply recap the different features created up to now, and their meaning in terms of generated features:

fs_1: List of basic features
fs_2: List of fuzzy features
fs3_1: Sparse data matrix of TFIDF for separated questions
fs3_2: Sparse data matrix of TFIDF for combined questions
fs3_3: Data matrix of SVD-reduced TFIDF for separated questions
fs3_4: List of SVD statistics
fs4_1: List of w2v distances
fs4_2: List of wmd distances
w2v: A matrix of the phrases' Word2vec vectors obtained by means of the Sent2Vec function

We evaluate two basic and very popular machine learning models, namely logistic regression and gradient boosting, the latter using the xgboost package in Python. The following table provides the accuracy of the logistic regression and xgboost algorithms on the different feature sets created earlier, as obtained during the Kaggle competition:

Feature set | Logistic regression accuracy | xgboost accuracy
Basic features (fs1) | 0.658 | 0.721
Basic features + fuzzy features (fs1 + fs2) | 0.660 | 0.738
Basic features + fuzzy features + w2v features (fs1 + fs2 + fs4) | 0.676 | 0.766
W2v vector features (fs5) | * | 0.78
Basic features + fuzzy features + w2v features + w2v vector features (fs1 + fs2 + fs4 + fs5) | * | 0.814
TFIDF-SVD features (fs3-1) | 0.777 | 0.749
TFIDF-SVD features (fs3-2) | 0.804 | 0.748
TFIDF-SVD features (fs3-3) | 0.706 | 0.763
TFIDF-SVD features (fs3-4) | 0.700 | 0.753
TFIDF-SVD features (fs3-5) | 0.714 | 0.759

* = These models were not trained due to high memory requirements.

We can treat the performances achieved as benchmarks or baseline numbers before starting with the deep learning models, but we won't limit ourselves to that; we will also try to replicate some of them. We are going to start by importing all the necessary packages. As for the logistic regression, we will be using the scikit-learn implementation.
The xgboost is a scalable, portable, and distributed gradient boosting library (a tree ensemble machine learning algorithm). Initially created by Tianqi Chen from Washington University, it has been enriched with a Python wrapper by Bing Xu, and an R interface by Tong He (you can read the story behind xgboost directly from its principal creator at homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html ). The xgboost is available for Python, R, Java, Scala, Julia, and C++, and it can work both on a single machine (leveraging multithreading) and in Hadoop and Spark clusters. Detailed instruction for installing xgboost on your system can be found on this page: github.com/dmlc/xgboost/blob/master/doc/build.md The installation of xgboost on both Linux and macOS is quite straightforward, whereas it is a little bit trickier for Windows users. For this reason, we provide specific installation steps for having xgboost working on Windows: First, download and install Git for Windows (git-for-windows.github.io) Then, you need a MINGW compiler present on your system. You can download it from www.mingw.org according to the characteristics of your system From the command line, execute: $> git clone --recursive https://github.com/dmlc/xgboost $> cd xgboost $> git submodule init $> git submodule update Then, always from the command line, you copy the configuration for 64-byte systems to be the default one: $> copy makemingw64.mk config.mk Alternatively, you just copy the plain 32-byte version: $> copy makemingw.mk config.mk After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling process: $> mingw32-make -j4 In MinGW, the make command comes with the name mingw32-make; if you are using a different compiler, the previous command may not work, but you can simply try: $> make -j4 Finally, if the compiler completed its work without errors, you can install the package in Python with: $> cd python-package $> python setup.py install If xgboost has also been properly installed on your system, you can proceed with importing both machine learning algorithms: from sklearn import linear_model from sklearn.preprocessing import StandardScaler import xgboost as xgb Since we will be using a logistic regression solver that is sensitive to the scale of the features (it is the sag solver from https://github.com/EpistasisLab/tpot/issues/292, which requires a linear computational time in respect to the size of the data), we will start by standardizing the data using the scaler function in scikit-learn: scaler = StandardScaler() y = data.is_duplicate.values y = y.astype('float32').reshape(-1, 1) X = data[fs_1+fs_2+fs3_4+fs4_1+fs4_2] X = X.replace([np.inf, -np.inf], np.nan).fillna(0).values X = scaler.fit_transform(X) X = np.hstack((X, fs3_3)) We also select the data for the training by first filtering the fs_1, fs_2, fs3_4, fs4_1, and fs4_2 set of variables, and then stacking the fs3_3 sparse SVD data matrix. 
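Before splitting the data, it is worth a quick sanity check that the stacked matrix has the shape we expect and contains only finite values (the replace/fillna step above should already guarantee the latter). A minimal sketch, assuming X and y are the arrays just built:

import numpy as np

# The number of rows in X and y must match before we split
print("X shape:", X.shape)
print("y shape:", y.shape)
assert X.shape[0] == y.shape[0]

# Scaling and stacking should not have introduced NaN or infinite values
assert np.isfinite(X).all()

If either assertion fails, the most likely culprit is a feature column that was not covered by the replace/fillna step above.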
We also provide a random split, separating 1/10 of the data for validation purposes (in order to effectively assess the quality of the created model): np.random.seed(42) n_all, _ = y.shape idx = np.arange(n_all) np.random.shuffle(idx) n_split = n_all // 10 idx_val = idx[:n_split] idx_train = idx[n_split:] x_train = X[idx_train] y_train = np.ravel(y[idx_train]) x_val = X[idx_val] y_val = np.ravel(y[idx_val]) As a first model, we try logistic regression, setting the regularization l2 parameter C to 0.1 (modest regularization). Once the model is ready, we test its efficacy on the validation set (x_val for the training matrix, y_val for the correct answers). The results are assessed on accuracy, that is the proportion of exact guesses on the validation set: logres = linear_model.LogisticRegression(C=0.1, solver='sag', max_iter=1000) logres.fit(x_train, y_train) lr_preds = logres.predict(x_val) log_res_accuracy = np.sum(lr_preds == y_val) / len(y_val) print("Logistic regr accuracy: %0.3f" % log_res_accuracy) After a while (the solver has a maximum of 1,000 iterations before giving up converging the results), the resulting accuracy on the validation set will be 0.743, which will be our starting baseline. Now, we try to predict using the xgboost algorithm. Being a gradient boosting algorithm, this learning algorithm has more variance (ability to fit complex predictive functions, but also to overfit) than a simple logistic regression afflicted by greater bias (in the end, it is a summation of coefficients) and so we expect much better results. We fix the max depth of its decision trees to 4 (a shallow number, which should prevent overfitting) and we use an eta of 0.02 (it will need to grow many trees because the learning is a bit slow). We also set up a watchlist, keeping an eye on the validation set for an early stop if the expected error on the validation doesn't decrease for over 50 steps. It is not best practice to stop early on the same set (the validation set in our case) we use for reporting the final results. In a real-world setting, ideally, we should set up a validation set for tuning operations, such as early stopping, and a test set for reporting the expected results when generalizing to new data. After setting all this, we run the algorithm. This time, we will have to wait for longer than we when running the logistic regression: params = dict() params['objective'] = 'binary:logistic' params['eval_metric'] = ['logloss', 'error'] params['eta'] = 0.02 params['max_depth'] = 4 d_train = xgb.DMatrix(x_train, label=y_train) d_valid = xgb.DMatrix(x_val, label=y_val) watchlist = [(d_train, 'train'), (d_valid, 'valid')] bst = xgb.train(params, d_train, 5000, watchlist, early_stopping_rounds=50, verbose_eval=100) xgb_preds = (bst.predict(d_valid) >= 0.5).astype(int) xgb_accuracy = np.sum(xgb_preds == y_val) / len(y_val) print("Xgb accuracy: %0.3f" % xgb_accuracy) The final result reported by xgboost is 0.803 accuracy on the validation set. Building TensorFlow model The deep learning models in this article are built using TensorFlow, based on the original script written by Abhishek Thakur using Keras (you can read the original code at https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question). Keras is a Python library that provides an easy interface to TensorFlow. Tensorflow has official support for Keras, and the models trained using Keras can easily be converted to TensorFlow models. Keras enables the very fast prototyping and testing of deep learning models. 
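Just to illustrate the kind of brevity Keras allows, here is a generic sketch of a two-input duplicate-question classifier written with tf.keras. It is not the architecture used in this project, and the vocabulary size, sequence length, and layer sizes are arbitrary placeholders:

import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical sizes, chosen only for illustration
vocab_size, embed_dim, max_len = 50000, 300, 40

q1_in = layers.Input(shape=(max_len,), name='question1')
q2_in = layers.Input(shape=(max_len,), name='question2')

# One embedding and one LSTM encoder shared by both questions
embed = layers.Embedding(vocab_size, embed_dim)
encode = layers.LSTM(128)

q1_enc = encode(embed(q1_in))
q2_enc = encode(embed(q2_in))

merged = layers.concatenate([q1_enc, q2_enc])
output = layers.Dense(1, activation='sigmoid')(merged)

keras_model = Model(inputs=[q1_in, q2_in], outputs=output)
keras_model.compile(optimizer='adam', loss='binary_crossentropy',
                    metrics=['accuracy'])
keras_model.summary()

A handful of lines is enough to define, compile, and inspect a working model, which is precisely why Keras is attractive for prototyping.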
In our project, however, we rewrote the solution entirely in TensorFlow from scratch. To start, let's import the necessary libraries, in particular TensorFlow, and check its version by printing it:

import zipfile
from tqdm import tqdm_notebook as tqdm
import tensorflow as tf
print("TensorFlow version %s" % tf.__version__)

At this point, we simply load the data into the df pandas DataFrame, reusing the data already in memory or reading it from disk. We replace the missing values with an empty string and set the y variable containing the target answer, encoded as 1 (duplicated) or 0 (not duplicated):

try:
    df = data[['question1', 'question2', 'is_duplicate']]
except:
    df = pd.read_csv('data/quora_duplicate_questions.tsv', sep='\t')
    df = df.drop(['id', 'qid1', 'qid2'], axis=1)
df = df.fillna('')
y = df.is_duplicate.values
y = y.astype('float32').reshape(-1, 1)

To summarize, we built a model with the help of TensorFlow in order to detect duplicated questions from the Quora dataset. To know more about how to build and train your own deep learning models with TensorFlow confidently, do check out the book TensorFlow Deep Learning Projects.

TensorFlow 1.9.0-rc0 release announced
Implementing feedforward networks with TensorFlow
How TFLearn makes building TensorFlow models easier


How to execute jobs in an iterative way with Pentaho Data Integration (PDI)

Vijin Boricha
26 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This is a book excerpt  from Learning Pentaho Data Integration 8 CE - Third Edition written by María Carina Roldán.  From this book, you will learn to explore, transform, and integrate your data across multiple sources.[/box] Today, we will learn to configure and use Job executor along with capturing the result filenames. Using Job executors The Job Executor is a PDI step that allows you to execute a Job several times simulating a loop. The executor receives a dataset, and then executes the Job once for each row or a set of rows of the incoming dataset. To understand how this works, we will build a very simple example. The Job that we will execute will have two parameters: a folder and a file. It will create the folder, and then it will create an empty file inside the new folder. Both the name of the folder and the name of the file will be taken from the parameters. The main transformation will execute the Job iteratively for a list of folder and file names. Let's start by creating the Job: Create a new Job. Double-click on the work area to bring up the Job properties window. Use it to define two named parameters: FOLDER_NAME and FILE_NAME. Drag a START, a Create a folder, and a Create file entry to the work area and link them as follows: 4. Double-click the Create a folder entry. As Folder name, type ${FOLDER_NAME}. 5. Double-click the Create file entry. As File name, type ${FOLDER_NAME}/${FILE_NAME}. 6. Save the Job and test it, providing values for the folder and filename. The Job should create a folder with an empty file inside, both with the names that you provide as parameters. Now create the main Transformation: Create a Transformation. Drag a Data Grid step to the work area and define a single field named foldername. The type should be String. Fill the Data Grid with a list of folders to be created, as shown in the next example: 4. As the name of the file, you can create any name of your choice. As an example, we will create a random name. For this, we use a Generate random value and a UDJE step, and configure them as shown: 5. With the last step selected, run a preview. You should see the full list of folders and filenames, as shown in the next sample image: 6. At the end of the stream, add a Job Executor step. You will find it under the Flow category of steps. 7. Double-click on the Job Executor step. 8. As Job, select the path to the Job created before, for example, ${Internal.Entry.Current.Directory}/create_folder_and_file.kjb 9. Configure the Parameters grid as follows: 10. Close the window and save the transformation. 11. Run the transformation. The Step Metrics in the Execution Results window reflects what happens: 12. Click on the Logging tab. You will see the full log for the Job. 13. Browse your filesystem. You will find all the folders and files just created. As you see, PDI executes the Job as many times as the number of rows that arrives to the Job Executor step, once for every row. Each time the Job executes, it receives values for the named parameters, and creates the folder and file using these values. Configuring the executors with advanced settings Just as it happens with the Transformation Executors that you already know, the Job Executors can also be configured with similar settings. This allows you to customize the behavior and the output of the Job to be executed. Let's summarize the options. 
Getting the results of the execution of the job The Job Executor doesn't cause the Transformation to abort if the Job that it runs has errors. To verify this, run the sample transformation again. As the folders already exist, you expect that each individual execution fails. However, the Job Executor ends without error. In order to capture the errors in the execution of the Job, you have to get the execution results. This is how you do it: Drag a step to the work area where you want to redirect the results. It could be any step. For testing purposes, we will use a Text file output step. Create a hop from the Job Executor toward this new step. You will be prompted for the kind of hop. Choose the This output will contain the execution results option as shown here: 3. Double-click on the Job Executor and select the Execution results tab. You will see the list of metrics and results available. The Field name column has the names of the fields that will contain these results. If there are results you are not interested in, delete the value in the Field name column. For the results that you want to keep, you can leave the proposed field name or type a different name. The following screenshot shows an example that only generates a field for the log: 4. When you are done, click on OK. 5. With the destination step selected, run a preview. You will see the results that you just defined, as shown in the next example: 6. If you copy any of the lines and paste it into a text editor, you will see the full log for the execution, as shown in the following example: 2017/10/26 23:45:53 - create_folder_and_file - Starting entry [Create a folder] 2017/10/26 23:45:53 - create_folder_and_file - Starting entry [Create file] 2017/10/26 23:45:53 - Create file - File [c:/pentaho/files/folder1/sample_50n9q8oqsg6ib.tmp] created! 2017/10/26 23:45:53 - create_folder_and_file - Finished job entry [Create file] (result=[true]) 2017/10/26 23:45:53 - create_folder_and_file - Finished job entry [Create a folder] (result=[true]) Working with groups of data As you know, jobs don't work with datasets. Transformations do. However, you can still use the Job Executor to send the rows to the Job. Then, any transformation executed by your Job can get the rows using a Get rows from result step. By default, the Job Executor executes once for every row in your dataset, but there are several possibilities where you can configure in the Row Grouping tab of the configuration window: You can send groups of N rows, where N is greater than 1 You can pass a group of rows based on the value in a field You can send groups of rows based on the time the step collects rows before executing the Job Using variables and named parameters If the Job has named parameters—as in the example that we built—you provide values for them in the Parameters tab of the Job Executor step. For each named parameter, you can assign the value of a field or a fixed-static-value. In case you execute the Job for a group of rows instead of a single one, the parameters will take the values from the first row of data sent to the Job. Capturing the result filenames At the output of the Job Executor, there is also the possibility to get the result filenames. Let's modify the Transformation that we created to show an example of this kind of output: Open the transformation created at the beginning of the section. Drag a Write to log step to the work area. Create a hop from the Job Executor toward the Write to log step. 
When asked for the kind of hop, select the option named This output will contain the result file names after execution. Your transformation will look as follows: 4. Double-click the Job Executor and select the Result files tab. 5. Configure it as shown: Double-click the Write to log step and, in the Fields grid, add the FileName field. Close the window and save the transformation. Run it. Look at the Logging tab in the Execution Results window. You will see the names of the files in the result filelist, which are the files created in the Job: ... ... - Write to log.0 - ... - Write to log.0 - ------------> Linenr 1---------------------- -------- ... - Write to log.0 - filename = file:///c:/pentaho/files/folder1/sample_5agh7lj6ncqh7.tmp ... - Write to log.0 - ... - Write to log.0 - ==================== ... - Write to log.0 - ... - Write to log.0 - ------------> Linenr 2---------------------- -------- ... - Write to log.0 - filename = file:///c:/pentaho/files/folder2/sample_6n0rhmrpvj21n.tmp ... - Write to log.0 - ... - Write to log.0 - ==================== ... - Write to log.0 - ... - Write to log.0 - ------------> Linenr 3---------------------- -------- ... - Write to log.0 - filename = file:///c:/pentaho/files/folder3/sample_7ulkja68vf1td.tmp ... - Write to log.0 - ... - Write to log.0 - ==================== ... The example that you just created showed the option with a Job Executor. We learned how to nest jobs and iterate the execution of jobs. You can know more about executing transformations in an iterative way and launching transformations and jobs from the Command Line from this book Learning Pentaho Data Integration 8 CE - Third Edition.      
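Since the closing paragraph mentions launching transformations and jobs from the command line, here is a hedged sketch of how the Job built in this section could be run with Kitchen, PDI's command-line job runner. The paths and parameter values are placeholders, and the exact option syntax can vary between PDI versions, so check the documentation of your release:

# Linux/macOS; on Windows use Kitchen.bat with the /option:value syntax
./kitchen.sh -file=/home/pentaho/jobs/create_folder_and_file.kjb \
             -param:FOLDER_NAME=/tmp/pentaho/folder1 \
             -param:FILE_NAME=sample.tmp \
             -level=Basic

The two -param options map to the FOLDER_NAME and FILE_NAME named parameters defined in the Job at the beginning of this section.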


Creating effective dashboards using Splunk [Tutorial]

Sunith Shetty
28 Jul 2018
10 min read
Splunk is easy to use for developing a powerful analytical dashboard with multiple panels. A dashboard with too many panels, however, will require scrolling down the page and can cause the viewer to miss crucial information. An effective dashboard should generally meet the following conditions: Single screen view: The dashboard fits in a single window or page, with no scrolling Multiple data points: Charts and visualizations should display a number of data points Crucial information highlighted: The dashboard points out the most important information, using appropriate titles, labels, legends, markers, and conditional formatting as required Created with the user in mind: Data is presented in a way that is meaningful to the user Loads quickly: The dashboard returns results in 10 seconds or less Avoid redundancy: The display does not repeat information in multiple places In this tutorial, we learn to create different types of dashboards using Splunk. We will also discuss how to gather business requirements for your dashboards. Types of Splunk dashboards There are three kinds of dashboards typically created with Splunk: Dynamic form-based dashboards Real-time dashboards Dashboards as scheduled reports Dynamic form-based dashboards allow Splunk users to modify the dashboard data without leaving the page. This is accomplished by adding data-driven input fields (such as time, radio button, textbox, checkbox, dropdown, and so on) to the dashboard. Updating these inputs changes the data based on the selections. Dynamic form-based dashboards have existed in traditional business intelligence tools for decades now, so users who frequently use them will be familiar with changing prompt values on the fly to update the dashboard data. Real-time dashboards are often kept on a big panel screen for constant viewing, simply because they are so useful. You see these dashboards in data centers, network operations centers (NOCs), or security operations centers (SOCs) with constant format and data changing in real time. The dashboard will also have indicators and alerts for operators to easily identify and act on a problem. Dashboards like this typically show the current state of security, network, or business systems, using indicators for web performance and traffic, revenue flow, login failures, and other important measures. Dashboards as scheduled reports may not be exposed for viewing; however, the dashboard view will generally be saved as a PDF file and sent to email recipients at scheduled times. This format is ideal when you need to send information updates to multiple recipients at regular intervals, and don't want to force them to log in to Splunk to capture the information themselves. We will create the first two types of dashboards, and you will learn how to use the Splunk dashboard editor to develop advanced visualizations along the way. Gathering business requirements As a Splunk administrator, one of the most important responsibilities is to be responsible for the data. As a custodian of data, a Splunk admin has significant influence over how to interpret and present information to users. It is common for the administrator to create the first few dashboards. A more mature implementation, however, requires collaboration to create an output that is beneficial to a variety of user requirements and may be completed by a Splunk development resource with limited administrative rights. 
Make it a habit to consistently request users input regarding the Splunk delivered dashboards and reports and what makes them useful. Sit down with day-to-day users and layout, on a drawing board, for example, the business process flows or system diagrams to understand how the underlying processes and systems you're trying to measure really work. Look for key phrases like these, which signify what data is most important to the business: If this is broken, we lose tons of revenue... This is a constant point of failure... We don't know what's going on here... If only I can see the trend, it will make my work easier... This is what my boss wants to see... Splunk dashboard users may come from many areas of the business. You want to talk to all the different users, no matter where they are on the organizational chart. When you make friends with the architects, developers, business analysts, and management, you will end up building dashboards that benefit the organization, not just individuals. With an initial dashboard version, ask for users thoughts as you observe them using it in their work and ask what can be improved upon, added, or changed. We hope that at this point, you realize the importance of dashboards and are ready to get started creating some, as we will do in the following sections. Dynamic form-based dashboard In this section, we will create a dynamic form-based dashboard in our Destinations app to allow users to change input values and rerun the dashboard, presenting updated data. Here is a screenshot of the final output of this dynamic form-based dashboard: Let's begin by creating the dashboard itself and then generate the panels: Go the search bar in the Destinations app Run this search command: SPL> index=main status_type="*" http_uri="*" server_ip="*" | top status_type, status_description, http_uri, server_ip Be careful when copying commands with quotation marks. It is best to type in the entire search command to avoid problems. Go to Save As | Dashboard Panel Fill in the information based on the following screenshot: Click on Save Close the pop-up window that appears (indicating that the dashboard panel was created) by clicking on the X in the top-right corner of the window Creating a Status Distribution panel We will go to the after all the panel searches have been generated. Let's go ahead and create the second panel: In the search window, type in the following search command: SPL> index=main status_type="*" http_uri=* server_ip=* | top status_type You will save this as a dashboard panel in the newly created dashboard. In the Dashboard option, click on the Existing button and look for the new dashboard, as seen here. Don't forget to fill in the Panel Title as Status Distribution: Click on Save when you are done and again close the pop-up window, signaling the addition of the panel to the dashboard. Creating the Status Types Over Time panel Now, we'll move on to create the third panel: Type in the following search command and be sure to run it so that it is the active search: SPL> index=main status_type="*" http_uri=* server_ip=* | timechart count by http_status_code You will save this as a Dynamic Form-based Dashboard panel as well. Type in Status Types Over Time in the Panel Title field: Click on Save and close the pop-up window, signaling the addition of the panel to the dashboard. Creating the Hits vs Response Time panel Now, on to the final panel. 
Run the following search command: SPL> index=main status_type="*" http_uri=* server_ip=* | timechart count, avg(http_response_time) as response_time Save this dashboard panel as Hits vs Response Time: Arrange the dashboard We'll move on to look at the dashboard we've created and make a few changes: Click on the View Dashboard button. If you missed out on the View Dashboard button, you can find your dashboard by clicking on Dashboards in the main navigation bar. Let's edit the panel arrangement. Click on the Edit button. Move the Status Distribution panel to the upper-right row. Move the Hits vs Response Time panel to the lower-right row. Click on Save to save your layout changes. Look at the following screenshot. The dashboard framework you've created should now look much like this. The dashboard probably looks a little plainer than you expected it to. But don't worry; we will improve the dashboard visuals one panel at a time: Panel options in dashboards In this section, we will learn how to alter the look of our panels and create visualizations. Go to the edit dashboard mode by clicking on the Edit button. Each dashboard panel will have three setting options to work with: edit search, select visualization, and visualization format options. They are represented by three drop-down icons: The Edit Search window allows you to modify the search string, change the time modifier for the search, add auto-refresh and progress bar options, as well as convert the panel into a report: The Select Visualization dropdown allows you to change the type of visualization to use for the panel, as shown in the following screenshot: Finally, the Visualization Options dropdown will give you the ability to fine-tune your visualization. These options will change depending on the visualization you select. For a normal statistics table, this is how it will look: Pie chart – Status Distribution Go ahead and change the Status Distribution visualization panel to a pie chart. You do this by selecting the Select Visualization icon and selecting the Pie icon. Once done, the panel will look like the following screenshot: Stacked area chart – Status Types Over Time We will change the view of the Status Types Over Time panel to an area chart. However, by default, area charts will not be stacked. We will update this through adjusting the visualization options: Change the Status Types Over Time panel to an Area Chart using the same Select Visualization button as the prior pie chart exercise. Make the area chart stacked using the Format Visualization icon. In the Stack Mode section, click on Stacked. For Null Values, select Zero. Use the chart that follows for guidance: Click on Apply. The panel will change right away. Remove the _time label as it is already implied. You can do this in the X-Axis section by setting the Title to None. Close the Format Visualization window by clicking on the X in the upper-right corner: Here is the new stacked area chart panel: Column with overlay combination chart – Hits vs Response Time When representing two or more kinds of data with different ranges, using a combination chart—in this case combining a column and a line—can tell a bigger story than one metric and scale alone. 
We'll use the Hits vs Response Time panel to explore the combination charting options: In the Hits vs Response Time panel, change the chart panel visualization to Column In the Visualization Options window, click on Chart Overlay In the Overlay selection box, select response_time Turn on View as Axis Click on X-Axis from the list of options on the left of the window and change the Title to None Click on Legend from the list of options on the left Change the Legend Position to Bottom Click on the X in the upper-right-hand corner to close the Visualization Options window The new panel will now look similar to the following screenshot. From this and the prior screenshot, you can see there was clearly an outage in the overnight hours: Click on Done to save all the changes you made and exit the Edit mode The dashboard has now come to life. This is how it should look now: To summarize we saw how to create different types of dashboards. To know more about core Splunk functionalities to transform machine data into powerful insights, check out this book Splunk 7 Essentials, Third Edition. Splunk leverages AI in its monitoring tools Splunk Industrial Asset Intelligence (Splunk IAI) targets Industrial IoT marketplace Create a data model in Splunk to enable interactive reports and dashboards

How to Debug an application using Qt Creator

Gebin George
27 Apr 2018
9 min read
Today, we will learn about debugging an application using Qt Creator. A debugger is a program that can be used to test and debug other programs, in case of a sudden crash during the program execution or an unexpected behavior in the logic of the program. Most of the time (if not always), debuggers are used in the development environment and in conjunction with an IDE. In our case, we will learn how to use a debugger with Qt Creator. It is important to note that debuggers are not part of the Qt Framework, and, just like compilers, they are usually provided by the operating system SDK. Qt Creator automatically detects and uses debuggers if they are present on a system. This can be checked by navigating into the Qt Creator Options page via the main menu Tools and then Options. Make sure to select Build & Run from the list on the left side and then switch to the Debuggers tab from the top. You should be able to see one or more autodetected debuggers on the list. [box type="info" align="" class="" width=""]Windows Users: You should see something similar to the screenshot after this information box. If not, this means you have not installed any debuggers. You can easily download and install it using the instructions provided here: https:/ / docs. microsoft. com/ en- us/ windows- hardware/ drivers/debugger/ Or, you can independently search for the following topic online: Debugging Tools for Windows (WinDbg, KD, CDB, NTSD). Nevertheless, after the debugger is installed (assumingly, CDB or Microsoft Console Debugger for Microsoft Visual C++ Compilers and GDB for GCC Compilers), you can restart Qt Creator and return to this page. You should be able to have one or more entries similar to the following. Since we have installed a 32-bit version of the Qt and OpenCV Frameworks, choose the entry with x86 in its name to view its path, type, and other properties. macOS and Linux Users: There shouldn't be any action needed on your part and, depending on the OS, you'll see a GDB, LLDB, or some other debugger in the entries.[/box] Here's the screenshot of the Build & Run tab on the Options page: Depending on the operating system and the installed debugger, the preceding screenshot might be slightly different. Nevertheless, you'll have a debugger that you need to make sure is correctly set as the debugger for the Qt Kit you are using. So, make a note of the debugger path and name and switch to the Kits tab, and, after selecting the Qt Kit you were using, make sure the debugger for it is correctly set, as you can see in the following screenshot: Don't worry about choosing the wrong debugger, or any other options, since you'll be warned with relevant icons beside the Qt Kit icon selected at the top. The icon seen in the following image on the left side is usually displayed when everything is okay with the Kit, the second one from the left is an indication that something is not right, and the one on the right means a critical error. Move your mouse over the icon when it appears to see more information about the required actions needed to fix the issue: [box type="info" align="" class="" width=""]Critical issues with Qt Kits can be caused by many different factors such as a missing compiler which will make the kit completely useless until the issue is resolved. 
An example of a warning message in a Qt Kit would be a missing debugger, which will not make the kit useless, but you won't be able to use the debugger with it, thus it means less functionality than a completely configured Qt Kit.[/box] After the debugger is correctly set, you can start debugging your applications in one of the following ways, which basically have the same result: ending up in the Debugger view of the Qt Creator: Starting an application in Debugging mode Attaching to a running application (or process) [box type="info" align="" class="" width=""]Note that a debugging process can be started in many ways, such as remotely, by attaching to a process running on a separate machine and so on. However, the preceding methods will suffice for most cases and especially for the ones relevant to the Qt+OpenCV application development and what we learned throughout this book.[/box] Getting started with the debugging mode To start an application in the debugging mode, after opening a Qt project, you can use one of the following methods: Pressing the F5 button Using the Start Debugging button, right below the usual Run button with a similar icon, but with a small bug on it Using the main menu entries in the following order: Debug/Start Debugging/Start Debugging. To attach the debugger to a running application, you can use the main menu entries in the following order: Debug/Start Debugging/Attach to Running Application. This will open up the List of Processes window, from which you can choose your application or any other process you want to debug using its process ID or executable name. You can also use the Filter field (as seen in the following image) to find your application, since, most probably, the list of processes will be quite a long one. After choosing the correct process, make sure to press the Attach to Process button. No matter which one of the preceding methods you use, you will end up in the Qt Creator Debug mode, which is quite similar to the Edit mode, but it also allows you to do the following, among many others: Add, Enable, Disable, and View Breakpoints in the code (a Breakpoint is simply a point or a line in the code that we want the debugger to pause in the process and allow us to do a more detailed analysis of the status of the program) Interrupt running programs and processes to view and examine the code View and examine the function call stack (the call stack is a stack containing the hierarchical list of functions that led to a breakpoint or interrupted state) View and examine the variables Disassemble the source codes (disassembling in this sense means extracting the exact instructions that correspond to the function calls and other C++ codes in our program) You'll notice a performance drop in the application when it is started in debugging mode, which is obviously because of the fact that codes are being monitored and traced by the debugger. Here's a screenshot of the Qt Creator Debug mode, in which all of the capabilities mentioned earlier are visible in a single window and in the Debug mode of the Qt Creator: The area specified with the number 1 in the preceding screenshot in the code editor that you have already used through the book and are quite familiar with. Each line of code has a line number; you can click on their left side to toggle a breakpoint anywhere you want in the code. 
You can also right-click on the line numbers to set, remove, disable, or enable a breakpoint by selecting Set Breakpoint at Line X, Remove Breakpoint X, Disable Breakpoint X, or Enable Breakpoint X, where X in all of the commands mentioned here needs to be replaced by the line number. Apart from the code editor, you can also use the area mentioned with number 4 in the preceding screenshot to add, delete, edit, and further modify breakpoints in the code. You can also right-click on the same toolbar below the code editor that contains the debugger controls to open up the following menu and add or remove more panes to display additional debug and analysis information. We will cover the default debugger view, but make sure to check out each one of the following options on your own to familiarize yourself with the debugger even more: The area specified with number 2 in the preceding code can be used to view the call stack. Whether you interrupt the program by pressing the Interrupt button or choosing Debug/Interrupt from the menu while the it is running, set a breakpoint and stop the program in a specific line of code, or a malfunctioning code causes the program to fall into a trap and pause the process (since a crash and exception will be caught by the debugger), you can always view the hierarchy of function calls that led to the interrupted state, or further analyze them by checking the area 2 in the preceding Qt Creator screenshot. Finally, you can use the third area in the previous screenshot to view the local and global variables of the program in the interrupted location in the code. You can see the contents of the variables, whether they are standard data types, such as integers and floats or structures and classes, and also you can further expand and analyze their content to test and analyze any possible issues in your code. Using a debugger efficiently can mean hours of difference in testing and solving the issues in your code. In terms of practical usage of the debuggers, there is really no other way but to use it as much as you can and develop habits of your own to use the debugger, but also make note of good practices and tricks you found along the way and the ones we just went through. If you are interested, you can also read online about other possible methods of debugging, such as remote debugging, debugging using crash dump files (on Windows), and more. We saw how to practically debug an application using QT debugging mode. [box type="note" align="" class="" width=""]You read an excerpt from the book, Computer Vision with OpenCV 3 and Qt 5 written by Amin Ahmadi Tazehkandi.  The book covers development of cross-platform applications using OpenCV 3 and Qt 5.[/box] 3 ways to deploy a QT and OpenCV application Debugging Your .NET Application    


How to perform exception handling in Python with ‘try, catch and finally’

Guest Contributor
10 Dec 2019
9 min read
An integral part of using Python is the art of handling exceptions. There are primarily two types of exceptions: built-in exceptions and user-defined exceptions. When an error occurs, the interpreter saves the state of execution at the moment of the error and interrupts the normal program flow to run a special piece of code called an exception handler. There are many kinds of errors, such as 'division by zero' or 'file open error', that an exception handler needs to deal with; handling them allows the program to continue based on the data saved up to that point. (Image source: Eyehunts Tutorial)

Exception handling in Python is quite similar to Java. The code that may raise an exception is embedded in a try block. Where Java uses catch clauses to catch exceptions, Python uses clauses that begin with except. Custom-made exceptions are also possible in Python by using the raise statement, which forces a specified exception to occur.

Reason to use exceptions

Errors are always to be expected while writing a program in Python, which is why a backup mechanism is required. Such a mechanism handles any encountered errors; not having one may crash the program completely. The reason to equip a Python program with an exception mechanism is to define a backup plan in case a possible error situation erupts while executing it.

Catch exceptions in Python

The try statement is used for handling exceptions in Python. A try clause contains the critical operation that can raise an exception; the code that handles the exception is written in the except clause. What to do once the exception is caught is up to the programmer. The program described here loops until the user enters an integer value that has a valid reciprocal; a runnable version of this loop is shown below. The part of the code that can trigger an exception is contained inside the try block. If no exception is raised, the normal flow of execution continues and the except block is skipped; if an exception is raised, it is caught by the except block. Naming the exception is possible by using the exc_info() function present inside the sys module, and the program then asks the user to make another attempt. Any unexpected value like 'a' or '1.3' triggers a ValueError, while an input of '0' leads to a ZeroDivisionError.

Exception handling in Python: try, except and finally

There are instances where suspicious code may raise exceptions; such code is placed inside a try statement block. The code dedicated to handling the raised exceptions is placed within the except block. Below is the general shape of a try and except statement in Python:

try:
    # operational/suspicious code
except SomeException:
    # code to handle the exception

How do they work in Python? The statements in the try block are executed first, and Python checks whether any exception occurs within them. If no exception occurs, the except block is skipped once the try block completes. When an exception is raised and it matches the exception named in the except clause (SomeException above), the except block handles it and enables the program to continue.
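As a concrete, runnable illustration of this flow, here is one minimal way to write the reciprocal loop described earlier, using sys.exc_info() to report which exception was raised (the prompt text is only an example):

import sys

# Keep asking until the user enters an integer with a valid reciprocal
while True:
    try:
        num = int(input("Enter an integer: "))
        print("The reciprocal of", num, "is", 1 / num)
        break
    except:
        # sys.exc_info() returns (type, value, traceback) for the active exception
        print("Oops!", sys.exc_info()[0], "occurred. Please try again.")

Entering a value such as 'a' or '1.3' raises a ValueError, while 0 raises a ZeroDivisionError; in both cases the loop simply asks again.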
In case of absence of any corresponding handlers that deals with the ones to be found in the except block then the activity of program execution is halted along with the error defining it. Defining Except without the exception To define the Except Clause isn’t always a viable option regardless of which programming language is used. As equipping the execution with the try-except clause is capable of handling all the possible types of exceptions. It will keep users ignorant about whether the exception was even raised in the first place. It is also a good idea to use the except statement without the exceptions field, for example some of the statements are defined below: try:    You do your operations here;    ...................... except:    If there is an exception, then execute this block.    ...................... else:    If there is no exception then execute this block.  OR, follow the below-defined syntax: try:   #do your operations except:   #If there is an exception raised, execute these statements else:   #If there is no exception, execute these statements Here is an example if the intent is to catch an exception within the file. This is useful when the intention is to read the file but it does not exist. try:   fp = open('example.txt', r) except:   print ('File is not found')   fp.close This example deals with opening the 'example.txt'. In such cases, when the called upon file is not found or does not exist then the code executes the except block giving the error read like 'File is not found'. Defining except clause for multiple exceptions It is possible to deal with multiple exceptions in a single block using the try statement. It allows doing so by enabling programmers to specify the different exception handlers. Also, it is recommended to define a particular exception within the code as a part of good programming practice. The better way out in such cases is to define the multiple exceptions using the same, above-mentioned except clause. And it all boils down to the process of execution wherein if the interpreter gets hold of a matching exception, then the code written under the except code will be executed. One way to do is by defining a tuple that can deal with the predefined multiple exceptions within the except clause. The below example shows the way to define such exceptions: try:    # do something  except (Exception1, Exception2, ..., ExceptionN):    # handle multiple exceptions    pass except:    # handle all other exceptions You can also use the same except statement to handle multiple exceptions as follows − try:    You do your operations here;    ...................... except(Exception1[, Exception2[,...ExceptionN]]]):    If there is an exception from the given exception list,     then execute this block.    ...................... else:    If there is no exception then execute this block.  Exception handling in Python using the try-finally clause Apart from implementing the try and except blocks within one, it is also a good idea to put together try and finally blocks. Here, the final block will carry all the necessary statements required to be executed regardless of the exception being raised in the try block. One benefit of using this method is that it helps in releasing external resources and clearing up the cache memories beefing up the program. Here is the pseudo-code for try..finally clause. try:    # perform operations finally:    #These statements must be executed Defining exceptions in try... 
finally block The example given below executes an event that shuts the file once all the operations are completed. try:    fp = open("example.txt",'r')    #file operations finally:    fp.close() Again, using the try statement in Python, it is wise to consider that it also comes with an optional clause – finally. Under any given circumstances, this code is executed which is usually put to use for releasing the additional external resource. It is not new for the developers to be connected to a remote data centre using a network. Also, there are chances of developers working with a file loaded with Graphic User Interface. Such situations will push the developers to clean up the used resources. Even if the resources used, yield successful results, such post-execution steps are always considered as a good practice. Actions like shutting down the GUI, closing a file or even disconnecting from a connected network written down in the finally block assures the execution of the code. The finally block is something that defines what must be executed regardless of raised exceptions. Below is the syntax used for such purpose: The file operations example below illustrates this very well: try: f = open("test.txt",encoding = 'utf-8') # perform file operations finally: f.close() Or In simpler terms: try:    You do your operations here;    ......................    Due to any exception, this may be skipped. finally:    This would always be executed.    ...................... Constructing such a block is a better way to ensure the file is closed even if the exception has taken place. Make a note that it is not possible to use the else clause along with the above-defined finally clause. Understanding user-defined exceptions Python users can create exceptions and it is done by deriving classes out of the built-in exceptions that come as standard exceptions. There are instances where displaying any specific information to users is crucial, especially upon catching the exception. In such cases, it is best to create a class that is subclassed from the RuntimeError. For that matter, the try block will raise a user-defined exception. The same is caught in the except block. Creating an instance of the class Networkerror will need the user to use variable e. Below is the syntax: class Networkerror(RuntimeError):    def __init__(self, arg):       self.args = arg   Once the class is defined, raising the exception is possible by following the below-mentioned syntax. try:    raise Networkerror("Bad hostname") except Networkerror,e:    print e.args Key points to remember Note that an exception is an error that occurs while executing the program indicating such events (error) occur though less frequently. As mentioned in the examples above, the most common exceptions are ‘divisible by 0’, ‘attempt to access non-existent file’ and ‘adding two non-compatible types’. Ensure putting up a try statement with a code where you are not sure whether or not the exception will occur. Specify an else block alongside try-except statement which will trigger when there is no exception raised in a try block. Author bio Shahid Mansuri Co-founder Peerbits, one of the leading software development company, USA, founded in 2011 which provides Python development services. Under his leadership, Peerbits used Python on a project to embed reports & researches on a platform that helped every user to access the dashboard that was freely available and also to access the dashboard that was exclusively available. 
His visionary leadership and flamboyant management style have yielded fruitful results for the company. He believes in sharing his strong knowledge base, with a keen focus on entrepreneurship and business.

Introducing Spleeter, a Tensorflow based python library that extracts voice and sound from any music track
Fake Python libraries removed from PyPi when caught stealing SSH and GPG keys, reports ZDNet
There's more to learning programming than just writing code