Combining Vector and Raster Datasets

Packt
22 Dec 2014
12 min read
This article by Michael Dorman, the author of Learning R for Geospatial Analysis, explores the interplay between vector and raster layers and the way it is implemented in the raster package. Examples will demonstrate how rasters and vector layers can be interchanged and how one can be queried according to the other.

Creating vector layers from a raster

The opposite operation to rasterization, which was presented in the previous section, is the creation of vector layers from raster data. Extracting features of interest out of rasters, in the form of vector layers, is often necessary for reasons analogous to those underlying rasterization: when the data held in a raster is better represented using a vector layer for specific subsequent analysis or visualization tasks. Scenarios where we need to create points, lines, and polygons from a raster can all be encountered. In this section, we are going to see an example of each.

Raster-to-points conversion

In raster-to-points conversion, each raster cell center (excluding NA cells) is converted to a point. The resulting point layer has an attribute table holding the values of the respective raster cells. Conversion to points can be done with the rasterToPoints function. This function has a parameter named spatial that determines whether the returned object is going to be a SpatialPointsDataFrame or simply a matrix holding the coordinates and the respective cell values (spatial=FALSE, the default value). For our purposes, it is thus important to remember to specify spatial=TRUE.

As an example of a raster, let's create a subset of the raster r, with only layers 1-2, rows 1-3, and columns 1-3:

> u = r[[1:2]][1:3, 1:3, drop = FALSE]

To make the example more instructive, we will place NA in some of the cells and see how this affects the raster-to-points conversion:

> u[2, 3] = NA
> u[[1]][3, 2] = NA

Now, we will apply rasterToPoints to create a SpatialPointsDataFrame object named u_pnt out of u:

> u_pnt = rasterToPoints(u, spatial = TRUE)

Let's visually examine the result we got, with the first layer of u serving as the background:

> plot(u[[1]])
> plot(u_pnt, add = TRUE)

The graphical output is shown in the following screenshot:

We can see that a point has been produced at the center of each raster cell, except for the cell at position (2,3), where we assigned NA to both layers. However, at the (3,2) position, NA has been assigned to only one of the layers (the first one); therefore, a point feature has been generated there nevertheless. The attribute table of u_pnt has eight rows (since there are eight points) and two columns (corresponding to the raster layers):

> u_pnt@data
  layer.1 layer.2
1  0.4242  0.4518
2  0.3995  0.3334
3  0.4190  0.3430
4  0.4495  0.4846
5  0.2925  0.3223
6  0.4998  0.5841
7      NA  0.5841
8  0.7126  0.5086

We can see that the seventh point feature, the one corresponding to the (3,2) raster position, indeed contains an NA value for layer 1.

Raster-to-contours conversion

Creating points (see the previous section) and polygons (see the next section) from a raster is relatively straightforward: in the former case, points are generated at cell centroids, while in the latter, rectangular polygons are drawn according to cell boundaries. Lines, on the other hand, can be created from a raster using various algorithms designed for more specific purposes.
Two common procedures where lines are generated from a raster are constructing contours (lines connecting locations of equal value on the raster) and finding least-cost paths (lines going from one location to another along the easiest route, where the cost of passage is defined by raster values). In this section, we will see an example of how to create contours (readers interested in least-cost path calculation can refer to the gdistance package, which provides this capability in R).

As an example, we will create contours from the DEM of Haifa (dem). Creating contours can be done using the rasterToContour function. This function accepts a RasterLayer object and returns a SpatialLinesDataFrame object with the contour lines. The rasterToContour function internally uses the base function contourLines, and arguments can be passed to the latter as part of the rasterToContour function call. For example, using the levels parameter, we can specify the breaks where contours will be generated (rather than letting them be determined automatically). The raster dem consists of elevation values ranging between -14 meters and 541 meters:

> range(dem[], na.rm = TRUE)
[1] -14 541

Therefore, we may choose to generate six contour lines, at the 0, 100, 200, …, 500 meter levels:

> dem_contour = rasterToContour(dem, levels = seq(0, 500, 100))

Now, we will plot the resulting SpatialLinesDataFrame object on top of the dem raster:

> plot(dem)
> plot(dem_contour, add = TRUE)

The graphical output is shown in the following screenshot:

Mount Carmel is densely covered with elevation contours compared to the plains surrounding it, which are mostly within the 0-100 meter elevation range and thus have only a few contour lines. Let's take a look at the attribute table of dem_contour:

> dem_contour@data
    level
C_1     0
C_2   100
C_3   200
C_4   300
C_5   400
C_6   500

Indeed, the layer consists of six line features, one for each break we specified with the levels argument.

Raster-to-polygons conversion

As mentioned previously, raster-to-polygons conversion involves the generation of rectangular polygons in the place of each raster cell (once again, excluding NA cells). Similar to the raster-to-points conversion, the resulting attribute table contains the respective raster values for each polygon created. The conversion to polygons is most useful with categorical rasters, when we would like to generate polygons defining certain areas in order to exploit the analysis tools associated with this type of data (such as extraction of values from other rasters, geometry editing, and overlay).

Creation of polygons from a raster can be performed with a function whose name the reader may have already guessed: rasterToPolygons. A useful option in this function is to immediately dissolve the resulting polygons according to their attribute table values; that is, all polygons having the same value are dissolved into a single feature. This functionality internally utilizes the rgeos package, and it can be triggered by specifying dissolve=TRUE.

In our next example, we will visually compare the average NDVI time series of the Lahav and Kramim forests (see earlier), based on all of our Landsat (three dates) and MODIS (280 dates) satellite images. In this article, we will only prepare the necessary data by going through the following intermediate steps:

1. Creating the Lahav and Kramim forests polygonal layer.
2. Extracting NDVI values from the satellite images.
3. Creating a data.frame object that can be passed to graphical functions later.
Commencing with the first step, using l_rec_focal_clump, we will first create a polygonal layer holding all NDVI>0.2 patches, and then subset only the two polygons corresponding to the Lahav and Kramim forests. The former is achieved using rasterToPolygons with dissolve=TRUE, converting the patches in l_rec_focal_clump to 507 individual polygons in a new SpatialPolygonsDataFrame that we hereby name pol:

> pol = rasterToPolygons(l_rec_focal_clump, dissolve = TRUE)

Plotting pol will show that we have quite a few large patches and many small ones. Since the Lahav and Kramim forests are relatively large, to make things easier, we can omit all polygons with an area less than or equal to 1 km²:

> pol$area = gArea(pol, byid = TRUE) / 1000^2
> pol = pol[pol$area > 1, ]

The attribute table shows that we are left with eight polygons, with area sizes of 1-10 km². The clumps column, by the way, is where the original l_rec_focal_clump raster value (the clump ID) has been kept ("clumps" is the name of the l_rec_focal_clump raster layer from which the values came):

> pol@data
    clumps    area
112      2  1.2231
114    200  1.3284
137    221  1.9314
203    281  9.5274
240    314  6.7842
371    432  2.0007
445      5 10.2159
460     56  1.0998

Let's make a map of pol:

> plotRGB(l_00, r = 3, g = 2, b = 1, stretch = "lin")
> plot(pol, border = "yellow", lty = "dotted", add = TRUE)

The graphical output is shown in the following screenshot:

The preceding screenshot shows the continuous NDVI>0.2 patches, 1 km² or larger, within the studied area. Two of these, as expected, are the forests we would like to examine. How can we select them? Obviously, we could export pol to a Shapefile and select the features of interest interactively in GIS software (such as QGIS), then import the result back into R to continue our analysis. The raster package also offers some capabilities for interactive selection (which we do not cover here); for example, a function named click can be used to obtain the properties of the pol features we click on in a graphical window such as the one shown in the preceding screenshot. However, given the purpose of this book, we will try to write code that makes the selection automatically, without further user input.

To write code that makes the selection, we must choose a certain criterion (either spatial or nonspatial) that separates the features of interest. In this case, for example, we can see that the pol features we wish to select are those closest to Lahav Kibbutz. Therefore, we can utilize the towns point layer (see earlier) to find the distance of each polygon from Lahav Kibbutz, and select the two most proximate ones. Using the gDistance function, we will first find the distances between each polygon in pol and each point in towns:

> dist_towns = gDistance(towns, pol, byid = TRUE)
> dist_towns
              1         2
112 14524.94060 12697.151
114  5484.66695  7529.195
137  3863.12168  5308.062
203    29.48651  1119.090
240  1910.61525  6372.634
371 11687.63594 11276.683
445 12751.21123 14371.268
460 14860.25487 12300.319

The returned matrix, named dist_towns, contains the pairwise distances, with rows corresponding to the pol features and columns corresponding to the towns features. Since Lahav Kibbutz corresponds to the first towns feature (column "1"), we can already see that the fourth and fifth pol features (rows "203" and "240") are the most proximate ones, thus corresponding to the Lahav and Kramim forests. We could subset both forests by simply using their IDs: pol[c("203","240"),].
However, as always, we are looking for general code that will select, in this case, the two closest features irrespective of their specific IDs or row indices. For this purpose, we can use the order function, which we have not encountered so far. Given a numeric vector, this function returns the element indices in increasing order according to the element values. For example, applying order to the first column of dist_towns, we can see that the smallest element in this column is in the fourth row, the second smallest is in the fifth row, the third smallest is in the third row, and so on:

> dist_order = order(dist_towns[, 1])
> dist_order
[1] 4 5 3 2 6 7 1 8

We can use this result to select the relevant features of pol as follows:

> forests = pol[dist_order[1:2], ]

The subset SpatialPolygonsDataFrame, named forests, now contains only the two features from pol corresponding to the Lahav and Kramim forests:

> forests@data
    clumps   area
203    281 9.5274
240    314 6.7842

Let's visualize forests within the context of the other data we have by now. We will plot, once again, l_00 as the RGB background and pol on top of it. In addition, we will plot forests (in red) and the location of Lahav Kibbutz (as a red point). We will also add labels for each feature in pol, corresponding to its distance (in meters) from Lahav Kibbutz:

> plotRGB(l_00, r = 3, g = 2, b = 1, stretch = "lin")
> plot(towns[1, ], col = "red", pch = 16, add = TRUE)
> plot(pol, border = "yellow", lty = "dotted", add = TRUE)
> plot(forests, border = "red", lty = "dotted", add = TRUE)
> text(gCentroid(pol, byid = TRUE),
+ round(dist_towns[, 1]),
+ col = "White")

The graphical output is shown in the following screenshot:

The preceding screenshot demonstrates that we did indeed correctly select the features of interest. We can also assign the forest names to the attribute table of forests, relying on our knowledge that the first feature of forests (ID "203") is larger and more proximate to Lahav Kibbutz, and thus corresponds to the Lahav forest, while the second feature (ID "240") corresponds to Kramim:

> forests$name = c("Lahav", "Kramim")
> forests@data
    clumps   area   name
203    281 9.5274  Lahav
240    314 6.7842 Kramim

We now have a polygonal layer named forests, with two features delineating the Lahav and Kramim forests, named accordingly in the attribute table. In the next section, we will proceed with extracting the NDVI data for these forests.

Summary

In this article, we closed the gap between the two main spatial data types (rasters and vector layers). We now know how to make the conversion from a vector layer to a raster and vice versa, and we can transfer the geometry and data components from one data model to another when the need arises. We also saw how raster values can be extracted from a raster according to a vector layer, which is a fundamental step in many analysis tasks involving raster data.

Pipeline and Producer-consumer Design Patterns

Packt
20 Dec 2014
48 min read
In this article, created by Rodney Ringler, the author of C# Multithreaded and Parallel Programming, we will explore two popular design patterns for solving concurrent problems: Pipeline and producer-consumer. Both are used in developing parallel applications with the TPL.

A Pipeline design is one where an application is designed with multiple tasks or stages of functionality, with queues of work items between them. For each stage, the application reads from a queue of work to be performed, executes the work on that item, and then queues the results for the next stage. By designing the application this way, all of the stages can execute in parallel. Each stage just reads from its work queue, performs the work, and puts the results into the queue for the next stage. Each stage is a task and can run independently of the other stages or tasks. Stages continue executing until their queue is empty and marked completed, and they block and wait for more work items if the queue is empty but not yet completed.

The producer-consumer design pattern is a similar but distinct concept. In this design, we have a set of functionality that produces data, which is then consumed by another set of functionality. Each set of functionality is a TPL task. So, we have a producer task and a consumer task, with a buffer between them. Each of these tasks can run independently of the other, and we can also have multiple producer tasks and multiple consumer tasks. The producers run independently and enqueue their results to the buffer. The consumers run independently, dequeue items from the buffer, and perform work on them. A producer can block if the buffer is full, waiting for room to become available before producing more results. Likewise, a consumer can block if the buffer is empty, waiting for more results to become available to consume.

In this article, you will learn the following:

- Designing an application with a Pipeline design
- Designing an application with a producer-consumer design
- Learning how to use BlockingCollection
- Learning how to use BufferBlock
- Understanding the classes of the System.Threading.Tasks.Dataflow library

Pipeline design pattern

The Pipeline design is very useful in parallel design when you can divide an application into a series of tasks to be performed in such a way that each task can run concurrently with the others. It is important that the output of each task is in the same order as the input. If the order does not matter, then a parallel loop can be performed instead. When the order matters and we don't want to wait until all items have completed task A before any items start executing task B, a Pipeline implementation is perfect. Some applications that lend themselves to pipelining are video streaming, compression, and encryption. In each of these examples, we need to perform a set of tasks on the data and preserve the data's order, but we do not want to wait for each item of data to finish one task before any of the data can start the next task.

The key class that .NET provides for implementing this design pattern is BlockingCollection in the System.Collections.Concurrent namespace. The BlockingCollection class was introduced with .NET 4.0. It is a thread-safe collection specifically designed for producer-consumer and Pipeline design patterns. It supports concurrent adding and removing of items by multiple threads.
It also has methods to add and remove that block when the collection is full or empty. You can specify a maximum collection size to ensure a producing task that outpaces a consuming task does not make the queue too large. It supports cancellation tokens. Finally, it supports enumerations so that you can use the foreach loop when processing items of the collection. A producer of items to the collection can call the CompleteAdding method when the last item of data has been added to the collection. Until this method is called if a consumer is consuming items from the collection with a foreach loop and the collection is empty, it will block until an item is put into the collection instead of ending the loop. Next, we will see a simple example of a Pipeline design implementation using an encryption program. This program will implement three stages in our pipeline. The first stage will read a text file character-by-character and place each character into a buffer (BlockingCollection). The next stage will read each character out of the buffer and encrypt it by adding 1 to its ASCII number. It will then place the new character into our second buffer and write it to an encryption file. Our final stage will read the character out of the second buffer, decrypt it to its original character, and write it out to a new file and to the screen. As you will see, stages 2 and 3 will start processing characters before stage 1 has finished reading all the characters from the input file. And all of this will be done while maintaining the order of the characters so that the final output file is identical to the input file: Let's get started. How to do it First, let's open up Visual Studio and create a new Windows Presentation Foundation (WPF) application named PipeLineApplication and perform the following steps: Create a new class called Stages.cs. Next, make sure it has the following using statements. using System; using System.Collections.Concurrent; using System.Collections.Generic; using System.IO; using System.Linq; using System.Text; using System.Threading.Tasks; using System.Threading; In the MainWindow.xaml.cs file, make sure the following using statements are present: using System; using System.Collections.Concurrent; using System.Collections.Generic; using System.IO; using System.Linq; using System.Text; using System.Threading.Tasks; using System.Threading; Next, we will add a method for each of the three stages in our pipeline. First, we will create a method called FirstStage. It will take two parameters: one will be a BlockingCollection object that will be the output buffer of this stage, and the second will be a string pointing to the input data file. This will be a text file containing a couple of paragraphs of text to be encrypted. We will place this text file in the projects folder on C:. The FirstStage method will have the following code: public void FirstStage(BlockingCollection<char> output, String PipelineInputFile)        {            String DisplayData = "";            try            {                foreach (char C in GetData(PipelineInputFile))                { //Displayed characters read in from the file.                   DisplayData = DisplayData + C.ToString();   // Add each character to the buffer for the next stage.                    output.Add(C);                  }            }            finally            {                output.CompleteAdding();             }      } Next, we will add a method for the second stage called StageWorker. 
This method will not return any values and will take three parameters. One will be a BlockingCollection value that will be its input buffer, the second one will be the output buffer of the stage, and the final one will be a file path to store the encrypted text in a data file. The code for this method will look like this: public void StageWorker(BlockingCollection<char> input, BlockingCollection<char> output, String PipelineEncryptFile)        {            String DisplayData = "";              try            {                foreach (char C in input.GetConsumingEnumerable())                {                    //Encrypt each character.                    char encrypted = Encrypt(C);                      DisplayData = DisplayData + encrypted.ToString();   //Add characters to the buffer for the next stage.                    output.Add(encrypted);                  }   //write the encrypted string to the output file.                 using (StreamWriter outfile =                            new StreamWriter(PipelineEncryptFile))                {                    outfile.Write(DisplayData);                }              }            finally            {                output.CompleteAdding();            }        } Now, we will add a method for the third and final stage of the Pipeline design. This method will be named FinalStage. It will not return any values and will take two parameters. One will be a BlockingCollection object that is the input buffer and the other will be a string pointing to an output data file. It will have the following code in it: public void FinalStage(BlockingCollection<char> input, String PipelineResultsFile)        {            String OutputString = "";            String DisplayData = "";              //Read the encrypted characters from the buffer, decrypt them, and display them.            foreach (char C in input.GetConsumingEnumerable())            {                //Decrypt the data.                char decrypted = Decrypt(C);                  //Display the decrypted data.                DisplayData = DisplayData + decrypted.ToString();                  //Add to the output string.                OutputString += decrypted.ToString();              }              //write the decrypted string to the output file.            using (StreamWriter outfile =                        new StreamWriter(PipelineResultsFile))            {                outfile.Write(OutputString);            }        } Now that we have methods for the three stages of our pipeline, let's add a few utility methods. The first of these methods will be one that reads in the input data file and places each character in the data file in a List object. This method will take a string parameter that has a filename and will return a List object of characters. It will have the following code: public List<char> GetData(String PipelineInputFile)        {            List<char> Data = new List<char>();              //Get the Source data.            using (StreamReader inputfile = new StreamReader(PipelineInputFile))            {                while (inputfile.Peek() >= 0)                {                    Data.Add((char)inputfile.Read());                }              }              return Data;        } Now we will need a method to encrypt the characters. This will be a simple encryption method. The encryption method is not really important to this exercise. This exercise is designed to demonstrate the Pipeline design, not implement the world's toughest encryption. 
This encryption will simply take each character and add one to its ASCII numerical value. The method will take a character as an input parameter and return a character. The code for it will be as follows:

public char Encrypt(char C)
{
    //Take the character, convert to an int, add 1, then convert back to a character.
    int i = (int)C;
    i = i + 1;
    C = Convert.ToChar(i);

    return C;
}

Now we will add one final method to the Stages class to decrypt a character value. It will simply do the reverse of the Encrypt method: it will take the ASCII numerical value and subtract 1. The code for this method will look like this:

public char Decrypt(char C)
{
    int i = (int)C;
    i = i - 1;
    C = Convert.ToChar(i);

    return C;
}

Now that we are done with the Stages class, let's switch our focus back to the MainWindow.xaml.cs file. First, you will need to add three using statements, for the StreamReader/StreamWriter, threading, and BlockingCollection classes:

using System.Collections.Concurrent;
using System.IO;
using System.Threading;

At the top of the MainWindow class, we need four variables available to the whole class. We need three strings that point to our three data files: the input data, encrypted data, and output data. Then we will need a Stages object. These declarations will look like this:

private static String PipelineResultsFile = @"c:\projects\OutputData.txt";
private static String PipelineEncryptFile = @"c:\projects\EncryptData.txt";
private static String PipelineInputFile = @"c:\projects\InputData.txt";
private Stages Stage;

Then, in the MainWindow constructor method, right after the InitializeComponent call, add a line to instantiate our Stages object:

//Create the Stage object and register the event listeners to update the UI as the stages work.
Stage = new Stages();

Next, add a button to the MainWindow.xaml file that will initiate the pipeline and encryption. Name this button control butEncrypt, and set its Content property to Encrypt File. Next, add a click event handler for this button in the MainWindow.xaml.cs file. Its event handler method will be butEncrypt_Click and will contain the main code for this application. It will instantiate two BlockingCollection objects for two queues: one queue between stages 1 and 2, and one queue between stages 2 and 3. This method will then create a task for each stage that executes the corresponding method from the Stages class. It will then start these three tasks and wait for them to complete. Finally, it will write the output of each stage to the input, encrypted, and results data files and text blocks for viewing. The code for it will look like the following:

private void butEncrypt_Click(object sender, RoutedEventArgs e)
{
    //PipeLine Design Pattern

    //Create queues for input and output to stages.
int size = 20;            BlockingCollection<char> Buffer1 = new BlockingCollection<char>(size);            BlockingCollection<char> Buffer2 = new BlockingCollection<char>(size);              TaskFactory tasks = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.None);              Task Stage1 = tasks.StartNew(() => Stage.FirstStage(Buffer1, PipelineInputFile));            Task Stage2 = tasks.StartNew(() => Stage.StageWorker(Buffer1, Buffer2, PipelineEncryptFile));            Task Stage3 = tasks.StartNew(() => Stage.FinalStage(Buffer2, PipelineResultsFile));              Task.WaitAll(Stage1, Stage2, Stage3);              //Display the 3 files.            using (StreamReader inputfile = new StreamReader(PipelineInputFile))            {                while (inputfile.Peek() >= 0)                {                    tbStage1.Text = tbStage1.Text + (char)inputfile.Read();                }              }            using (StreamReader inputfile = new StreamReader(PipelineEncryptFile))            {                 while (inputfile.Peek() >= 0)                {                    tbStage2.Text = tbStage2.Text + (char)inputfile.Read();                }              }            using (StreamReader inputfile = new StreamReader(PipelineResultsFile))             {                while (inputfile.Peek() >= 0)                {                    tbStage3.Text = tbStage3.Text + (char)inputfile.Read();                }              }      } One last thing. Let's add three textblocks to display the outputs. We will call these tbStage1, tbStage2, and tbStage3. We will also add three label controls with the text Input File, Encrypted File, and Output File. These will be placed by the corresponding textblocks. Now, the MainWindow.xaml file should look like the following screenshot: Now we will need an input data file to encrypt. We will call this file InputData.txt and put it in the C:projects folder on our computer. For our example, we have added the following text to it: We are all finished and ready to try it out. Compile and run the application and you should have a window that looks like the following screenshot: Now, click on the Encrypt File button and you should see the following output: As you can see, the input and output files look the same and the encrypted file looks different. Remember that Input File is the text we put in the input data text file; this is the input from the end of stage 1 after we have read the file in to a character list. Encrypted File is the output from stage 2 after we have encrypted each character. Output File is the output of stage 3 after we have decrypted the characters again. It should match Input File. Now, let's take a look at how this works. How it works Let's look at the butEncrypt click event handler method in the MainWindow.xaml.cs file, as this is where a lot of the action takes place. Let's examine the following lines of code:            //Create queues for input and output to stages.            
int size = 20;            BlockingCollection<char> Buffer1 = new BlockingCollection<char>(size);            BlockingCollection<char> Buffer2 = new BlockingCollection<char>(size);            TaskFactory tasks = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.None);              Task Stage1 = tasks.StartNew(() => Stage.FirstStage(Buffer1, PipelineInputFile));            Task Stage2 = tasks.StartNew(() => Stage.StageWorker(Buffer1, Buffer2, PipelineEncryptFile));            Task Stage3 = tasks.StartNew(() => Stage.FinalStage(Buffer2, PipelineResultsFile)); First, we create two queues that are implemented using BlockingCollection objects. Each of these is set with a size of 20 items. These two queues take a character datatype. Then we create a TaskFactory object and use it to start three tasks. Each task uses a lambda expression that executes one of the stages methods from the Stages class—FirstStage, StageWorker, and FinalStage. So, now we have three separate tasks running besides the main UI thread. Stage1 will read the input data file character by character and place each character in the queue Buffer1. Remember that this queue can only hold 20 items before it will block the FirstStage method waiting on room in the queue. This is how we know that Stage2 starts running before Stage1 completes. Otherwise, Stage1 will only queue the first 20 characters and then block. Once Stage1 has read all of the characters from the input file and placed them into Buffer1, it then makes the following call:            finally            {                output.CompleteAdding();            } This lets the BlockingCollection instance, Buffer1, to know that there are no more items to be put in the queue. So, when Stage2 has emptied the queue after Stage1 has called this method, it will not block but will instead continue until completion. Prior to the CompleteAdding method call, Stage2 will block if Buffer1 is empty, waiting until more items are placed in the queue. This is why a BlockingCollection instance was developed for Pipeline and producer-consumer applications. It provides the perfect mechanism for this functionality. When we created the TaskFactory, we used the following parameter: TaskCreationOptions.LongRunning This tells the threadpool that these tasks may run for a long time and could occasionally block waiting on their queues. In this way, the threadpool can decide how to best manage the threads allocated for these tasks. Now, let's look at the code in Stage2—the StageWorker method. We need a way to remove items in an enumerable way so that we can iterate over the queues items with a foreach loop because we do not know how many items to expect. Also, since BlockingCollection objects support multiple consumers, we need a way to remove items that no other consumer might remove. We use this method of the BlockingCollection class: foreach (char C in input.GetConsumingEnumerable()) This allows multiple consumers to remove items from a BlockingCollection instance while maintaining the order of the items. To further improve performance of this application (assuming we have enough available processing cores), we could create a fourth task that also runs the StageWorker method. So, then we would have two stages and two tasks running. This might be helpful if there are enough processing cores and stage 1 runs faster than stage 2. If this happens, it will continually fill the queue and block until space becomes available. 
But if we run multiple stage 2 tasks, then we will be able to keep up with stage 1. Then, finally we have this line of code: Task.WaitAll(Stage1, Stage2, Stage3); This tells our button handler to wait until all of the tasks are complete. Once we have called the CompleteAdding method on each BlockingCollection instance and the buffers are then emptied, all of our stages will complete and the TaskFactory.WaitAll command will be satisfied and this method on the UI thread can complete its processing, which in this application is to update the UI and data files:            //Display the 3 files.            using (StreamReader inputfile = new StreamReader(PipelineInputFile))            {                while (inputfile.Peek() >= 0)                {                    tbStage1.Text = tbStage1.Text + (char)inputfile.Read();                }              }            using (StreamReader inputfile = new StreamReader(PipelineEncryptFile))            {                while (inputfile.Peek() >= 0)                {                    tbStage2.Text = tbStage2.Text + (char)inputfile.Read();                }              }            using (StreamReader inputfile = new StreamReader(PipelineResultsFile))            {                while (inputfile.Peek() >= 0)                {                    tbStage3.Text = tbStage3.Text + (char)inputfile.Read();                }              } Next, experiment with longer running, more complex stages and multiple consumer stages. Also, try stepping through the application with the Visual Studio debugger. Make sure you understand the interaction between the stages and the buffers. Explaining message blocks Let's talk for a minute about message blocks and the TPL. There is a new library that Microsoft has developed as part of the TPL, but it does not ship directly with .NET 4.5. This library is called the TPL Dataflow library. It is located in the System.Threading.Tasks.Dataflow namespace. It comes with various dataflow components that assist in asynchronous concurrent applications where messages need to be passed between multiple tasks or the data needs to be passed when it becomes available, as in the case of a web camera streaming video. The Dataflow library's message blocks are very helpful for design patterns such as Pipeline and producer-consumer where you have multiple producers producing data that can be consumed by multiple consumers. The two that we will take a look at are BufferBlock and ActionBlock. The TPL Dataflow library contains classes to assist in message passing and parallelizing I/O-heavy applications that have a lot of throughput. It provides explicit control over how data is buffered and passed. Consider an application that asynchronously loads large binary files from storage and manipulates that data. Traditional programming requires that you use callbacks and synchronization classes, such as locks, to coordinate tasks and have access to data that is shared. By using the TPL Dataflow objects, you can create objects that process image files as they are read in from a disk location. You can set how data is handled when it becomes available. Because the CLR runtime engine manages dependencies between data, you do not have to worry about synchronizing access to shared data. Also, since the CLR engine schedules the work depending on the asynchronous arrival of data, the TPL Dataflow objects can improve performance by managing the threads the tasks run on. In this section, we will cover two of these classes, BufferBlock and ActionBlock. 
The TPL Dataflow library (System.Threading.Tasks.Dataflow) does not ship with .NET 4.5. To install System.Threading.Tasks.Dataflow, open your project in Visual Studio, select Manage NuGet Packages from under the Project menu and then search online for Microsoft.Tpl.Dataflow. BufferBlock The BufferBlock object in the Dataflow library provides a buffer to store data. The syntax is, BufferBlock<T>. The T indicates that the datatype is generic and can be of any type. All static variables of this object type are guaranteed to be thread-safe. BufferBlock is an asynchronous message structure that stores messages in a first-in-first-out queue. Messages can be "posted" to the queue by multiple producers and "received" from the queue by multiple consumers. The TPL DatafLow library provides interfaces for three types of objects—source blocks, target blocks, and propagator blocks. BufferBlock is a general-purpose message block that can act as both a source and a target message buffer, which makes it perfect for a producer-consumer application design. To act as both a source and a target, it implements two interfaces defined by the TPL Dataflow library—ISourceBlock<TOutput> and ITargetBlock<TOutput>. So, in the application that we will develop in the Producer-consumer design pattern section of this article, you will see that the producer method implements BufferBlock using the ITargetBlock interface and the consumer implements BufferBlock with the ISourceBlock interface. This will be the same BufferBlock object that they will act on but by defining their local objects with a different interface there will be different methods available to use. The producer method will have Post and Complete methods, and the consumer method will use the OutputAvailableAsync and Receive methods. The BufferBlock object only has two properties, namely Count, which is a count of the number of data messages in the queue, and Completion, which gets a task that is an asynchronous operation and completion of the message block. The following is a set of methods for this class: Referenced from http://msdn.microsoft.com/en-us/library/hh160414(v=vs.110).aspx Here is a list of the extension methods provided by the interfaces that it implements: Referenced from http://msdn.microsoft.com/en-us/library/hh160414(v=vs.110).aspx Finally, here are the interface references for this class: Referenced from http://msdn.microsoft.com/en-us/library/hh160414(v=vs.110).aspx So, as you can see, these interfaces make using the BufferBlock object as a general-purpose queue between stages of a pipeline very easy. This technique is also useful between producers and consumers in a producer-consumer design pattern. ActionBlock Another very useful object in the Dataflow library is ActionBlock. Its syntax is ActionBlock<TInput>, where TInput is an Action object. ActionBlock is a target block that executes a delegate when a message of data is received. The following is a very simple example of using an ActionBlock:            ActionBlock<int> action = new ActionBlock<int>(x => Console.WriteLine(x));              action.Post(10); In this sample piece of code, the ActionBlock object is created with an integer parameter and executes a simple lambda expression that does a Console.WriteLine when a message of data is posted to the buffer. So, when the action.Post(10) command is executed, the integer, 10, is posted to the ActionBlock buffer and then the ActionBlock delegate, implemented as a lambda expression in this case, is executed. 
In this example, since this is a target block, we would then need to call the Complete method to ensure the message block is completed. Another handy method of the BufferBlock is the LinkTo method. This method allows you to link ISourceBlock to ITargetBlock. So, you can have a BufferBlock that is implemented as an ISourceBlock and link it to an ActionBlock since it is an ITargetBlock. In this way, an Action delegate can be executed when a BufferBlock receives data. This does not dequeue the data from the message block. It just allows you to execute some task when data is received into the buffer. ActionBlock only has two properties, namely InputCount, which is a count of the number of data messages in the queue, and Completion, which gets a task that is an asynchronous operation and completion of the message block. It has the following methods: Referenced from http://msdn.microsoft.com/en-us/library/hh194684(v=vs.110).aspx The following extension methods are implemented from its interfaces: Referenced from http://msdn.microsoft.com/en-us/library/hh194684(v=vs.110).aspx Also, it implements the following interfaces: Referenced from http://msdn.microsoft.com/en-us/library/hh194684(v=vs.110).aspx Now that we have examined a little of the Dataflow library that Microsoft has developed, let's use it in a producer-consumer application. Producer-consumer design pattern Now, that we have covered the TPL's Dataflow library and the set of objects it provides to assist in asynchronous message passing between concurrent tasks, let's take a look at the producer-consumer design pattern. In a typical producer-consumer design, we have one or more producers putting data into a queue or message data block. Then we have one or more consumers taking data from the queue and processing it. This allows for asynchronous processing of data. Using the Dataflow library objects, we can create a consumer task that monitors a BufferBlock and pulls items of the data from it when they arrive. If no items are available, the consumer method will block until items are available or the BufferBlock has been set to Complete. Because of this, we can start our consumer at any time, even before the producer starts to put items into the queue. Then we create one or more tasks that produce items and place them into the BufferBlock. Once the producers are finished processing all items of data to the BufferBlock, they can mark the block as Complete. Until then, the BufferBlock object is still available to add items into. This is perfect for long-running tasks and applications when we do not know when the data will arrive. Because the producer task is implementing an input parameter of a BufferBlock as an ITargetBlock object and the consumer task is implementing an input parameter of a BufferBlock as an ISourceBlock, they can both use the same BufferBlock object but have different methods available to them. One has methods to produces items to the block and mark it complete. The other one has methods to receive items and wait for more items until the block is marked complete. In this way, the Dataflow library implements the perfect object to act as a queue between our producers and consumers. Now, let's take a look at the application we developed previously as a Pipeline design and modify it using the Dataflow library. We will also remove a stage so that it just has two stages, one producer and one consumer. 
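Before modifying the application, it may help to see the LinkTo mechanism described earlier in isolation. The following is a minimal sketch, not part of the book's example code: the names buffer and printer are ours, and it assumes the Microsoft.Tpl.Dataflow NuGet package is installed. It links a BufferBlock (acting as a source) to an ActionBlock (acting as a target), so the ActionBlock's delegate runs for each item that arrives, and PropagateCompletion lets the completion of the source flow through to the target:

using System;
using System.Threading.Tasks.Dataflow;

class LinkToSketch
{
    static void Main()
    {
        // A BufferBlock acts as the source of characters.
        BufferBlock<char> buffer = new BufferBlock<char>();

        // An ActionBlock acts as the target; its delegate runs for every item received.
        ActionBlock<char> printer = new ActionBlock<char>(c => Console.Write(c));

        // Link the source to the target. PropagateCompletion makes the ActionBlock
        // complete automatically once the BufferBlock is marked complete and drained.
        buffer.LinkTo(printer, new DataflowLinkOptions { PropagateCompletion = true });

        // Produce a few items, then mark the source block as complete.
        foreach (char c in "Hello, Dataflow!")
        {
            buffer.Post(c);
        }
        buffer.Complete();

        // Wait until the linked ActionBlock has processed everything.
        printer.Completion.Wait();
    }
}

Running this prints Hello, Dataflow! to the console; the same linking approach can be used to connect any ISourceBlock to any ITargetBlock.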
How to do it

The first thing we need to do is open Visual Studio and create a new console application called ProducerConsumerConsoleApp. We will use a console application this time just for ease; our main purpose here is to demonstrate how to implement the producer-consumer design pattern using the TPL Dataflow library. Once you have opened Visual Studio and created the project, we need to perform the following steps:

First, we need to install and add a reference to the TPL Dataflow library. The TPL Dataflow library (System.Threading.Tasks.Dataflow) does not ship with .NET 4.5. Select Manage NuGet Packages from under the Project menu and then search online for Microsoft.Tpl.Dataflow.

Now, we will need to add two using statements to our program, one for StreamReader and StreamWriter and one for the BufferBlock object:

using System.Threading.Tasks.Dataflow;
using System.IO;

Now, let's add two static strings that will point to our input data file and the encrypted data file that we output:

private static String PipelineEncryptFile = @"c:\projects\EncryptData.txt";
private static String PipelineInputFile = @"c:\projects\InputData.txt";

Next, let's add a static method that will act as our producer. This method will have the following code:

// Our Producer method.
static void Producer(ITargetBlock<char> Target)
{
    String DisplayData = "";

    try
    {
        foreach (char C in GetData(PipelineInputFile))
        {
            //Displayed characters read in from the file.
            DisplayData = DisplayData + C.ToString();

            // Add each character to the buffer for the next stage.
            Target.Post(C);
        }
    }
    finally
    {
        Target.Complete();
    }
}

Then we will add a static method to perform our consumer functionality. It will have the following code:

// This is our consumer method. It runs asynchronously.
static async Task<int> Consumer(ISourceBlock<char> Source)
{
    String DisplayData = "";

    // Read from the source buffer until the source buffer has no
    // available output data.
    while (await Source.OutputAvailableAsync())
    {
        char C = Source.Receive();

        //Encrypt each character.
        char encrypted = Encrypt(C);

        DisplayData = DisplayData + encrypted.ToString();
    }

    //Write the encrypted string to the output file.
    using (StreamWriter outfile =
                new StreamWriter(PipelineEncryptFile))
    {
        outfile.Write(DisplayData);
    }

    return DisplayData.Length;
}

Then, let's create a simple static helper method to read our input data file and put it in a List collection character by character. This will give us a character list for our producer to use. The code in this method will look like this:

public static List<char> GetData(String PipelineInputFile)
{
    List<char> Data = new List<char>();

    //Get the Source data.
    using (StreamReader inputfile = new StreamReader(PipelineInputFile))
    {
        while (inputfile.Peek() >= 0)
        {
            Data.Add((char)inputfile.Read());
        }
    }

    return Data;
}

Next, we will add a static method to encrypt our characters. This method will work like the one we used in our pipelining application. It will add one to the ASCII numerical value of the character:

public static char Encrypt(char C)
{
    //Take the character, convert to an int, add 1, then convert back to a character.
    int i = (int)C;
    i = i + 1;
    C = Convert.ToChar(i);

    return C;
}

Then, we need to add the code for our Main method. This method will start our consumer and producer tasks. Then, when they have completed processing, it will display the results in the console. The code for this method looks like this:

static void Main(string[] args)
{
    // Create the buffer block object to use between the producer and consumer.
    BufferBlock<char> buffer = new BufferBlock<char>();

    // The consumer method runs asynchronously. Start it now.
    Task<int> consumer = Consumer(buffer);

    // Post source data to the dataflow block.
    Producer(buffer);

    // Wait for the consumer to process all data.
    consumer.Wait();

    // Print the count of characters from the input file.
    Console.WriteLine("Processed {0} bytes from input file.", consumer.Result);

    //Print out the input file to the console.
    Console.WriteLine("\r\n\r\n");
    Console.WriteLine("This is the input data file. \r\n");
    using (StreamReader inputfile = new StreamReader(PipelineInputFile))
    {
        while (inputfile.Peek() >= 0)
        {
            Console.Write((char)inputfile.Read());
        }
    }

    //Print out the encrypted file to the console.
    Console.WriteLine("\r\n\r\n");
    Console.WriteLine("This is the encrypted data file. \r\n");
    using (StreamReader encryptfile = new StreamReader(PipelineEncryptFile))
    {
        while (encryptfile.Peek() >= 0)
        {
            Console.Write((char)encryptfile.Read());
        }
    }

    //Wait before closing the application so we can see the results.
    Console.ReadLine();
}

That is all the code that is needed. Now, let's build and run the application using the following input data file:

Once it runs and completes, your output should look like the following screenshot:

Now, try this with your own data files and inputs. Let's examine what happened and how this works.

How it works

First we will go through the Main method. The first thing Main does is create a BufferBlock object called buffer. This will be used as the queue of items between our producer and consumer. This BufferBlock is defined to accept character datatypes. Next, we start our consumer task using this command:

Task<int> consumer = Consumer(buffer);

Also, note that when this buffer object goes into the consumer task, it is cast as ISourceBlock.
Notice the method header of our consumer:

static async Task<int> Consumer(ISourceBlock<char> Source)

Next, our Main method starts our producer task using the following command:

Producer(buffer);

Then we wait until our consumer task finishes, using this command:

consumer.Wait();

So, now our Main method just waits. Its work is done for now; it has started both the producer and consumer tasks. Now our consumer is waiting for items to appear in its BufferBlock so it can process them. The consumer will stay in the following loop until all items are removed from the message block and the block has been completed, which is done by someone calling its Complete method:

while (await Source.OutputAvailableAsync())
{
    char C = Source.Receive();

    //Encrypt each character.
    char encrypted = Encrypt(C);

    DisplayData = DisplayData + encrypted.ToString();
}

So, our consumer task will loop asynchronously, removing items from the message queue as they appear. It uses the following command in the while loop to do this:

await Source.OutputAvailableAsync()

Likewise, other consumer tasks can run at the same time and do the same thing. If the producer is adding items to the block quicker than the consumer can process them, then adding another consumer will improve performance. Once an item is available, the consumer calls the following command to get the item from the buffer:

char C = Source.Receive();

Since the buffer contains items of type character, we place the item received into a character value. The consumer then processes it by encrypting the character and appending it to our display string.

Now, let's look at the producer. The producer first gets its data by calling the following command:

GetData(PipelineInputFile)

This method returns a List collection of characters that has an item for each character in the input data file. The producer then iterates through the collection and uses the following command to place each item into the buffer block:

Target.Post(C);

Also, notice in the method header for our producer that we cast our buffer as an ITargetBlock type:

static void Producer(ITargetBlock<char> Target)

Once the producer is done processing characters and adding them to the buffer, it officially closes the BufferBlock object using this command:

Target.Complete();

That is it for the producer and consumer. Once the Main method is done waiting on the consumer to finish, it then uses the following code to write out the number of characters processed, the input data, and the encrypted data:

// Print the count of characters from the input file.
Console.WriteLine("Processed {0} bytes from input file.", consumer.Result);

//Print out the input file to the console.
Console.WriteLine("\r\n\r\n");
Console.WriteLine("This is the input data file. \r\n");
using (StreamReader inputfile = new StreamReader(PipelineInputFile))
{
    while (inputfile.Peek() >= 0)
    {
        Console.Write((char)inputfile.Read());
    }
}

//Print out the encrypted file to the console.
Console.WriteLine("\r\n\r\n");
Console.WriteLine("This is the encrypted data file. \r\n");
using (StreamReader encryptfile = new StreamReader(PipelineEncryptFile))
{
    while (encryptfile.Peek() >= 0)
    {
        Console.Write((char)encryptfile.Read());
    }
}

Now that you are comfortable implementing a basic producer-consumer design using objects from the TPL Dataflow library, try experimenting with this basic idea but using multiple producers and multiple consumers, all with the same BufferBlock object as the queue between them. Also, try converting our original Pipeline application from the beginning of the article into a TPL Dataflow producer-consumer application with two sets of producers and consumers. The first will act as stage 1 and stage 2, and the second will act as stage 2 and stage 3. So, in effect, stage 2 will be both a consumer and a producer.

Summary

We have covered a lot in this article. We have learned the benefits of the Pipeline and producer-consumer design patterns and how to implement them. As we saw, these are both very helpful design patterns when building parallel and concurrent applications that require asynchronous processing of data between multiple tasks.

In the Pipeline design, we are able to run multiple tasks or stages concurrently even though the stages rely on data being processed and output by other stages. This is very helpful for performance, since all functionality doesn't have to wait on each stage to finish processing every item of data. In our example, we are able to start decrypting characters of data while a previous stage is still encrypting data and placing it into the queue. In the Pipeline example, we examined the benefits of the BlockingCollection class in acting as a queue between the stages in our pipeline.

Next, we explored the new TPL Dataflow library and some of its message block classes. These classes implement several interfaces defined in the library: ISourceBlock, ITargetBlock, and IPropagatorBlock. Implementing these interfaces allows us to write generic producer and consumer task functionality that can be reused in a variety of applications.

Both of these design patterns and the Dataflow library allow for easy implementations of common functionality in a concurrent manner. You will use these techniques in many applications, and they will become go-to design patterns when you evaluate a system's requirements and determine how to implement concurrency to improve performance. Like all programming, parallel programming is made easier when you have a toolbox of easy-to-use techniques that you are comfortable with. Most applications that benefit from parallelism will be conducive to some variation of a producer-consumer or Pipeline pattern. Also, the BlockingCollection and Dataflow message block objects are useful mechanisms for coordinating data between parallel tasks, no matter what design pattern is used in the application. It will be very useful to become comfortable with these messaging and queuing classes.
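As a starting point for the conversion exercise suggested just before the summary (where stage 2 acts as both a consumer and a producer), the middle stage could take a shape like the following minimal sketch. This is not code from the article; the name RelayStage and its parameters are ours, and it is meant to sit in the same console application class as the Producer and Consumer methods shown earlier, where the required using directives are already in place:

// A middle stage for the suggested exercise: it consumes characters from its
// input buffer and produces transformed characters to its output buffer.
static async Task RelayStage(ISourceBlock<char> input,
                             ITargetBlock<char> output,
                             Func<char, char> transform)
{
    try
    {
        // Keep draining the input block until it is empty and completed.
        while (await input.OutputAvailableAsync())
        {
            char c = input.Receive();
            output.Post(transform(c));
        }
    }
    finally
    {
        // Signal the downstream consumer that no more items are coming.
        output.Complete();
    }
}

Two BufferBlock<char> instances would then sit on either side of this stage: the producer posts to the first buffer, RelayStage moves items from the first buffer to the second (passing Encrypt as the transform, for example), and the final consumer drains the second buffer.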

Working with Your Team

Packt
19 Dec 2014
14 min read
In this article by Jarosław Krochmalski, author of the book IntelliJ IDEA Essentials, we will talk about working with VCS systems such as Git and Subversion. While working on the code, one of the most important aspects is version control. A Version Control System (VCS) (also known as a Revision Control System) is a repository of source code files with monitored access. Every change made to the source is tracked, along with who made the change, why they made it, and comments about problems fixed or enhancements introduced by the change. It doesn't matter if you work alone or in a team, having the tool to efficiently work with different versions of the code is crucial. Software development is usually carried out by teams, either distributed or colocated. The version control system lets developers work on a copy of the source code and then release their changes back to the common codebase when ready. Other developers work on their own copies of the same code at the same time, unaffected by each other's changes until they choose to merge or commit their changes back to the project. Currently, probably the most popular version control system is Git. After reading this article, you will be able to set up the version control mechanism of your choice, get files from the repository, commit your work, and browse the changes. Let's start with the version control setup. (For more resources related to this topic, see here.) Enabling version control At the IDE level, version control integration is provided through a set of plugins. IntelliJ IDEA comes bundled with a number of plugins to integrate with the most popular version control systems. They include Git, CVS, Subversion, and Mercurial. The Ultimate edition additionally contains Clearcase, Visual SourceSafe, and Perforce plugins. You will need to enable them in the Plugins section of the Settings dialog box. If you find the VCS feature is not enough and you are using some other VCS, try to find it in the Browse Repositories dialog box by choosing VCS Integration from the Category drop-down menu, as shown here: The list of plugins here contains not only integration plugins, but also some useful add-ons for the installed integrations. For example, the SVN Bar plugin will create a quick access toolbar with buttons specific for Subversion (SVN) actions. Feel free to browse the list of plugins here and read the descriptions; you might find some valuable extensions. The basic principles of working with the version control systems in IntelliJ IDEA are rather similar. We will focus on the Git and Subversion integration. This article should give you an overview of how to deal with the setup and version control commands in IntelliJ IDEA in general. If you have the necessary plugins enabled in the Settings dialog box, you can start working with the version control. We will begin with fetching the project out of the version control. Doing this will set up the version control automatically so that further steps will not be required unless you decide not to use the default workflow. Later, we will cover setting the VCS integration manually, so you will be able to tweak IntelliJ's behavior then. Checking out the project from the repository To be able to work on the files, first you need to get them from the repository. 
To get the files from the remote Git repository, you need to use the clone command available in the VCS menu, under the Checkout from Version Control option, as shown here: In the Clone Repository dialog box, provide necessary options, such as the remote repository URL, parent directory, and the directory name to clone into, as shown in the following screenshot: After successful cloning, IntelliJ IDEA will suggest creating a project based on the cloned sources. If you don't have the remote repository for your project, you can work with the offline local Git repository. To create a local Git repository, select Create Git repository from the VCS menu, as shown in the following screenshot: This option will execute the git init command in the directory of your choice; it will most probably be the root directory of your project. For the time being, the Git plugin does not allow you to set up remote repositories. You will probably need to set up the remote host for your newly created Git repository before you can actually fetch and push changes. If you are using GitHub for your projects, the great GitHub integration plugin gives you the option to share the project on GitHub. This will create the remote repository automatically. Later, when you want to get the files from the remote repository, just use the Git Pull command. This will basically retrieve changes (fetch) and apply them to the local branch (merge). To obtain a local working copy of a subversion repository, choose Checkout from Version Control and then Subversion from the VCS menu. In the SVN Checkout Options dialog box, you will be able to specify Subversion-specific settings, such as a revision that needs to be checked (HEAD, for example). Again, IntelliJ IDEA will ask if you want to create the project from checked out sources. If you accept the suggestion to create a new project, New Project from Existing Code Wizard will start. Fetching the project out of the repository will create some default VCS configuration in IntelliJ IDEA. It is usually sufficient, but if needed, the configuration can be changed. Let's discuss how to change the configuration in the next section. Configuring version control The VCS configuration in IntelliJ IDEA can be changed at the project level. Head to the Version Control section in the Settings dialog box, as shown here: The Version Control section contains options that are common for all version control systems and also specific options for the different VCS systems (enabled by installing the corresponding plugins). IntelliJ IDEA uses a directory-based model for version control. The versioning mechanism is assigned to a specific directory that can either be a part of a project or can be just related to the project. This directory is not required to be located under the project root. Multiple directories can have different version control systems linked. To add a directory into the version control integration, use the Alt + Insert keyboard shortcut or click on the green plus button; the Add VCS Directory Mapping dialog box will appear. You have the option to put all the project contents, starting from its base directory to the version control or limit the version control only to specific directories. 
Select the VCS system you need from the VCS drop-down menu, as shown in the following screenshot: By default, IntelliJ IDEA will mark the changed files with a color in the Project tool window, as shown here: If you select the Show directories with changed descendants option, IntelliJ IDEA will additionally mark the directories containing the changed files with a color, giving you the possibility to quickly notice the changes without expanding the project tree, as shown in the following screenshot: The Show changed in last <number> days option will highlight the files changed recently during the debugging process and when displaying stacktraces. Displaying the changed files in color can be very useful. If you see the colored file in the stacktrace, maybe the last change to the file is causing a problem. The subsequent panes contain general version control settings, which apply to all version control systems integrated with the IDE. They include specifying actions that require confirmation, background operations set up, the ignored files list, and issuing of navigation configuration. In the Confirmation section, you specify what version control actions will need your confirmation. The Background section will tell IntelliJ IDEA what operation it should perform in the background, as shown in the following screenshot: If you choose to perform the operation in the background, IntelliJ IDEA will not display any modal windows during and after the operation. The progress and result will be presented in the status bar of the IDE and in the corresponding tool windows. For example, after the successful execution of the Git pull command, IntelliJ IDEA will present the Update Project Info tool window with the files changed and the Event Log tool window with the status of the operation, as shown in the following screenshot: In the Ignored Files section, you can specify a list of files and directories that you do not want to put under version control, as shown in the following screenshot: To add a file or directory, use the Alt + Insert keyboard shortcut or hit the green plus (+) icon. The Ignore Unversioned Files dialog box will pop up as shown here: You can now specify a single file or the directory you want to ignore. There is also the possibility to construct the filename pattern for files to be ignored. Backup and logfiles are good candidates to be specified here, for example. Most of the version control systems support the file with a list of file patterns to ignore. For Git, this will be the .gitignore file. IntelliJ IDEA will analyze such files during the project checkout from the existing repository and will fill the Ignored files list automatically. In the Issue Navigation section, you can create a list of patterns to issue navigation. IntelliJ IDEA will try to use these patterns to create links from the commit messages. These links will then be displayed in the Changes and Version Control tool windows. Clicking on the link will open the browser and take you to the issue tracker of your choice. IntelliJ IDEA comes with predefined patterns for the most popular issue trackers: JIRA and YouTrack. To create a link to JIRA, click on the first button and provide the URL for your JIRA instance, as shown in the following screenshot: To create a link to the YouTrack instance, click on the OK button and provide the URL to the YouTrack instance. If you do not use JIRA or YouTrack, you can also specify a generic pattern. Press the Alt + Insert keyboard shortcut to add a new pattern. 
In the IssueID field, enter the regular expression that IntelliJ IDEA will use to extract a part of the link. In the Issue Link field, provide the link expression that IntelliJ IDEA will use to replace a issue number within. Use the Example section to check if the resulting link is correct, as shown in the following screenshot: The next sections in the Version Control preferences list contain options specific to the version control system you are using. For example, the Git-specific options can be configured in the Git section, as shown here: You can specify the Git command executable here or select the associated SSH executable that will be used to perform the network Git operations such as pull and push. The Auto-update if push of the current branch was rejected option is quite useful—IntelliJ IDEA will execute the pull command first if the push command fails because of the changes in the repository revision. This saves some time.We should now have version control integration up and running. Let's use it. Working with version control Before we start working with version control, we need to know about the concept of the changelist in IntelliJ IDEA. Let's focus on this now. Changelists When it comes to newly created or modified files, IntelliJ IDEA introduces the concept of a changelist. A changelist is a set of file modifications that represents a logical change in the source. Any modified file will go to the Default changelist. You can create new changelists if you like. The changes contained in a specific changelist are not stored in the repository until committed. Only the active changelist contains the files that are going to be committed. If you modify the file that is contained in the non-active change list, there is a risk that it will not be committed. This takes us to the last section of the common VCS settings at Settings | Version Control | Changelist conflicts. In this section, you can configure the protection of files that are present in the changelist that is not currently active. In other words, you define how IntelliJ IDEA should behave when you modify the file that is not in the active changelist. The protection is turned on by default (Enable changelist conflict tracking is checked). If the Resolve Changelist Conflict checkbox is marked, the IDE will display the Resolve Changelist Conflict dialog box when you try to modify such a file. The possible options are to either shelve the changes (we will talk about the concept of shelving in a while), move a file to the active changelist, switch changelists to make the current changelist active, or ignore the conflict. If Highlight files with conflicts is checked and if you try to modify a file from the non-active change list, a warning will pop up in the editor, as shown in the following screenshot: Again, you will have the possibility to move the changes to another change list, switch the active change list, or ignore the conflict. If you select Ignore, the change will be listed in the Files with ignored conflicts list, as shown in the following screenshot: The list of all changelists in the project is listed in the Commit Changes dialog box (we will cover committing files in a while) and in the first tab of the Changes tool window, as shown here: You can create a new changelist by using the Alt + Insert keyboard shortcut. The active list will have its name highlighted in bold. The last list is special; it contains the list of unversioned files. 
You can drag-and-drop files between the changelists (with the exception of unversioned files). Now that we know what a changelist is, let's add some files to the repository now. Adding files to version control You will probably want newly created files to be placed in version control. If you create a file in a directory already associated with the version control system, IntelliJ IDEA will add the file to the active changelist automatically, unless you configured this differently in the Confirmation section of the Version Control pane in the Settings dialog box. If you decided to have Show options before adding to version control checked, IntelliJ IDEA will ask if you want to add the file to the VCS, as shown here: If you decide to check the Remember, don't ask again checkbox, IntelliJ IDEA will throw the future new files into version control silently. You can also add new files to the version control explicitly. Click on the file or directory you want to add in the Project tool window and choose the corresponding VCS command; for example: Alternatively, you can open the Changes tool window, and browse Unversioned Files, where you can right-click on the file you want to add and select Add to VCS from the context menu, as shown in the following screenshot: If there are many unversioned files, IntelliJ IDEA will render a link that allows you to browse the files in a separate dialog box, as shown in the following screenshot: In the Unversioned Files dialog box, right-click on the file you want to add and select Add to VCS from the context menu, as shown in the following screenshot: From now on, the file will be ready to commit to the repository. If you've accidently added some files to version control and want to change them to unversioned, you can always revert the file so that it is no longer marked as part of the versioned files. Summary After reading this article, you know how to set up version control, get the project from the repository, commit your work, and get the changes made by other members of your team. Version control in IntelliJ IDEA is tightly integrated into the IDE. All the versioning activities can be executed from the IDE itself—you will not need to use an external tool for this. I believe it will shortly become natural for you to use the provided functionalities. Not being distracted by the use of external tools will result in higher effectiveness. Resources for Article: Further resources on this subject: Improving Your Development Speed [Article] Ridge Regression [Article] Function passing [Article]

Performance Optimization

Packt
19 Dec 2014
30 min read
This article is written by Mark Kerzner and Sujee Maniyam, the authors of HBase Design Patterns. In it, we will talk about how to write high-performance and scalable HBase applications. In particular, we will take a look at the following topics:

The bulk loading of data into HBase
Profiling HBase applications
Tips to get good performance on writes
Tips to get good performance on reads

(For more resources related to this topic, see here.)

Loading bulk data into HBase

When deploying HBase for the first time, we usually need to import a significant amount of data. This is called initial loading or bootstrapping. There are three methods that can be used to import data into HBase, given as follows:

Using the Java API to insert data into HBase. This can be done in a single client, using single or multiple threads.
Using MapReduce to insert data in parallel (this approach also uses the Java API), as shown in the following diagram.
Using MapReduce to generate HBase store files in parallel in bulk and then import them into HBase directly. (This approach does not require the use of the API; it does not require code and is very efficient.)

On comparing the three methods speed-wise, we have the following order: Java client < MapReduce insert < HBase file import. The Java client and MapReduce use HBase APIs to insert data. MapReduce runs on multiple machines and can exploit parallelism. However, both of these methods go through the write path in HBase. Importing HBase files directly, however, skips the usual write path. HBase files already have data in the correct format that HBase understands. That's why importing them is much faster than using MapReduce and the Java client. We covered the Java API earlier. Let's start with how to insert data using MapReduce.

Importing data into HBase using MapReduce

MapReduce is the distributed processing engine of Hadoop. Usually, programs read/write data from HDFS. Luckily, HBase supports MapReduce. HBase can be the source and the sink for MapReduce programs. A source means MapReduce programs can read from HBase, and a sink means results from MapReduce can be sent to HBase. The following diagram illustrates various sources and sinks for MapReduce, and it can be summarized as follows:

Scenario 1 (source: HDFS, sink: HDFS): This is a typical MapReduce method that reads data from HDFS and also sends the results to HDFS.
Scenario 2 (source: HDFS, sink: HBase): This imports the data from HDFS into HBase. It's a very common method that is used to import data into HBase for the first time.
Scenario 3 (source: HBase, sink: HBase): Data is read from HBase and written to it. It is most likely that these will be two separate HBase clusters. It's usually used for backups and mirroring.

Importing data from HDFS into HBase

Let's say we have lots of data in HDFS and want to import it into HBase. We are going to write a MapReduce program that reads from HDFS and inserts data into HBase. This is depicted in the second scenario described previously. Now, we'll be setting up the environment for the following discussion. In addition, you can find the code and the data for this discussion in our GitHub repository at https://github.com/elephantscale/hbase-book. The dataset we will use is the sensor data. Our (imaginary) sensor data is stored in HDFS as CSV (comma-separated values) text files.
This is how their format looks: Sensor_id, max temperature, min temperature Here is some sample data: sensor11,90,70 sensor22,80,70 sensor31,85,72 sensor33,75,72 We have two sample files (sensor-data1.csv and sensor-data2.csv) in our repository under the /data directory. Feel free to inspect them. The first thing we have to do is copy these files into HDFS. Create a directory in HDFS as follows: $   hdfs   dfs -mkdir   hbase-import Now, copy the files into HDFS: $   hdfs   dfs   -put   sensor-data*   hbase-import/ Verify that the files exist as follows: $   hdfs   dfs -ls   hbase-import We are ready to insert this data into HBase. Note that we are designing the table to match the CSV files we are loading for ease of use. Our row key is sensor_id. We have one column family and we call it f (short for family). Now, we will store two columns, max temperature and min temperature, in this column family. Pig for MapReduce Pig allows you to write MapReduce programs at a very high level, and inserting data into HBase is just as easy. Here's a Pig script that reads the sensor data from HDFS and writes it in HBase: -- ## hdfs-to-hbase.pigdata = LOAD 'hbase-import/' using PigStorage(',') as (sensor_id:chararray, max:int, min:int);-- describe data;-- dump data; Now, store the data in hbase://sensors using the following line of code: org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:max,f:min'); After creating the table, in the first command, we will load data from the hbase-import directory in HDFS. The schema for the data is defined as follows: Sensor_id : chararray (string)max : intmin : int The describe and dump statements can be used to inspect the data; in Pig, describe will give you the structure of the data object you have, and dump will output all the data to the terminal. The final STORE command is the one that inserts the data into HBase. Let's analyze how it is structured: INTO 'hbase://sensors': This tells Pig to connect to the sensors HBase table. org.apache.pig.backend.hadoop.hbase.HBaseStorage: This is the Pig class that will be used to write in HBase. Pig has adapters for multiple data stores. The first field in the tuple, sensor_id, will be used as a row key. We are specifying the column names for the max and min fields (f:max and f:min, respectively). Note that we have to specify the column family (f:) to qualify the columns. Before running this script, we need to create an HBase table called sensors. We can do this from the HBase shell, as follows: $ hbase shell$ create 'sensors' , 'f'$ quit Then, run the Pig script as follows: $ pig hdfs-to-hbase.pig Now watch the console output. Pig will execute the script as a MapReduce job. Even though we are only importing two small files here, we can insert a fairly large amount of data by exploiting the parallelism of MapReduce. At the end of the run, Pig will print out some statistics: Input(s):Successfully read 7 records (591 bytes) from: "hdfs://quickstart.cloudera:8020/user/cloudera/hbase-import"Output(s):Successfully stored 7 records in: "hbase://sensors" Looks good! We should have seven rows in our HBase sensors table. 
We can inspect the table from the HBase shell with the following commands: $ hbase shell$ scan 'sensors' This is how your output might look: ROW                      COLUMN+CELL sensor11                 column=f:max, timestamp=1412373703149, value=90 sensor11                 column=f:min, timestamp=1412373703149, value=70 sensor22                 column=f:max, timestamp=1412373703177, value=80 sensor22                column=f:min, timestamp=1412373703177, value=70 sensor31                 column=f:max, timestamp=1412373703177, value=85 sensor31                 column=f:min, timestamp=1412373703177, value=72 sensor33                 column=f:max, timestamp=1412373703177, value=75 sensor33                 column=f:min, timestamp=1412373703177, value=72 sensor44                 column=f:max, timestamp=1412373703184, value=55 sensor44                 column=f:min, timestamp=1412373703184, value=42 sensor45                 column=f:max, timestamp=1412373703184, value=57 sensor45                 column=f:min, timestamp=1412373703184, value=47 sensor55                 column=f:max, timestamp=1412373703184, value=55 sensor55                 column=f:min, timestamp=1412373703184, value=427 row(s) in 0.0820 seconds There you go; you can see that seven rows have been inserted! With Pig, it was very easy. It took us just two lines of Pig script to do the import. Java MapReduce We have just demonstrated MapReduce using Pig, and you now know that Pig is a concise and high-level way to write MapReduce programs. This is demonstrated by our previous script, essentially the two lines of Pig code. However, there are situations where you do want to use the Java API, and it would make more sense to use it than using a Pig script. This can happen when you need Java to access Java libraries or do some other detailed tasks for which Pig is not a good match. For that, we have provided the Java version of the MapReduce code in our GitHub repository. Using HBase's bulk loader utility HBase is shipped with a bulk loader tool called ImportTsv that can import files from HDFS into HBase tables directly. It is very easy to use, and as a bonus, it uses MapReduce internally to process files in parallel. Perform the following steps to use ImportTsv: Stage data files into HDFS (remember that the files are processed using MapReduce). Create a table in HBase if required. Run the import. Staging data files into HDFS The first step to stage data files into HDFS has already been outlined in the previous section. The following sections explain the next two steps to stage data files. Creating an HBase table We will do this from the HBase shell. A note on regions is in order here. Regions are shards created automatically by HBase. It is the regions that are responsible for the distributed nature of HBase. However, you need to pay some attention to them in order to assure performance. If you put all the data in one region, you will cause what is called region hotspotting. What is especially nice about a bulk loader is that when creating a table, it lets you presplit the table into multiple regions. Precreating regions will allow faster imports (because the insert requests will go out to multiple region servers). 
Here, we are creating a single column family: $ hbase shellhbase> create 'sensors', {NAME => 'f'}, {SPLITS => ['sensor20', 'sensor40', 'sensor60']}0 row(s) in 1.3940 seconds=> Hbase::Table - sensors hbase > describe 'sensors'DESCRIPTION                                       ENABLED'sensors', {NAME => 'f', DATA_BLOCK_ENCODING => true'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE=> '0', VERSIONS => '1', COMPRESSION => 'NONE',MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}1 row(s) in 0.1140 seconds We are creating regions here. Why there are exactly four regions will be clear from the following diagram:   On inspecting the table in the HBase Master UI, we will see this. Also, you can see how Start Key and End Key, which we specified, are showing up. Run the import Ok, now it's time to insert data into HBase. To see the usage of ImportTsv, do the following: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv This will print the usage as follows: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,f:max,f:min sensors   hbase-import/ The following table explains what the parameters mean: Parameter Description -Dimporttsv.separator Here, our separator is a comma (,). The default value is tab (t). -Dimporttsv.columns=HBASE_ROW_KEY,f:max,f:min This is where we map our input files into HBase tables. The first field, sensor_id, is our key, and we use HBASE_ROW_KEY to denote that the rest we are inserting into column family f. The second field, max temp, maps to f:max. The last field, min temp, maps to f:min. sensors This is the table name. hbase-import This is the HDFS directory where the data files are located.  When we run this command, we will see that a MapReduce job is being kicked off. This is how an import is parallelized. Also, from the console output, we can see that MapReduce is importing two files as follows: [main] mapreduce.JobSubmitter: number of splits:2 While the job is running, we can inspect the progress from YARN (or the JobTracker UI). One thing that we can note is that the MapReduce job only consists of mappers. This is because we are reading a bunch of files and inserting them into HBase directly. There is nothing to aggregate. So, there is no need for reducers. After the job is done, inspect the counters and we can see this: Map-Reduce Framework Map input records=7 Map output records=7 This tells us that mappers read seven records from the files and inserted seven records into HBase. 
Let's also verify the data in HBase: $   hbase shellhbase >   scan 'sensors'ROW                 COLUMN+CELLsensor11           column=f:max, timestamp=1409087465345, value=90sensor11           column=f:min, timestamp=1409087465345, value=70sensor22           column=f:max, timestamp=1409087465345, value=80sensor22           column=f:min, timestamp=1409087465345, value=70sensor31           column=f:max, timestamp=1409087465345, value=85sensor31           column=f:min, timestamp=1409087465345, value=72sensor33           column=f:max, timestamp=1409087465345, value=75sensor33           column=f:min, timestamp=1409087465345, value=72sensor44            column=f:max, timestamp=1409087465345, value=55sensor44           column=f:min, timestamp=1409087465345, value=42sensor45           column=f:max, timestamp=1409087465345, value=57sensor45           column=f:min, timestamp=1409087465345, value=47sensor55           column=f:max, timestamp=1409087465345, value=55sensor55           column=f:min, timestamp=1409087465345, value=427 row(s) in 2.1180 seconds Your output might vary slightly. We can see that seven rows are inserted, confirming the MapReduce counters! Let's take another quick look at the HBase UI, which is shown here:    As you can see, the inserts go to different regions. So, on a HBase cluster with many region servers, the load will be spread across the cluster. This is because we have presplit the table into regions. Here are some questions to test your understanding. Run the same ImportTsv command again and see how many records are in the table. Do you get duplicates? Try to find the answer and explain why that is the correct answer, then check these in the GitHub repository (https://github.com/elephantscale/hbase-book). Bulk import scenarios Here are a few bulk import scenarios: Scenario Methods Notes The data is already in HDFS and needs to be imported into HBase. The two methods that can be used to do this are as follows: If the ImportTsv tool can work for you, then use it as it will save time in writing custom MapReduce code. Sometimes, you might have to write a custom MapReduce job to import (for example, complex time series data, doing data mapping, and so on). It is probably a good idea to presplit the table before a bulk import. This spreads the insert requests across the cluster and results in a higher insert rate. If you are writing a custom MapReduce job, consider using a high-level MapReduce platform such as Pig or Hive. They are much more concise to write than the Java code. The data is in another database (RDBMs/NoSQL) and you need to import it into HBase. Use a utility such as Sqoop to bring the data into HDFS and then use the tools outlined in the first scenario. Avoid writing MapReduce code that directly queries databases. Most databases cannot handle many simultaneous connections. It is best to bring the data into Hadoop (HDFS) first and then use MapReduce. Profiling HBase applications Just like any software development process, once we have our HBase application working correctly, we would want to make it faster. At times, developers get too carried away and start optimizing before the application is finalized. There is a well-known rule that premature optimization is the root of all evil. One of the sources for this rule is Scott Meyers Effective C++. We can perform some ad hoc profiling in our code by timing various function calls. Also, we can use profiling tools to pinpoint the trouble spots. 
Using profiling tools is highly encouraged for the following reasons: Profiling takes out the guesswork (and a good majority of developers' guesses are wrong). There is no need to modify the code. Manual profiling means that we have to go and insert the instrumentation code all over the code. Profilers work by inspecting the runtime behavior. Most profilers have a nice and intuitive UI to visualize the program flow and time flow. The authors use JProfiler. It is a pretty effective profiler. However, it is neither free nor open source. So, for the purpose of this article, we are going to show you a simple manual profiling, as follows: public class UserInsert {      static String tableName = "users";    static String familyName = "info";      public static void main(String[] args) throws Exception {        Configuration config = HBaseConfiguration.create();        // change the following to connect to remote clusters        // config.set("hbase.zookeeper.quorum", "localhost");        long t1a = System.currentTimeMillis();        HTable htable = new HTable(config, tableName);        long t1b = System.currentTimeMillis();        System.out.println ("Connected to HTable in : " + (t1b-t1a) + " ms");        int total = 100;        long t2a = System.currentTimeMillis();        for (int i = 0; i < total; i++) {            int userid = i;            String email = "user-" + i + "@foo.com";            String phone = "555-1234";              byte[] key = Bytes.toBytes(userid);            Put put = new Put(key);              put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));            put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));            htable.put(put);          }        long t2b = System.currentTimeMillis();        System.out.println("inserted " + total + " users in " + (t2b - t2a) + " ms");        htable.close();      } } The code we just saw inserts some sample user data into HBase. We are profiling two operations, that is, connection time and actual insert time. A sample run of the Java application yields the following: Connected to HTable in : 1139 msinserted 100 users in 350 ms We spent a lot of time in connecting to HBase. This makes sense. The connection process has to go to ZooKeeper first and then to HBase. So, it is an expensive operation. How can we minimize the connection cost? The answer is by using connection pooling. Luckily, for us, HBase comes with a connection pool manager. The Java class for this is HConnectionManager. It is very simple to use. 
Let's update our class to use HConnectionManager: Code : File name: hbase_dp.ch8.UserInsert2.java   package hbase_dp.ch8;   import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.hbase.HBaseConfiguration; import org.apache.hadoop.hbase.client.HConnection; import org.apache.hadoop.hbase.client.HConnectionManager; import org.apache.hadoop.hbase.client.HTable; import org.apache.hadoop.hbase.client.HTableInterface; import org.apache.hadoop.hbase.client.Put; import org.apache.hadoop.hbase.util.Bytes;   public class UserInsert2 {      static String tableName = "users";    static String familyName = "info";      public static void main(String[] args) throws Exception {        Configuration config = HBaseConfiguration.create();        // change the following to connect to remote clusters        // config.set("hbase.zookeeper.quorum", "localhost");               long t1a = System.currentTimeMillis();        HConnection hConnection = HConnectionManager.createConnection(config);        long t1b = System.currentTimeMillis();        System.out.println ("Connection manager in : " + (t1b-t1a) + " ms");          // simulate the first 'connection'        long t2a = System.currentTimeMillis();        HTableInterface htable = hConnection.getTable(tableName) ;        long t2b = System.currentTimeMillis();        System.out.println ("first connection in : " + (t2b-t2a) + " ms");               // second connection        long t3a = System.currentTimeMillis();        HTableInterface htable2 = hConnection.getTable(tableName) ;        long t3b = System.currentTimeMillis();        System.out.println ("second connection : " + (t3b-t3a) + " ms");          int total = 100;        long t4a = System.currentTimeMillis();        for (int i = 0; i < total; i++) {            int userid = i;            String email = "user-" + i + "@foo.com";            String phone = "555-1234";              byte[] key = Bytes.toBytes(userid);            Put put = new Put(key);              put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));            put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));            htable.put(put);          }      long t4b = System.currentTimeMillis();        System.out.println("inserted " + total + " users in " + (t4b - t4a) + " ms");        hConnection.close();    } } A sample run yields the following timings: Connection manager in : 98 ms first connection in : 808 ms second connection : 0 ms inserted 100 users in 393 ms The first connection takes a long time, but then take a look at the time of the second connection. It is almost instant ! This is cool! If you are connecting to HBase from web applications (or interactive applications), use connection pooling. More tips for high-performing HBase writes Here we will discuss some techniques and best practices to improve writes in HBase. Batch writes Currently, in our code, each time we call htable.put (one_put), we make an RPC call to an HBase region server. This round-trip delay can be minimized if we call htable.put() with a bunch of put records. Then, with one round trip, we can insert a bunch of records into HBase. This is called batch puts. Here is an example of batch puts. Only the relevant section is shown for clarity. 
For the full code, see hbase_dp.ch8.UserInsert3.java:        int total = 100;        long t4a = System.currentTimeMillis();        List<Put> puts = new ArrayList<>();        for (int i = 0; i < total; i++) {            int userid = i;            String email = "user-" + i + "@foo.com";            String phone = "555-1234";              byte[] key = Bytes.toBytes(userid);            Put put = new Put(key);              put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));            put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));                       puts.add(put); // just add to the list        }        htable.put(puts); // do a batch put        long t4b = System.currentTimeMillis();        System.out.println("inserted " + total + " users in " + (t4b - t4a) + " ms"); A sample run with a batch put is as follows: inserted 100 users in 48 ms The same code with individual puts took around 350 milliseconds! Use batch writes when you can to minimize latency. Note that the HTableUtil class that comes with HBase implements some smart batching options for your use and enjoyment. Setting memory buffers We can control when the puts are flushed by setting the client write buffer option. Once the data in the memory exceeds this setting, it is flushed to disk. The default setting is 2 M. Its purpose is to limit how much data is stored in the buffer before writing it to disk. There are two ways of setting this: In hbase-site.xml (this setting will be cluster-wide): <property>  <name>hbase.client.write.buffer</name>    <value>8388608</value>   <!-- 8 M --></property> In the application (only applies for that application): htable.setWriteBufferSize(1024*1024*10); // 10 Keep in mind that a bigger buffer takes more memory on both the client side and the server side. As a practical guideline, estimate how much memory you can dedicate to the client and put the rest of the load on the cluster. Turning off autofush If autoflush is enabled, each htable.put() object incurs a round trip RPC call to HRegionServer. Turning autoflush off can reduce the number of round trips and decrease latency. To turn it off, use this code: htable.setAutoFlush(false); The risk of turning off autoflush is if the client crashes before the data is sent to HBase, it will result in a data loss. Still, when will you want to do it? The answer is: when the danger of data loss is not important and speed is paramount. Also, see the batch write recommendations we saw previously. Turning off WAL Before we discuss this, we need to emphasize that the write-ahead log (WAL) is there to prevent data loss in the case of server crashes. By turning it off, we are bypassing this protection. Be very careful when choosing this. Bulk loading is one of the cases where turning off WAL might make sense. To turn off WAL, set it for each put: put.setDurability(Durability.SKIP_WAL); More tips for high-performing HBase reads So far, we looked at tips to write data into HBase. Now, let's take a look at some tips to read data faster. The scan cache When reading a large number of rows, it is better to set scan caching to a high number (in the 100 seconds or 1,000 seconds range). Otherwise, each row that is scanned will result in a trip to HRegionServer. This is especially encouraged for MapReduce jobs as they will likely consume a lot of rows sequentially. 
To set scan caching, use the following code: Scan scan = new Scan(); scan.setCaching(1000); Only read the families or columns needed When fetching a row, by default, HBase returns all the families and all the columns. If you only care about one family or a few attributes, specifying them will save needless I/O. To specify a family, use this: scan.addFamily( Bytes.toBytes("familiy1")); To specify columns, use this: scan.addColumn( Bytes.toBytes("familiy1"),   Bytes.toBytes("col1")) The block cache When scanning large rows sequentially (say in MapReduce), it is recommended that you turn off the block cache. Turning off the cache might be completely counter-intuitive. However, caches are only effective when we repeatedly access the same rows. During sequential scanning, there is no caching, and turning on the block cache will introduce a lot of churning in the cache (new data is constantly brought into the cache and old data is evicted to make room for the new data). So, we have the following points to consider: Turn off the block cache for sequential scans Turn off the block cache for random/repeated access Benchmarking or load testing HBase Benchmarking is a good way to verify HBase's setup and performance. There are a few good benchmarks available: HBase's built-in benchmark The Yahoo Cloud Serving Benchmark (YCSB) JMeter for custom workloads HBase's built-in benchmark HBase's built-in benchmark is PerformanceEvaluation. To find its usage, use this: $   hbase org.apache.hadoop.hbase.PerformanceEvaluation To perform a write benchmark, use this: $ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 5 Here we are using five threads and no MapReduce. To accurately measure the throughput, we need to presplit the table that the benchmark writes to. It is TestTable. $ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --presplit=3 randomWrite 5 Here, the table is split in three ways. It is good practice to split the table into as many regions as the number of region servers. There is a read option along with a whole host of scan options. YCSB The YCSB is a comprehensive benchmark suite that works with many systems such as Cassandra, Accumulo, and HBase. Download it from GitHub, as follows: $   git clone git://github.com/brianfrankcooper/YCSB.git Build it like this: $ mvn -DskipTests package Create an HBase table to test against: $ hbase shellhbase> create 'ycsb', 'f1' Now, copy hdfs-site.xml for your cluster into the hbase/src/main/conf/ directory and run the benchmark: $ bin/ycsb load hbase -P workloads/workloada -p columnfamily=f1 -p table=ycsb YCSB offers lots of workloads and options. Please refer to its wiki page at https://github.com/brianfrankcooper/YCSB/wiki. JMeter for custom workloads The standard benchmarks will give you an idea of your HBase cluster's performance. However, nothing can substitute measuring your own workload. We want to measure at least the insert speed or the query speed. We also want to run a stress test. So, we can measure the ceiling on how much our HBase cluster can support. We can do a simple instrumentation as we did earlier too. However, there are tools such as JMeter that can help us with load testing. Please refer to the JMeter website and check out the Hadoop or HBase plugins for JMeter. Monitoring HBase Running any distributed system involves decent monitoring. HBase is no exception. 
Luckily, HBase has the following capabilities: HBase exposes a lot of metrics These metrics can be directly consumed by monitoring systems such as Ganglia We can also obtain these metrics in the JSON format via the REST interface and JMX Monitoring is a big subject and we consider it as part HBase administration. So, in this section, we will give pointers to tools and utilities that allow you to monitor HBase. Ganglia Ganglia is a generic system monitor that can monitor hosts (such as CPU, disk usage, and so on). The Hadoop stack has had a pretty good integration with Ganglia for some time now. HBase and Ganglia integration is set up by modern installers from Cloudera and Hortonworks. To enable Ganglia metrics, update the hadoop-metrics.properties file in the HBase configuration directory. Here's a sample file: hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 hbase.period=10 hbase.servers=ganglia-server:PORT jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 jvm.period=10 jvm.servers=ganglia-server:PORT rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 rpc.period=10 rpc.servers=ganglia-server:PORT This file has to be uploaded to all the HBase servers (master servers as well as region servers). Here are some sample graphs from Ganglia (these are Wikimedia statistics, for example): These graphs show cluster-wide resource utilization. OpenTSDB OpenTSDB is a scalable time series database. It can collect and visualize metrics on a large scale. OpenTSDB uses collectors, light-weight agents that send metrics to the open TSDB server to collect metrics, and there is a collector library that can collect metrics from HBase. You can see all the collectors at http://opentsdb.net/docs/build/html/user_guide/utilities/tcollector.html. An interesting factoid is that OpenTSDB is built on Hadoop/HBase. Collecting metrics via the JMX interface HBase exposes a lot of metrics via JMX. This page can be accessed from the web dashboard at http://<hbase master>:60010/jmx. For example, for a HBase instance that is running locally, it will be http://localhost:60010/jmx. Here is a sample screenshot of the JMX metrics via the web UI: Here's a quick example of how to programmatically retrieve these metrics using curl: $ curl 'localhost:60010/jmx' Since this is a web service, we can write a script/application in any language (Java, Python, or Ruby) to retrieve and inspect the metrics. Summary In this article, you learned how to push the performance of our HBase applications up. We looked at how to effectively load a large amount of data into HBase. You also learned about benchmarking and monitoring HBase and saw tips on how to do high-performing reads/writes. Resources for Article:   Further resources on this subject: The HBase's Data Storage [article] Hadoop and HDInsight in a Heartbeat [article] Understanding the HBase Ecosystem [article]

How to Build a Koa Web Application - Part 1

Christoffer Hallas
15 Dec 2014
8 min read
You may be a seasoned or novice web developer, but no matter your level of experience, you must always be able to set up a basic MVC application. This two part series will briefly show you how to use Koa, a bleeding edge Node.js web application framework to create a web application using MongoDB as its database. Koa has a low footprint and tries to be as unbiased as possible. For this series, we will also use Jade and Mongel, two Node.js libraries that provide HTML template rendering and MongoDB model interfacing, respectively. Note that this series requires you to use Node.js version 0.11+. At the end of the series, we will have a small and basic app where you can create pages with a title and content, list your pages, and view them. Let’s get going! Using NPM and Node.js If you do not already have Node.js installed, you can download installation packages at the official Node.js website, http://nodejs.org. I strongly suggest that you install Node.js in order to code along with the article. Once installed, Node.js will add two new programs to your computer that you can access from your terminal; they’re node and npm. The first program is the main Node.js program and is used to run Node.js applications, and the second program is the Node Package Manager and it’s used to install Node.js packages. For this application we start out in an empty folder by using npm to install four libraries: $ npm install koa jade mongel co-body Once this is done, open your favorite text editor and create an index.js file in the folder in which we will now start our creating our application. We start by using the require function to load the four libraries we just installed: var koa = require('koa'); var jade = require('jade'); var mongel = require('mongel'); var parse = require(‘co-body'); This simply loads the functionality of the libraries into the respective variables. This lets us create our Page model and our Koa app variables: var Page = mongel('pages', ‘mongodb://localhost/app'); var app = koa(); As you can see, we now use the variables mongel and koa that we previously loaded into our program using require. To create a model with mongel, all we have to do is give the name of our MongoDB collection and a MongoDB connection URI that represents the network location of the database; in this case we’re using a local installation of MongoDB and a database called app. It’s simple to create a basic Koa application, and as seen in the code above, all we do is create a new variable called app that is the result of calling the Koa library function. Middleware, generators, and JavaScript Koa uses a new feature in JavaScript called generators. Generators are not widely available in browsers yet except for some versions of Google Chrome, but since Node.js is built on the same JavaScript as Google Chrome it can use generators. The generators function is much like a regular JavaScript function, but it has a special ability to yield several values along with the normal ability of returning a single value. Some expert JavaScript programmers used this to create a new and improved way of writing asynchronous code in JavaScript, which is required when building a networked application such as a web application. The generators function is a complex subject and we won’t cover it in detail. We’ll just show you how to use it in our small and basic app. In Koa, generators are used as something called middleware, a concept that may be familiar to you from other languages such as Ruby and Python. 
Think of middleware as a stack of functions through which an HTTP request must travel in order to create an appropriate response. Middleware should be created so that the functionality of a given middleware is encapsulated together. In our case, this means we’ll be creating two pieces of middleware: one to create pages and one to list pages or show a page. Let’s create our first middleware: app.use(function* (next) { … }); As you can see, we start by calling the app.use function, which takes a generator as its argument, and this effectively pushes the generator into the stack. To create a generator, we use a special function syntax where an asterisk is added as seen in the previous code snippet. We let our generator take a single argument called next, which represents the next middleware in the stack, if any. From here on, it is simply a matter of checking and responding to the parameters of the HTTP request, which are accessible to us in the Koa context. This is also the function context, which in JavaScript is the keyword this, similar to other languages and the keyword self: if (this.path != '/create') { yield next; return } Since we’re creating some middleware that helps us create pages, we make sure that this request is for the right path, in our case, /create; if not, we use the yield keyword and the next argument to pass the control of the program to the next middleware. Please note the return keyword that we also use; this is very important in this case as the middleware would otherwise continue while also passing control to the next middleware. This is not something you want to happen unless the middleware you’re in will not modify the Koa context or HTTP response, because subsequent middleware will always expect that they’re now in control. Now that we have checked that the path is correct, we still have to check the method to see if we’re just showing the form to create a page, or if we should actually create a page in the database: if (this.method == 'POST') { var body = yield parse.form(this); var page = yield Page.createOne({    title: body.title,    contents: body.contents }); this.redirect('/' + page._id); return } else if (this.method != 'GET') { this.status = 405; this.body = 'Method Not Allowed'; return } To check the method, we use the Koa context again and the method attribute. If we’re handling a POST request we now know how to create a page, but this also means that we must extract extra information from the request. Koa does not process the body of a request, only the headers, so we use the co-body library that we downloaded early and loaded in as the parse variable. Notice how we yield on the parse.form function; this is because this is an asynchronous function and we have to wait until it is done before we continue the program. Then we proceed to use our mongel model Page to create a page using the data we found in the body of the request, again this is an asynchronous function and we use yield to wait before we finally redirect the request using the page’s database id. If it turns out the method was not POST, we still want to use this middleware to show the form that is actually used to issue the request. That means we have to make sure that the method is GET, so we added an else if statement to the original check, and if the request is neither POST or GET we respond with an HTTP status 405 and the message Method Not Allowed, which is the appropriate response for this case. 
Notice how we don't yield next; this is because the middleware was able to determine a satisfying response for the request and it requires no further processing. Finally, if the method was actually GET, we use the Jade library that we also installed using npm to render the create.jade template in HTML:

var html = jade.renderFile('create.jade');
this.body = html;

Notice how we set the Koa context's body attribute to the rendered HTML from Jade; all this does is tell Koa that we want to send that HTML back to the browser that sent the request.

Wrapping up

You are well on your way to creating your Koa app. In Part 2 we will implement Jade templates and list and view pages. Ready for the next step? Read Part 2 here. Explore all of our top Node.js content in one place - visit our Node.js page today!

About the author

Christoffer Hallas is a software developer and entrepreneur from Copenhagen, Denmark. He is a computer polyglot and contributes to and maintains a number of open source projects. When not contemplating his next grand idea (which remains an idea) he enjoys music, sports, and design of all kinds. Christoffer can be found on GitHub as hallas and on Twitter as @hamderhallas.

QGIS Feature Selection Tools

Packt
05 Dec 2014
4 min read
 In this article by Anita Graser, the author of Learning QGIS Third Edition, we will cover the following topics: Selecting features with the mouse Selecting features using expressions Selecting features using Spatial queries (For more resources related to this topic, see here.) Selecting features with the mouse The first group of tools in the Attributes toolbar allows us to select features on the map using the mouse. The following screenshot shows the Select Feature(s) tool. We can select a single feature by clicking on it or select multiple features by drawing a rectangle. The other tools can be used to select features by drawing different shapes: polygons, freehand areas, or circles around the features. All features that intersect with the drawn shape are selected. Holding down the Ctrl key will add the new selection to an existing one. Similarly, holding down Ctrl + Shift will remove the new selection from the existing selection. Selecting features by expression The second type of select tool is called Select by Expression, and it is also available in the Attribute toolbar. It selects features based on expressions that can contain references and functions using feature attributes and/or geometry. The list of available functions is pretty long, but we can use the search box to filter the list by name to find the function we are looking for faster. On the right-hand side of the window, we will find Selected Function Help, which explains the functionality and how to use the function in an expression. The Function List option also shows the layer attribute fields, and by clicking on Load all unique values or Load 10 sample values, we can easily access their content. As with the mouse tools, we can choose between creating a new selection or adding to or deleting from an existing selection. Additionally, we can choose to only select features from within an existing selection. Let's have a look at some example expressions that you can build on and use in your own work: Using the lakes.shp file in our sample data, we can, for example, select big lakes with an area bigger than 1,000 square miles using a simple attribute query, "AREA_MI" > 1000.0, or using geometry functions such as $area > (1000.0 * 27878400). Note that the lakes.shp CRS uses feet, and we, therefore, have to multiply by 27,878,400 to convert from square feet to square miles. The dialog will look like the one shown in the following screenshot. We can also work with string functions, for example, to find lakes with long names, such as length("NAMES") > 12, or lakes with names that contain the s or S character, such as lower("NAMES") LIKE '%s%', which first converts the names to lowercase and then looks for any appearance of s. Selecting features using spatial queries The third type of tool is called Spatial Query and allows us to select features in one layer based on their location, relative to the features in a second layer. These tools can be accessed by going to Vector | Research Tools | Select by location and then going to Vector | Spatial Query | Spatial Query. Enable it in Plugin Manager if you cannot find it in the Vector menu. In general, we want to use the Spatial Query plugin, as it supports a variety of spatial operations such as crosses, equals, intersects, is disjoint, overlaps, touches, and contains, depending on the layer's geometry type. Let's test the Spatial Query plugin using railroads.shp and pipelines.shp from the sample data. 
For example, we might want to find all the railroad features that cross a pipeline; we will, therefore, select the railroads layer, the Crosses operation, and the pipelines layer. After clicking on Apply, the plugin presents us with the query results. There is a list of IDs of the result features on the right-hand side of the window, as you can see in the following screenshot. Below this list, we can select the Zoom to item checkbox, and QGIS will zoom to the feature that belongs to the selected ID. Additionally, the plugin offers buttons to directly save all the resulting features to a new layer. Summary This article introduced you to three ways to select features in QGIS: selecting features with the mouse, using expressions, and using spatial queries. Resources for Article: Further resources on this subject: Editing attributes [article] Server Logs [article] Improving proximity filtering with KNN [article]

OGC for ESRI Professionals

Packt
27 Nov 2014
16 min read
In this article by Stefano Iacovella, author of GeoServer Cookbook, we look into a brief comparison between GeoServer and ArcGIS for Server, a map server created by ESRI. The importance of adopting OGC standards when building a geographical information system is stressed. We will also learn how OGC standards let us create a system where different pieces of software cooperate with each other. (For more resources related to this topic, see here.) ArcGIS versus GeoServer As an ESRI professional, you obviously know the server product from this vendor that can be compared to GeoServer well. It is called ArcGIS for Server and in many ways it can play the same role as that of GeoServer, and the opposite is true as well, of course. Undoubtedly, the big question for you is: why should I use GeoServer and not stand safely on the vendor side, leveraging on integration with the other software members of the big ArcGIS family? Listening to colleagues, asking to experts, and browsing on the Internet, you'll find a lot of different answers to this question, often supported by strong arguments and somehow by a religious and fanatic approach. There are a few benchmarks available on the Internet that compare performances of GeoServer and other open source map servers versus ArcGIS for Server. Although they're not definitely authoritative, a reasonably objective advantage of GeoServer and its OS cousins on ArcGIS for Server is recognizable. Anyway, I don't think that your choice should overestimate the importance of its performance. I'm sorry but my answer to your original question is another question: why should you choose a particular piece of software? This may sound puzzling, so let me elaborate a bit on the topic. Let's say you are an IT architect and a customer asked you to design a solution for a GIS portal. Of course, in that specific case, you have to give him or her a detailed response, containing specific software that'll be used for data publication. Also, as a professional, you'll arrive to the solution by accurately considering all requirements and constraints that can be inferred from the talks and surveying what is already up and running at the customer site. Then, a specific answer to what the software best suited for the task is should exist in any specific case. However, if you consider the question from a more general point of view, you should be aware that a map server, which is the best choice for any specific case, does not exist. You may find that the licensing costs a limit in some case or the performances in some other cases will lead you to a different choice. Also, as in any other job, the best tool is often the one you know better, and this is quite true when you are in a hurry and your customer can't wait to have the site up and running. So the right approach, although a little bit generic, is to keep your mind open and try to pick the right tool for any scenario. However, a general answer does exist. It's not about the vendor or the name of the piece of software you're going to use; it's about the way the components or your system communicate among them and with external systems. It's about standard protocol. This is a crucial consideration for any GIS architect or developer; nevertheless, if you're going to use an ESRI suite of products or open source tools, you should create your system with special care to expose data with open standards. 
Understanding standards Let's take a closer look at what standards are and why they're so important when you are designing your GIS solution. The term standard as mentioned in Wikipedia (http://en.wikipedia.org/wiki/ Technical_standard) may be explained as follows: "An established norm or requirement in regard to technical systems. It is usually a formal document that establishes uniform engineering or technical criteria, methods, processes and practices. In contrast, a custom, convention, company product, corporate standard, etc. that becomes generally accepted and dominant is often called a de facto standard." Obviously, a lot of standards exist if you consider the Information Technology domain. Standards are usually formalized by standards organization, which usually involves several members from different areas, such as government agencies, private companies, education, and so on. In the GIS world, an authoritative organization is the Open Geospatial Consortium (OGC), which you may find often cited in this book in many links to the reference information. In recent years, OGC has been publishing several standards that cover the interaction of the GIS system and details on how data is transferred from one software to another. We'll focus on three of them that are widely used and particularly important for GeoServer and ArcGIS for Server: WMS: This is the acronym for Web Mapping Service. This standard describes how a server should publish data for mapping purposes, which is a static representation of data. WFS: This is the acronym for Web Feature Service. This standard describes the details of publishing data for feature streaming to a client. WCS: This is the acronym for Web Coverage Service. This standard describes the details of publishing data for raster data streaming to a client. It's the equivalent of WFS applied to raster data. Now let's dive into these three standards. We'll explore the similarities and differences among GeoServer and ArcGIS for Server. WMS versus the mapping service As an ESRI user, you surely know how to publish some data in a map service. This lets you create a web service that can be used by a client who wants to show the map and data. This is the proprietary equivalent of exposing data through a WMS service. With WMS, you can inquire the server for its capabilities with an HTTP request: $ curl -XGET -H 'Accept: text/xml' 'http://localhost:8080/geoserver/wms?service=WMS &version=1.1.1&request=GetCapabilities' -o capabilitiesWMS.xml Browsing through the XML document, you'll know which data is published and how this can be represented. If you're using the proprietary way of exposing map services with ESRI, you can perform a similar query that starts from the root: $ curl -XGET 'http://localhost/arcgis/rest/services?f=pjson' -o capabilitiesArcGIS.json The output, in this case formatted as a JSON file, is a text file containing the first of the services and folders available to an anonymous user. It looks like the following code snippet: {"currentVersion": 10.22,"folders": ["Geology","Cultural data",…"Hydrography"],"services": [{"name": "SampleWorldCities","type": "MapServer"}]} At a glance, you can recognize two big differences here. Firstly, there are logical items, which are the folders that work only as a container for services. Secondly, there is no complete definition of items, just a list of elements contained at a certain level of a publishing tree. 
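For comparison, GeoServer offers its own administrative REST interface that answers a very similar question. The following request is only a sketch: it assumes a local GeoServer instance with the default admin/geoserver credentials, so adjust the host and credentials for your installation:

# List the layers published by a local GeoServer instance as JSON
$ curl -u admin:geoserver -XGET 'http://localhost:8080/geoserver/rest/layers.json' -o layersGeoServer.json

The returned JSON is simply a list of published layers, comparable to the ArcGIS services listing shown above. Let's now continue with the ArcGIS for Server interface.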
To obtain specific information about an element, you can perform another request pointing to the item: $ curl -XGET 'http://localhost/arcgis/rest/ services/SampleWorldCities/MapServer?f=pjson' -o SampleWorldCities.json Setting up an ArcGIS site is out of the scope of this book; besides, this appendix assumes that you are familiar with the software and its terminology. Anyway, all the examples use the SampleWorldCities service, which is a default service created by the standard installation. In the new JSON file, you'll find a lot of information about the specific service: {"currentVersion": 10.22,"serviceDescription": "A sample service just for demonstation.","mapName": "World Cities Population","description": "","copyrightText": "","supportsDynamicLayers": false,"layers": [{"id": 0,"name": "Cities","parentLayerId": -1,"defaultVisibility": true,"subLayerIds": null,"minScale": 0,"maxScale": 0},…"supportedImageFormatTypes":"PNG32,PNG24,PNG,JPG,DIB,TIFF,EMF,PS,PDF,GIF,SVG,SVGZ,BMP",…"capabilities": "Map,Query,Data","supportedQueryFormats": "JSON, AMF","exportTilesAllowed": false,"maxRecordCount": 1000,"maxImageHeight": 4096,"maxImageWidth": 4096,"supportedExtensions": "KmlServer"} Please note the information about the image format supported. We're, in fact, dealing with a map service. As for the operation supported, this one shows three different operations: Map, Query, and Data. For the first two, you can probably recognize the equivalent of the GetMap and GetFeatureinfo operations of WMS, while the third one is little bit more mysterious. In fact, it is not relevant to map services and we'll explore it in the next paragraph. If you're familiar with the GeoServer REST interface, you can see the similarities in the way you can retrieve information. We don't want to explore the ArcGIS for Server interface in detail and how to handle it. What is important to understand is the huge difference with the standard WMS capabilities document. If you're going to create a client to interact with maps produced by a mix of ArcGIS for Server and GeoServer, you should create different interfaces for both. In one case, you can interact with the proprietary REST interface and use the standard WMS for GeoServer. However, there is good news for you. ESRI also supports standards. If you go to the map service parameters page, you can change the way the data is published.   The situation shown in the previous screenshot is the default capabilities configuration. As you can see, there are options for WMS, WFS, and WCS, so you can expose your data with ArcGIS for Server according to the OGC standards. If you enable the WMS option, you can now perform this query: $ curl -XGET 'http://localhost/arcgis/ services/SampleWorldCities/MapServer/ WMSServer?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetCapabilities'    -o capabilitiesArcGISWMS.xml The information contained is very similar to that of the GeoServer capabilities. A point of attention is about fundamental differences in data publishing with the two software. In ArcGIS for Server, you always start from a map project. A map project is a collection of datasets, containing vector or raster data, with a drawing order, a coordinate reference system, and rules to draw. It is, in fact, very similar to a map project you can prepare with a GIS desktop application. Actually, in the ESRI world, you should use ArcGIS for desktop to prepare the map project and then publish it on the server. In GeoServer, the map concept doesn't exist. 
You publish data, setting several parameters, and the map composition is totally demanded to the client. You can only mimic a map, server side, using the group layer for a logical merge of several layers in a single entity. In ArcGIS for Server, the map is central to the publication process; also, if you just want to publish a single dataset, you have to create a map project, containing just that dataset, and publish it. Always remember this different approach; when using WMS, you can use the same operation on both servers. A GetMap request on the previous map service will look like this: $ curl -XGET 'http://localhost/arcgis/services/ SampleWorldCities/MapServer/WMSServer?service= WMS&version=1.1.0&request=GetMap&layers=fields&styles =&bbox=47.130647,8.931116,48.604188,29.54223&srs= EPSG:4326&height=445&width=1073&format=img/png' -o map.png Please note that you can filter what layers will be drawn in the map. By default, all the layers contained in the map service definition will be drawn. WFS versus feature access If you open the capabilities panel for the ArcGIS service again, you will note that there is an option called feature access. This lets you enable the feature streaming to a client. With this option enabled, your clients can acquire features and symbology information to ArcGIS and render them directly on the client side. In fact, feature access can also be used to edit features, that is, you can modify the features on the client and then post the changes on the server. When you check the Feature Access option, many specific settings appear. In particular, you'll note that by default, the Update operation is enabled, but the Geometry Updates is disabled, so you can't edit the shape of each feature. If you want to stream features using a standard approach, you should instead turn on the WFS option. ArcGIS for Server supports versions 1.1 and 1.0 of WFS. Moreover, the transactional option, also known as WFS-T, is fully supported.   As you can see in the previous screenshot, when you check the WFS option, several more options appear. In the lower part of the panel, you'll find the option to enable the transaction, which is the editing feature. In this case, there is no separate option for geometry and attributes; you can only decide to enable editing on any part of your features. After you enable the WFS, you can access the capabilities from this address: $ curl -XGET 'http://localhost/arcgis/services/ SampleWorldCities/MapServer/WFSServer?SERVICE=WFS&VERSION=1.1. 0&REQUEST=GetCapabilities' -o capabilitiesArcGISWFS.xml Also, a request for features is shown as follows: $ curl -XGET "http://localhost/arcgis/services/SampleWorldCities /MapServer/WFSServer?service=wfs&version=1.1.0 &request=GetFeature&TypeName=SampleWorldCities: cities&maxFeatures=1" -o getFeatureArcGIS.xml This will output a GML code as a result of your request. As with WMS, the syntax is the same. 
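As a point of comparison, the same standard request issued against GeoServer looks almost identical; only the endpoint and the layer name change. The topp:states layer used below is one of GeoServer's sample layers, so treat it as a placeholder for whatever your own server actually publishes:

# Equivalent WFS GetFeature request against a local GeoServer instance
$ curl -XGET "http://localhost:8080/geoserver/wfs?service=WFS&version=1.1.0&request=GetFeature&typeName=topp:states&maxFeatures=1" -o getFeatureGeoServer.xml

Both servers answer with a GML feature collection, which is exactly what the standard promises.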
You only need to pay attention to the difference between the service and the contained layers: <wfs:FeatureCollection xsi:schemaLocation="http://localhost/arcgis/services/SampleWorldCities/MapServer/WFSServer http://localhost/arcgis/services/SampleWorldCities/MapServer/WFSServer?request=DescribeFeatureType%26version=1.1.0%26typename=citieshttp://www.opengis.net/wfs http://schemas.opengis.net/wfs/1.1.0/wfs.xsd"><gml:boundedBy><gml:Envelope srsName="urn:ogc:def:crs:EPSG:6.9:4326"><gml:lowerCorner>-54.7919921875 -176.1514892578125</gml:lowerCorner><gml:upperCorner>78.2000732421875179.221923828125</gml:upperCorner></gml:Envelope></gml:boundedBy><gml:featureMember><SampleWorldCities:cities gml_id="F4__1"><SampleWorldCities:OBJECTID>1</SampleWorldCities:OBJECTID><SampleWorldCities:Shape><gml:Point><gml:pos>-15.614990234375 -56.093017578125</gml:pos></gml:Point></SampleWorldCities:Shape><SampleWorldCities:CITY_NAME>Cuiaba</SampleWorldCities:CITY_NAME><SampleWorldCities:POP>521934</SampleWorldCities:POP><SampleWorldCities:POP_RANK>3</SampleWorldCities:POP_RANK><SampleWorldCities:POP_CLASS>500,000 to999,999</SampleWorldCities:POP_CLASS><SampleWorldCities:LABEL_FLAG>0</SampleWorldCities:LABEL_FLAG></SampleWorldCities:cities></gml:featureMember></wfs:FeatureCollection> Publishing raster data with WCS The WCS option is always present in the panel to configure services. As we already noted, WCS is used to publish raster data, so this may sound odd to you. Indeed, ArcGIS for Server lets you enable the WCS option, only if the map project for the service contains one of the following: A map containing raster or mosaic layers A raster or mosaic dataset A layer file referencing a raster or mosaic dataset A geodatabase that contains raster data If you try to enable the WCS option on SampleWorldCities, you won't get an error. Then, try to ask for the capabilities: $ curl -XGET "http://localhost/arcgis/services /SampleWorldCities/MapServer/ WCSServer?SERVICE=WCS&VERSION=1.1.1&REQUEST=GetCapabilities" -o capabilitiesArcGISWCS.xml You'll get a proper document, compliant to the standard and well formatted, but containing no reference to any dataset. Indeed, the sample service does not contain any raster data:  <Capabilities xsi_schemaLocation="http://www.opengis.net/wcs/1.1.1http://schemas.opengis.net/wcs/1.1/wcsGetCapabilities.xsdhttp://www.opengis.net/ows/1.1/http://schemas.opengis.net/ows/1.1.0/owsAll.xsd"version="1.1.1"><ows:ServiceIdentification><ows:Title>WCS</ows:Title><ows:ServiceType>WCS</ows:ServiceType><ows:ServiceTypeVersion>1.0.0</ows:ServiceTypeVersion><ows:ServiceTypeVersion>1.1.0</ows:ServiceTypeVersion><ows:ServiceTypeVersion>1.1.1</ows:ServiceTypeVersion><ows:ServiceTypeVersion>1.1.2</ows:ServiceTypeVersion><ows:Fees>NONE</ows:Fees><ows:AccessConstraints>None</ows:AccessConstraints></ows:ServiceIdentification>...<Contents><SupportedCRS>urn:ogc:def:crs:EPSG::4326</SupportedCRS><SupportedFormat>image/GeoTIFF</SupportedFormat><SupportedFormat>image/NITF</SupportedFormat><SupportedFormat>image/JPEG</SupportedFormat><SupportedFormat>image/PNG</SupportedFormat><SupportedFormat>image/JPEG2000</SupportedFormat><SupportedFormat>image/HDF</SupportedFormat></Contents></Capabilities> If you want to try out WCS, other than the GetCapabilities operation, you need to publish a service with raster data; or, you may take a look at the sample service from ESRI arcgisonline™. 
Try the following request: $ curl -XGET "http://sampleserver3.arcgisonline.com/ ArcGIS/services/World/Temperature/ImageServer/ WCSServer?SERVICE=WCS&VERSION=1.1.0&REQUEST=GETCAPABILITIES" -o capabilitiesArcGISWCS.xml Parsing the XML file, you'll find that the contents section now contains coverage, raster data that you can retrieve from that server:  …<Contents><CoverageSummary><ows:Title>Temperature1950To2100_1</ows:Title><ows:Abstract>Temperature1950To2100</ows:Abstract><ows:WGS84BoundingBox><ows:LowerCorner>-179.99999999999994 -55.5</ows:LowerCorner><ows:UpperCorner>180.00000000000006 83.5</ows:UpperCorner></ows:WGS84BoundingBox><Identifier>1</Identifier></CoverageSummary><SupportedCRS>urn:ogc:def:crs:EPSG::4326</SupportedCRS><SupportedFormat>image/GeoTIFF</SupportedFormat><SupportedFormat>image/NITF</SupportedFormat><SupportedFormat>image/JPEG</SupportedFormat><SupportedFormat>image/PNG</SupportedFormat><SupportedFormat>image/JPEG2000</SupportedFormat><SupportedFormat>image/HDF</SupportedFormat></Contents> You can, of course, use all the operations supported by standard. The following request will return a full description of one or more coverages within the service in the GML format. An example of the URL is shown as follows: $ curl -XGET "http://sampleserver3.arcgisonline.com/ ArcGIS/services/World/Temperature/ImageServer/ WCSServer?SERVICE=WCS&VERSION=1.1.0&REQUEST=DescribeCoverage& COVERAGE=1" -o describeCoverageArcGISWCS.xml Also, you can obviously request for data, and use requests that will return coverage in one of the supported formats, namely GeoTIFF, NITF, HDF, JPEG, JPEG2000, and PNG. Another URL example is shown as follows: $ curl -XGET "http://sampleserver3.arcgisonline.com/ ArcGIS/services/World/Temperature/ImageServer/ WCSServer?SERVICE=WCS&VERSION=1.0.0 &REQUEST=GetCoverage&COVERAGE=1&CRS=EPSG:4326 &RESPONSE_CRS=EPSG:4326&BBOX=-158.203125,- 105.46875,158.203125,105.46875&WIDTH=500&HEIGHT=500&FORMAT=jpeg" -o coverage.jpeg  Summary In this article, we started with the differences between ArcGIS and GeoServer and then moved on to understanding standards. Then we went on to compare WMS with mapping service as well as WFS with feature access. Finally we successfully published a raster dataset with WCS. Resources for Article: Further resources on this subject: Getting Started with GeoServer [Article] Enterprise Geodatabase [Article] Sending Data to Google Docs [Article]


Setting up Qt Creator for Android

Packt
27 Nov 2014
8 min read
This article by Ray Rischpater, the author of the book Application Development with Qt Creator Second Edition, focuses on setting up Qt Creator for Android. Android's functionality is delimited in API levels; Qt for Android supports Android level 10 and above: that's Android 2.3.3, a variant of Gingerbread. Fortunately, most devices in the market today are at least Gingerbread, making Qt for Android a viable development platform for millions of devices. Downloading all the pieces To get started with Qt Creator for Android, you're going to need to download a lot of stuff. Let's get started: Begin with a release of Qt for Android, which you can download from http://qt-project.org/downloads. The Android developer tools require the current version of the Java Development Kit (JDK) (not just the runtime, the Java Runtime Environment, but the whole kit and caboodle); you can download it from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html. You need the latest Android Software Development Kit (SDK), which you can download for Mac OS X, Linux, or Windows at http://developer.android.com/sdk/index.html. You need the latest Android Native Development Kit (NDK), which you can download at http://developer.android.com/tools/sdk/ndk/index.html. You need the current version of Ant, the Java build tool, which you can download at http://ant.apache.org/bindownload.cgi. Download, unzip, and install each of these, in the given order. On Windows, I installed the Android SDK and NDK by unzipping them to the root of my hard drive and installed the JDK at the default location I was offered. Setting environment variables Once you install the JDK, you need to be sure that you've set your JAVA_HOME environment variable to point to the directory where it was installed. How you will do this differs from platform to platform; on a Mac OS X or Linux box, you'd edit .bashrc, .tcshrc, or the likes; on Windows, go to System Properties, click on Environment Variables, and add the JAVA_HOME variable. The path should point to the base of the JDK directory; for me, it was C:\Program Files\Java\jdk1.7.0_25, although the path for you will depend on where you installed the JDK and which version you installed. (Make sure you set the path with the trailing directory separator; the Android SDK is pretty fussy about that sort of thing.) Next, you need to update your PATH to point to all the stuff you just installed. Again, this is an environment variable and you'll need to add the following: The bin directory of your JDK The tools directory of the Android SDK The platform-tools directory of the Android SDK For me, on my Windows 8 computer, my PATH includes this now: …C:\Program Files\Java\jdk1.7.0_25\bin;C:\adt-bundle-windows-x86_64-20130729\sdk\tools;C:\adt-bundle-windows-x86_64-20130729\sdk\platform-tools;… Don't forget the separators: on Windows, it's a semicolon (;), while on Mac OS X and Linux, it's a colon (:). An environment variable is a variable maintained by your operating system which affects its configuration; see http://en.wikipedia.org/wiki/Environment_variable for more details.
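For readers on Mac OS X or Linux, a minimal sketch of the equivalent shell configuration might look like the following; the installation paths shown here are only examples and must be adjusted to wherever you actually unpacked the JDK, SDK, and NDK:

# Example ~/.bashrc additions (hypothetical paths; adjust to your own installation)
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_25
export ANDROID_SDK=$HOME/android/adt-bundle-linux-x86_64-20130729/sdk
export PATH=$PATH:$JAVA_HOME/bin:$ANDROID_SDK/tools:$ANDROID_SDK/platform-tools

Whether the file to edit is .bashrc, .bash_profile, or .tcshrc depends on your shell, as noted above.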
At this point, it's a good idea to restart your computer (if you're running Windows) or log out and log in again (on Linux or Mac OS X) to make sure that all these settings take effect. If you're on a Mac OS X or Linux box, you might be able to start a new terminal and have the same effect (or reload your shell configuration file) instead, but I like the idea of restarting at this point to ensure that the next time I start everything up, it'll work correctly. Finishing the Android SDK installation Now, we need to use the Android SDK tools to ensure that you have a full version of the SDK for at least one Android API level installed. We'll need to start Eclipse, the Android SDK's development environment, and run the Android SDK manager. To do this, follow these steps: Find Eclipse. It's probably in the Eclipse directory of the directory where you installed the Android SDK. If Eclipse doesn't start, check your JAVA_HOME and PATH variables; the odds are that Eclipse will not find the Java environment it needs to run. Click on OK when Eclipse prompts you for a workspace. This doesn't matter; you won't use Eclipse except to download Android SDK components. Click on the Android SDK Manager button in the Eclipse toolbar (circled in the next screenshot): Make sure that you have at least one Android API level above API level 10 installed, along with the Google USB Driver (you'll need this to debug on the hardware). Quit Eclipse. Next, let's see whether the Android Debug Bridge—the software component that transfers your executables to your Android device and supports on-device debugging—is working as it should. Fire up a shell prompt and type adb. If you see a lot of output and no errors, the bridge is correctly installed. If not, go back and check your PATH variable to be sure it's correct. While you're at it, you should developer-enable your Android device too so that it'll work with ADB. Follow the steps provided at http://bit.ly/1a29sal. Configuring Qt Creator Now, it's time to tell Qt Creator about all the stuff you just installed. Perform the following steps: Start Qt Creator but don't create a new project. Under the Tools menu, select Options and then click on Android. Fill in the blanks, as shown in the next screenshot. They should be: The path to the SDK directory, in the directory where you installed the Android SDK. The path to where you installed the Android NDK. Check Automatically create kits for Android tool chains. The path to Ant; here, enter either the path to the Ant executable itself on Mac OS X and Linux platforms or the path to ant.bat in the bin directory of the directory where you unpacked Ant. The directory where you installed the JDK (this might be automatically picked up from your JAVA_HOME directory), as shown in the following screenshot: Click on OK to close the Options window. You should now be able to create a new Qt GUI or Qt Quick application for Android! Do so, and ensure that Android is a target option in the wizard, as the next screenshot shows; be sure to choose at least one ARM target, one x86 target, and one target for your desktop environment: If you want to add Android build configurations to an existing project, the process is slightly different. Perform the following steps: Load the project as you normally would. Click on Projects in the left-hand side pane. The Projects pane will open. Click on Add Kit and choose the desired Android (or other) device build kit. 
The following screenshot shows you where the Projects and Add Kit buttons are in Qt Creator: Building and running your application Write and build your application normally. A good idea is to build the Qt Quick Hello World application for Android first before you go to town and make a lot of changes, and test the environment by compiling for the device. When you're ready to run on the device, perform the following steps: Navigate to Projects (on the left-hand side) and then select the Android for arm kit's Run Settings. Under Package Configurations, ensure that the Android SDK level is set to the SDK level of the SDK you installed. Ensure that the Package name reads something similar to org.qtproject.example, followed by your project name. Connect your Android device to your computer using the USB cable. Select the Android for arm run target and then click on either Debug or Run to debug or run your application on the device. Summary Qt for Android gives you an excellent leg up on mobile development, but it's not a panacea. If you're planning to target mobile devices, you should be sure to have a good understanding of the usage patterns for your application's users as well as the constraints in CPU, GPU, memory, and network that a mobile application must run on. Once we understand these, however, all of our skills with Qt Creator and Qt carry over to the mobile arena. To develop for Android, begin by installing the JDK, Android SDK, Android NDK, and Ant, and then develop applications as usual: compiling for the device and running on the device frequently to iron out any unexpected problems along the way. Resources for Article: Further resources on this subject: Reversing Android Applications [article] Building Android (Must know) [article] Introducing an Android platform [article]


Modernizing our Spring Boot app

Packt
26 Nov 2014
15 min read
In this article by Greg L. Turnquist, the author of the book, Learning Spring Boot, we will discuss modernizing our Spring Boot app with JavaScript and adding production-ready support features. (For more resources related to this topic, see here.) Modernizing our app with JavaScript We just saw that, with a single @Grab statement, Spring Boot automatically configured the Thymeleaf template engine and some specialized view resolvers. We took advantage of Spring MVC's ability to pass attributes to the template through ModelAndView. Instead of figuring out the details of view resolvers, we instead channeled our efforts into building a handy template to render data fetched from the server. We didn't have to dig through reference docs, Google, and Stack Overflow to figure out how to configure and integrate Spring MVC with Thymeleaf. We let Spring Boot do the heavy lifting. But that's not enough, right? Any real application is going to also have some JavaScript. Love it or hate it, JavaScript is the engine for frontend web development. See how the following code lets us make things more modern by creating modern.groovy: @Grab("org.webjars:jquery:2.1.1")@Grab("thymeleaf-spring4")@Controllerclass ModernApp {def chapters = ["Quick Start With Groovy","Quick Start With Java","Debugging and Managing Your App","Data Access with Spring Boot","Securing Your App"]@RequestMapping("/")def home(@RequestParam(value="name", defaultValue="World")String n) {new ModelAndView("modern").addObject("name", n).addObject("chapters", chapters)}} A single @Grab statement pulls in jQuery 2.1.1. The rest of our server-side Groovy code is the same as before. There are multiple ways to use JavaScript libraries. For Java developers, it's especially convenient to use the WebJars project (http://webjars.org), where lots of handy JavaScript libraries are wrapped up with Maven coordinates. Every library is found on the /webjars/<library>/<version>/<module> path. To top it off, Spring Boot comes with prebuilt support. Perhaps you noticed this buried in earlier console outputs: ...2014-05-20 08:33:09.062 ... : Mapped URL path [/webjars/**] onto handlerof [...... With jQuery added to our application, we can amp up our template (templates/modern.html) like this: <html><head><title>Learning Spring Boot - Chapter 1</title><script src="webjars/jquery/2.1.1/jquery.min.js"></script><script>$(document).ready(function() {$('p').animate({fontSize: '48px',}, "slow");});</script></head><body><p th_text="'Hello, ' + ${name}"></p><ol><li th_each="chapter : ${chapters}"th:text="${chapter}"></li></ol></body></html> What's different between this template and the previous one? It has a couple extra <script> tags in the head section: The first one loads jQuery from /webjars/jquery/2.1.1/jquery.min.js (implying that we can also grab jquery.js if we want to debug jQuery) The second script looks for the <p> element containing our Hello, world! message and then performs an animation that increases the font size to 48 pixels after the DOM is fully loaded into the browser If we run spring run modern.groovy and visit http://localhost:8080, then we can see this simple but stylish animation. It shows us that all of jQuery is available for us to work with on our application. Using Bower instead of WebJars WebJars isn't the only option when it comes to adding JavaScript to our app. More sophisticated UI developers might use Bower (http://bower.io), a popular JavaScript library management tool. 
WebJars are useful for Java developers, but not every library has been bundled as a WebJar. There is also a huge community of frontend developers more familiar with Bower and NodeJS that will probably prefer using their standard tool chain to do their jobs. We'll see how to plug that into our app. First, it's important to know some basic options. Spring Boot supports serving up static web resources from the following paths: /META-INF/resources/ /resources/ /static/ /public/ To craft a Bower-based app with Spring Boot, we first need to craft a .bowerrc file in the same folder we plan to create our Spring Boot CLI application. Let's pick public/ as the folder of choice for JavaScript modules and put it in this file, as shown in the following code: {"directory": "public/"} Do I have to use public? No. Again, you can pick any of the folders listed previously and Spring Boot will serve up the code. It's a matter of taste and semantics. Our first step towards a Bower-based app is to define our project by answering a series of questions (this only has to be done once): $ bower init[?] name: app_with_bower[?] version: 0.1.0[?] description: Learning Spring Boot - bower sample[?] main file:[?] what types of modules does this package expose? amd[?] keywords:[?] authors: Greg Turnquist <[email protected]>[?] license: ASL[?] homepage: http://blog.greglturnquist.com/category/learning-springboot[?] set currently installed components as dependencies? No[?] add commonly ignored files to ignore list? Yes[?] would you like to mark this package as private which prevents it frombeing accidentally published to the registry? Yes...[?] Looks good? Yes Now that we have set our project, let's do something simple such as install jQuery with the following command: $ bower install jquery --savebower jquery#* cached git://github.com/jquery/jquery.git#2.1.1bower jquery#* validate 2.1.1 against git://github.com/jquery/jquery.git#* These two commands will have created the following bower.json file: {"name": "app_with_bower","version": "0.1.0","authors": ["Greg Turnquist <[email protected]>"],"description": "Learning Spring Boot - bower sample","license": "ASL","homepage": "http://blog.greglturnquist.com/category/learningspring-boot","private": true,"ignore": ["**/.*","node_modules","bower_components","public/","test","tests"],"dependencies": {"jquery": "~2.1.1"}} It will also have installed jQuery 2.1.1 into our app with the following directory structure: public└── jquery├── MIT-LICENSE.txt├── bower.json└── dist├── jquery.js└── jquery.min.js We must include --save (two dashes) whenever we install a module. This ensures that our bower.json file is updated at the same time, allowing us to rebuild things if needed. The altered version of our app with WebJars removed should now look like this: @Grab("thymeleaf-spring4")@Controllerclass ModernApp {def chapters = ["Quick Start With Groovy","Quick Start With Java","Debugging and Managing Your App","Data Access with Spring Boot","Securing Your App"]@RequestMapping("/")def home(@RequestParam(value="name", defaultValue="World")String n) {new ModelAndView("modern_with_bower").addObject("name", n).addObject("chapters", chapters)}} The view name has been changed to modern_with_bower, so it doesn't collide with the previous template if found in the same folder. 
This version of the template, templates/modern_with_bower.html, should look like this: <html><head><title>Learning Spring Boot - Chapter 1</title><script src="jquery/dist/jquery.min.js"></script><script>$(document).ready(function() {$('p').animate({fontSize: '48px',}, "slow");});</script></head><body><p th_text="'Hello, ' + ${name}"></p><ol><li th_each="chapter : ${chapters}"th:text="${chapter}"></li></ol></body></html> The path to jquery is now jquery/dist/jquery.min.js. The rest is the same as the WebJars example. We just launch the app with spring run modern_with_bower.groovy and navigate to http://localhost:8080. (Might need to refresh the page to ensure loading of the latest HTML.) The animation should work just the same. The options shown in this section can quickly give us a taste of how easy it is to use popular JavaScript tools with Spring Boot. We don't have to fiddle with messy tool chains to achieve a smooth integration. Instead, we can use them the way they are meant to be used. What about an app that is all frontend with no backend? Perhaps we're building an app that gets all its data from a remote backend. In this age of RESTful backends, it's not uncommon to build a single page frontend that is fed data updates via AJAX. Spring Boot's Groovy support provides the perfect and arguably smallest way to get started. We do so by creating pure_javascript.groovy, as shown in the following code: @Controllerclass JsApp { } That doesn't look like much, but it accomplishes a lot. Let's see what this tiny fragment of code actually does for us: The @Controller annotation, like @RestController, causes Spring Boot to auto-configure Spring MVC. Spring Boot will launch an embedded Apache Tomcat server. Spring Boot will serve up static content from resources, static, and public. Since there are no Spring MVC routes in this tiny fragment of code, things will fall to resource resolution. Next, we can create a static/index.html page as follows: <html>Greetings from pure HTML which can, in turn, load JavaScript!</html> Run spring run pure_javascript.groovy and navigate to http://localhost:8080. We will see the preceding plain text shown in our browser as expected. There is nothing here but pure HTML being served up by our embedded Apache Tomcat server. This is arguably the lightest way to serve up static content. Use spring jar and it's possible to easily bundle up our client-side app to be installed anywhere. Spring Boot's support for static HTML, JavaScript, and CSS opens the door to many options. We can add WebJar annotations to JsApp or use Bower to introduce third-party JavaScript libraries in addition to any custom client-side code. We might just manually download the JavaScript and CSS. No matter what option we choose, Spring Boot CLI certainly provides a super simple way to add rich-client power for app development. To top it off, RESTful backends that are decoupled from the frontend can have different iteration cycles as well as different development teams. You might need to configure CORS (http://spring.io/understanding/CORS) to properly handle making remote calls that don't go back to the original server. Adding production-ready support features So far, we have created a Spring MVC app with minimal code. We added views and JavaScript. We are on the verge of a production release. 
Before deploying our rapidly built and modernized web application, we might want to think about potential issues that might arise in production: What do we do when the system administrator wants to configure his monitoring software to ping our app to see if it's up? What happens when our manager wants to know the metrics of people hitting our app? What are we going to do when the Ops center supervisor calls us at 2:00 a.m. and we have to figure out what went wrong? The last feature we are going to introduce in this article is Spring Boot's Actuator module and CRaSH remote shell support (http://www.crashub.org). These two modules provide some super slick, Ops-oriented features that are incredibly valuable in a production environment. We first need to update our previous code (we'll call it ops.groovy), as shown in the following code: @Grab("spring-boot-actuator")@Grab("spring-boot-starter-remote-shell")@Grab("org.webjars:jquery:2.1.1")@Grab("thymeleaf-spring4")@Controllerclass OpsReadyApp {@RequestMapping("/")def home(@RequestParam(value="name", defaultValue="World")String n) {new ModelAndView("modern").addObject("name", n)}} This app is exactly like the WebJars example with two key differences: it adds @Grab("spring-boot-actuator") and @Grab("spring-boot-starter-remote-shell"). When you run this version of our app, the same business functionality is available that we saw earlier, but there are additional HTTP endpoints available: Actuator endpoint Description /autoconfig This reports what Spring Boot did and didn't auto-configure and why /beans This reports all the beans configured in the application context (including ours as well as the ones auto-configured by Boot) /configprops This exposes all configuration properties /dump This creates a thread dump report /env This reports on the current system environment /health This is a simple endpoint to check life of the app /info This serves up custom content from the app /metrics This shows counters and gauges on web usage /mappings This gives us details about all Spring MVC routes /trace This shows details about past requests Pinging our app for general health Each of these endpoints can be visited using our browser or using other tools such as curl. For example, let's assume we ran spring run ops.groovy and then opened up another shell. From the second shell, let's run the following curl command: $ curl localhost:8080/health{"status":"UP"} This immediately solves our first need listed previously. We can inform the system administrator that he or she can write a management script to interrogate our app's health. Gathering metrics Be warned that each of these endpoints serves up a compact JSON document. Generally speaking, command-line curl probably isn't the best option. While it's convenient on *nix and Mac systems, the content is dense and hard to read. It's more practical to have: A JSON plugin installed in our browser (such as JSONView at http://jsonview.com) A script that uses a JSON parsing library if we're writing a management script (such as Groovy's JsonSlurper at http://groovy.codehaus.org/gapi/groovy/json/JsonSlurper.html or JSONPath at https://code.google.com/p/json-path) Assuming we have JSONView installed, the following screenshot shows a listing of metrics: It lists counters for each HTTP endpoint. According to this, /metrics has been visited four times with a successful 200 status code. Someone tried to access /foo, but it failed with a 404 error code. 
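To show how a management script could consume these endpoints, here is a small Groovy sketch that polls /health and /metrics using JsonSlurper, as suggested above; the counter and gauge names are only the ones visible in this report and will differ in your own application:

import groovy.json.JsonSlurper

// Fetch and parse the Actuator endpoints (sketch only).
def slurper = new JsonSlurper()
def health = slurper.parseText(new URL('http://localhost:8080/health').text)
println "status: ${health.status}"

def metrics = slurper.parseText(new URL('http://localhost:8080/metrics').text)
// Print every counter and gauge reported by the application.
metrics.each { name, value ->
  if (name.startsWith('counter.') || name.startsWith('gauge.')) {
    println "$name = $value"
  }
}

A script like this could be run periodically and its output appended to a file, along the lines of the collection options discussed below.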
The report also lists gauges for each endpoint, reporting the last response time. In this case, /metrics took 2 milliseconds. Also included are some memory stats as well as the total CPUs available. It's important to realize that the metrics start at 0. To generate some numbers, you might want to first click on some links before visiting /metrics. The following screenshot shows a trace report: It shows the entire web request and response for curl localhost:8080/health. This provides a basic framework of metrics to satisfy our manager's needs. It's important to understand that metrics gathered by Spring Boot Actuator aren't persistent across application restarts. So to gather long-term data, we have to gather them and then write them elsewhere. With these options, we can perform the following: Write a script that gathers metrics every hour and appends them to a running spreadsheet somewhere else in the filesystem, such as a shared drive. This might be simple, but probably also crude. To step it up, we can dump the data into a Hadoop filesystem for raw collection and configure Spring XD (http://projects.spring.io/spring-xd/) to consume it. Spring XD stands for Spring eXtreme Data. It is an open source product that makes it incredibly easy to chain together sources and sinks comprised of many components, such as HTTP endpoints, Hadoop filesystems, Redis metrics, and RabbitMQ messaging. Unfortunately, there is no space to dive into this subject. With any monitoring, it's important to check that we aren't taxing the system too heavily. The same container responding to business-related web requests is also serving metrics data, so it will be wise to engage profilers periodically to ensure that the whole system is performing as expected. Detailed management with CRaSH So what can we do when we receive that 2:00 a.m. phone call from the Ops center? After either coming in or logging in remotely, we can access the convenient CRaSH shell we configured. Every time the app launches, it generates a random password for SSH access and prints this to the local console: 2014-06-11 23:00:18.822 ... : Configuring property ssh.port=2000 fromproperties2014-06-11 23:00:18.823 ... : Configuring property ssh.authtimeout=600000 fro...2014-06-11 23:00:18.824 ... : Configuring property ssh.idletimeout=600000 fro...2014-06-11 23:00:18.824 ... : Configuring property auth=simple fromproperties2014-06-11 23:00:18.824 ... : Configuring property auth.simple.username=user f...2014-06-11 23:00:18.824 ... : Configuring property auth.simple.password=bdbe4a... We can easily see that there's SSH access on port 2000 via a user if we use this information to log in: $ ssh -p 2000 user@localhostPassword authenticationPassword:. 
____ _ __ _ _/\ / ___'_ __ _ _(_)_ __ __ _ ( ( )___ | '_ | '_| | '_ / _' | \/ ___)| |_)| | | | | || (_| | ) ) ) )' |____| .__|_| |_|_| |___, | / / / /=========|_|==============|___/=/_/_/_/:: Spring Boot :: (v1.1.6.RELEASE) on retina> There's a fistful of commands: help: This gets a listing of available commands dashboard: This gets a graphic, text-based display of all the threads, environment properties, memory, and other things autoconfig: This prints out a report of which Spring Boot auto-configuration rules were applied and which were skipped (and why) All of the previous commands have man pages: > man autoconfigNAMEautoconfig - Display auto configuration report fromApplicationContextSYNOPSISautoconfig [-h | --help]STREAMautoconfig <java.lang.Void, java.lang.Object>PARAMETERS[-h | --help]Display this help message... There are many commands available to help manage our application. More details are available at http://www.crashub.org/1.3/reference.html. Summary In this article, we learned about modernizing our Spring Boot app with JavaScript and adding production-ready support features. We plugged in Spring Boot's Actuator module as well as the CRaSH remote shell, configuring it with metrics, health, and management features so that we can monitor it in production by merely adding two lines of extra code. Resources for Article: Further resources on this subject: Getting Started with Spring Security [Article] Spring Roo 1.1: Working with Roo-generated Web Applications [Article] Spring Security 3: Tips and Tricks [Article]


Concurrency in Practice

Packt
26 Nov 2014
25 min read
This article written by Aleksandar Prokopec, the author of Learning Concurrent Programming in Scala, helps you develop skills that are necessary to write correct and efficient concurrent programs. It teaches you about concurrency in Scala through a sequence of programs. (For more resources related to this topic, see here.) "The best theory is inspired by practice."                                          -Donald Knuth We have studied a plethora of different concurrency facilities in this article. By now, you will have learned about dozens of different ways of starting concurrent computations and accessing shared data. Knowing how to use different styles of concurrency is useful, but it might not yet be obvious when to use which. The goal of this article is to introduce the big picture of concurrent programming. We will study the use cases for various concurrency abstractions, see how to debug concurrent programs, and how to integrate different concurrency libraries in larger applications. In this article, we perform the following tasks: Investigate how to deal with various kinds of bugs appearing in concurrent applications Learn how to identify and resolve performance bottlenecks Apply the previous knowledge about concurrency to implement a larger concurrent application, namely, a remote file browser We start with an overview of the important concurrency frameworks that we learned about in this article, and a summary of when to use each of them. Choosing the right tools for the job In this section, we present an overview of the different concurrency libraries that we learned about. We take a step back and look at the differences between these libraries, and what they have in common. This summary will give us an insight into what different concurrency abstractions are useful for. A concurrency framework usually needs to address several concerns: It must provide a way to declare data that is shared between concurrent executions It must provide constructs for reading and modifying program data It must be able to express conditional execution, triggered when a certain set of conditions are fulfilled It must define a way to start concurrent executions Some of the frameworks from this article address all of these concerns; others address only a subset, and transfer part of the responsibility to another framework. Typically, in a concurrent programming model, we express concurrently shared data differently from data intended to be accessed only from a single thread. This allows the JVM runtime to optimize sequential parts of the program more effectively. So far, we've seen a lot of different ways to express concurrently shared data, ranging from the low-level facilities to advanced high-level abstractions. We summarize different data abstractions in the following table: Data abstraction Datatype or annotation Description Volatile variables (JDK) @volatile Ensure visibility and the happens-before relationship on class fields and local variables that are captured in closures. Atomic variables (JDK) AtomicReference[T] AtomicInteger AtomicLong Provide basic composite atomic operations, such as compareAndSet and incrementAndGet. Futures and promises (scala.concurrent) Future[T] Promise[T] Sometimes called single-assignment variables, these express values that might not be computed yet, but will eventually become available. Observables and subjects (Rx) Observable[T] Subject[T] Also known as first-class event streams, these describe many different values that arrive one after another in time. 
Transactional references (Scala Software Transactional Memory (STM)) Ref[T] These describe memory locations that can only be accessed from within memory transactions. Their modifications only become visible after the transaction successfully commits. The next important concern is providing access to shared data, which includes reading and modifying shared memory locations. Usually, a concurrent program uses special constructs to express such accesses. We summarize the different data access constructs in the following table: Data abstraction Data access constructs Description Arbitrary data (JDK) synchronized   Uses intrinsic object locks to exclude access to arbitrary shared data. Atomic variables and classes (JDK) compareAndSet Atomically exchanges the value of a single memory location. It allows implementing lock-free programs. Futures and promises (scala.concurrent) value tryComplete Used to assign a value to a promise, or to check the value of the corresponding future. The value method is not a preferred way to interact with a future. Transactional references (ScalaSTM) atomic orAtomic single Atomically modify the values of a set of memory locations. Reduces the risk of deadlocks, but disallow side effects inside the transactional block. Concurrent data access is not the only concern of a concurrency framework. Concurrent computations sometimes need to proceed only after a certain condition is met. In the following table, we summarize different constructs that enable this: Concurrency framework Conditional execution constructs Description JVM concurrency wait notify notifyAll Used to suspend the execution of a thread until some other thread notifies that the conditions are met. Futures and promises onComplete Await.ready Conditionally schedules an asynchronous computation. The Await.ready method suspends the thread until the future completes. Reactive extensions subscribe Asynchronously or synchronously executes a computation when an event arrives. Software transactional memory retry retryFor withRetryTimeout Retries the current memory transaction when some of the relevant memory locations change. Actors receive Executes the actor's receive block when a message arrives. Finally, a concurrency model must define a way to start a concurrent execution. We summarize different concurrency constructs in the following table: Concurrency framework Concurrency constructs Description JVM concurrency Thread.start Starts a new thread of execution. Execution contexts execute Schedules a block of code for execution on a thread pool. Futures and promises Future.apply Schedules a block of code for execution, and returns the future value with the result of the execution. Parallel collections par Allows invoking data-parallel versions of collection methods. Reactive extensions Observable.create observeOn The create method defines an event source. The observeOn method schedules the handling of events on different threads. Actors actorOf Schedules a new actor object for execution. This breakdown shows us that different concurrency libraries focus on different tasks. For example, parallel collections do not have conditional waiting constructs, because a data-parallel operation proceeds on separate elements independently. Similarly, software transactional memory does not come with a construct to express concurrent computations, and focuses only on protecting access to shared data. 
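To make two of the data access constructs from these tables concrete, here is a small, hedged Scala sketch that increments a shared counter first with an atomic variable and then with a ScalaSTM reference; it assumes the scala-stm library is on the classpath, and the object name is purely illustrative:

import java.util.concurrent.atomic.AtomicLong
import scala.concurrent.stm._

object SharedCounters extends App {
  // Lock-free access through an atomic variable.
  val hits = new AtomicLong(0)
  hits.incrementAndGet()

  // The same counter protected by a memory transaction.
  val txHits = Ref(0L)
  atomic { implicit txn =>
    txHits() = txHits() + 1
  }

  val txValue = atomic { implicit txn => txHits() }
  println(s"atomic = ${hits.get}, stm = $txValue")
}

With that concrete picture in mind, let's continue the comparison.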
Actors do not have special constructs for modeling shared data and protecting access to it, because data is encapsulated within separate actors and accessed serially only by the actor that owns it. Having classified concurrency libraries according to how they model shared data and express concurrency, we present a summary of what different concurrency libraries are good for: The classical JVM concurrency model uses threads, the synchronized statement, volatile variables, and atomic primitives for low-level tasks. Uses include implementing a custom concurrency utility, a concurrent data structure, or a concurrency framework optimized for specific tasks. Futures and promises are best suited for referring to concurrent computations that produce a single result value. Futures model latency in the program, and allow composing values that become available later during the execution of the program. Uses include performing remote network requests and waiting for replies, referring to the result of an asynchronous long-running computation, or reacting to the completion of an I/O operation. Futures are usually the glue of a concurrent application, binding the different parts of a concurrent program together. We often use futures to convert single-event callback APIs into a standardized representation based on the Future type. Parallel collections are best suited for efficiently executing data-parallel operations on large datasets. Usages include file searching, text processing, linear algebra applications, numerical computations, and simulations. Long-running Scala collection operations are usually good candidates for parallelization. Reactive extensions are used to express asynchronous event-based programs. Unlike parallel collections, in reactive extensions, data elements are not available when the operation starts, but arrive while the application is running. Uses include converting callback-based APIs, modeling events in user interfaces, modeling events external to the application, manipulating program events with collection-style combinators, streaming data from input devices or remote locations, or incrementally propagating changes in the data model throughout the program. Use STM to protect program data from getting corrupted by concurrent accesses. An STM allows building complex data models and accessing them with the reduced risk of deadlocks and race conditions. A typical use is to protect concurrently accessible data, while retaining good scalability between threads whose accesses to data do not overlap. Actors are suitable for encapsulating concurrently accessible data, and seamlessly building distributed systems. Actor frameworks provide a natural way to express concurrent tasks that communicate by explicitly sending messages. Uses include serializing concurrent access to data to prevent corruption, expressing stateful concurrency units in the system, and building distributed applications like trading systems, P2P networks, communication hubs, or data mining frameworks. Advocates of specific programming languages, libraries, or frameworks might try to convince you that their technology is the best for any task and any situation, often with the intent of selling it. Richard Stallman once said how computer science is the only industry more fashion-driven than women's fashion. As engineers, we need to know better than to succumb to programming fashion and marketing propaganda. 
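As a tiny illustration of the guidance above, the following hedged Scala sketch uses a future for a single asynchronous result and a parallel collection for a bulk data-parallel operation; the computations themselves are arbitrary placeholders:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PickingTools extends App {
  // A future models a single value that becomes available later.
  val sum: Future[Int] = Future { (1 to 1000).sum }
  sum.onComplete(result => println(s"asynchronous sum: $result"))

  // A parallel collection runs one bulk operation over many elements.
  val squares = (1 to 1000000).par.map(x => x.toLong * x).sum
  println(s"data-parallel sum of squares: $squares")

  Await.ready(sum, 5.seconds)
}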
Different frameworks are tailored towards specific use cases, and the correct way to choose a technology is to carefully weigh its advantages and disadvantages when applied to a specific situation. There is no one-size-fits-all technology. Use your own best judgment when deciding which concurrency framework to use for a specific programming task. Sometimes, choosing the best-suited concurrency utility is easier said than done. It takes a great deal of experience to choose the correct technology. In many cases, we do not even know enough about the requirements of the system to make an informed decision. Regardless, a good rule of thumb is to apply several concurrency frameworks to different parts of the same application, each best suited for a specific task. Often, the real power of different concurrency frameworks becomes apparent when they are used together. This is the topic of the next section. Putting it all together – a remote file browser In this section, we use our knowledge about different concurrency frameworks to build a remote file browser. This larger application example illustrates how different concurrency libraries work together, and how to apply them to different situations. We will name our remote file browser ScalaFTP. The ScalaFTP browser is divided into two main components: the server and the client process. The server process will run on the machine whose filesystem we want to manipulate. The client will run on our own computer, and comprise of a graphical user interface used to navigate the remote filesystem. To keep things simple, the protocol that the client and the server will use to communicate will not really be FTP, but a custom communication protocol. By choosing the correct concurrency libraries to implement different parts of ScalaFTP, we will ensure that the complete ScalaFTP implementation fits inside just 500 lines of code. Specifically, the ScalaFTP browser will implement the following features: Displaying the names of the files and the directories in a remote filesystem, and allow navigating through the directory structure Copying files between directories in a remote filesystem Deleting files in a remote filesystem To implement separate pieces of this functionality, we will divide the ScalaFTP server and client programs into layers. The task of the server program is to answer to incoming copy and delete requests, and to answer queries about the contents of specific directories. To make sure that its view of the filesystem is consistent, the server will cache the directory structure of the filesystem. We divide the server program into two layers: the filesystem API and the server interface. The filesystem API will expose the data model of the server program, and define useful utility methods to manipulate the filesystem. The server interface will receive requests and send responses back to the client. Since the server interface will require communicating with the remote client, we decide to use the Akka actor framework. Akka comes with remote communication facilities. The contents of the filesystem, that is, its state, will change over time. We are therefore interested in choosing proper constructs for data access. In the filesystem API, we can use object monitors and locking to synchronize access to shared state, but we will avoid these due to the risk of deadlocks. We similarly avoid using atomic variables, because they are prone to race conditions. 
We could encapsulate the filesystem state within an actor, but note that this can lead to a scalability bottleneck: an actor would serialize all accesses to the filesystem state. Therefore, we decide to use the ScalaSTM framework to model the filesystem contents. An STM avoids the risk of deadlocks and race conditions, and ensures good horizontal scalability. The task of the client program will be to graphically present the contents of the remote filesystem, and to communicate with the server. We divide the client program into three layers of functionality. The GUI layer will render the contents of the remote filesystem and register user requests such as button clicks. The client API will replicate the server interface on the client side and communicate with the server. We will use Akka to communicate with the server, but expose the results of remote operations as futures. Finally, the client logic will be a gluing layer, which binds the GUI and the client API together. The architecture of the ScalaFTP browser is illustrated in the following diagram, in which we indicate which concurrency libraries will be used by the separate layers. The dashed line represents the communication path between the client and the server. We now start by implementing the ScalaFTP server, relying on the bottom-up design approach. In the next section, we will describe the internals of the filesystem API.

Modeling the filesystem

We used atomic variables and concurrent collections to implement a non-blocking, thread-safe filesystem API, which allowed copying files and retrieving snapshots of the filesystem. In this section, we repeat this task using STM. We will see that it is much more intuitive and less error-prone to use an STM. We start by defining the different states that a file can be in. A file can be currently created, in the idle state, being copied, or being deleted. We model this with a sealed State trait and its four cases:

sealed trait State
case object Created extends State
case object Idle extends State
case class Copying(n: Int) extends State
case object Deleted extends State

A file can only be deleted if it is in the idle state, and it can only be copied if it is in the idle state or in the copying state. Since a file can be copied to multiple destinations at a time, the Copying state encodes how many copies are currently under way. We add the methods inc and dec to the State trait, which return a new state with one more or one fewer copy, respectively. For example, the implementation of inc and dec for the Copying state is as follows:

def inc: State = Copying(n + 1)
def dec: State = if (n > 1) Copying(n - 1) else Idle

Similar to the File class in the java.io package, we represent both files and directories with the same entity, and refer to them more generally as files. Each file is represented by the FileInfo class, which encodes the path of the file, its name, its parent directory, the date of the last modification, a Boolean value denoting whether the file is a directory, the size of the file, and its State object. The FileInfo class is immutable, and updating the state of the file requires creating a fresh FileInfo object:

case class FileInfo(path: String, name: String,
  parent: String, modified: String, isDir: Boolean,
  size: Long, state: State)

We separately define the factory methods apply and creating that take a File object and return a FileInfo object in the Idle or Created state, respectively.
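Returning briefly to the State trait: the excerpt above shows inc and dec only for the Copying state. The following is one plausible way to complete the hierarchy, under the assumption that incrementing an Idle file starts its first copy and that the remaining transitions fail fast. This is our own illustrative completion, not necessarily the book's exact code:

sealed trait State {
  def inc: State
  def dec: State
}
case object Created extends State {
  def inc: State = sys.error("File is still being created")
  def dec: State = sys.error("No copies in progress")
}
case object Idle extends State {
  def inc: State = Copying(1)          // the first copy starts
  def dec: State = sys.error("No copies in progress")
}
case class Copying(n: Int) extends State {
  def inc: State = Copying(n + 1)      // one more copy under way
  def dec: State = if (n > 1) Copying(n - 1) else Idle
}
case object Deleted extends State {
  def inc: State = sys.error("Cannot copy a deleted file")
  def dec: State = sys.error("Cannot modify a deleted file")
}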
Depending on where the server is started, the root of the ScalaFTP directory structure is a different subdirectory in the actual filesystem. A FileSystem object tracks the files in the given rootpath directory, using a transactional map called files:

class FileSystem(val rootpath: String) {
  val files = TMap[String, FileInfo]()
}

We introduce a separate init method to initialize the FileSystem object. The init method starts a transaction, clears the contents of the files map, and traverses the files and directories under rootpath using the Apache Commons IO library. For each file and directory, the init method creates a FileInfo object and adds it to the files map, using its path as the key:

def init() = atomic { implicit txn =>
  files.clear()
  val rootDir = new File(rootpath)
  val all = TrueFileFilter.INSTANCE
  val fileIterator =
    FileUtils.iterateFilesAndDirs(rootDir, all, all).asScala
  for (file <- fileIterator) {
    val info = FileInfo(file)
    files(info.path) = info
  }
}

Recall that the ScalaFTP browser must display the contents of the remote filesystem. To enable directory queries, we first add the getFileList method to the FileSystem class, which retrieves the files in the specified dir directory. The getFileList method starts a transaction and filters the files whose direct parent is equal to dir:

def getFileList(dir: String): Map[String, FileInfo] =
  atomic { implicit txn =>
    files.filter(_._2.parent == dir)
  }

We implement the copying logic in the filesystem API with the copyFile method. This method takes a path to the src source file and the dest destination file, and starts a transaction. After checking that the dest destination file does not already exist, the copyFile method inspects the state of the source file entry, and fails unless the state is Idle or Copying. It then calls inc to create a new state with an increased copy count, and updates the source file entry in the files map with the new state. Similarly, the copyFile method creates a new entry for the destination file in the files map. Finally, the copyFile method registers an afterCommit handler that physically copies the file to disk after the transaction completes. Recall that it is not legal to execute side-effecting operations from within the transaction body, so the private copyOnDisk method is called only after the transaction commits:

def copyFile(src: String, dest: String) = atomic { implicit txn =>
  val srcfile = new File(src)
  val destfile = new File(dest)
  val info = files(src)
  if (files.contains(dest)) sys.error(s"Destination exists.")
  info.state match {
    case Idle | Copying(_) =>
      files(src) = info.copy(state = info.state.inc)
      files(dest) = FileInfo.creating(destfile, info.size)
      Txn.afterCommit { _ => copyOnDisk(srcfile, destfile) }
      src
  }
}

The copyOnDisk method calls the copyFile method of the FileUtils class from the Apache Commons IO library. After the file transfer completes, the copyOnDisk method starts another transaction, in which it decreases the copy count of the source file and sets the state of the destination file to Idle:

private def copyOnDisk(srcfile: File, destfile: File) = {
  FileUtils.copyFile(srcfile, destfile)
  atomic { implicit txn =>
    val ninfo = files(srcfile.getPath)
    files(srcfile.getPath) = ninfo.copy(state = ninfo.state.dec)
    files(destfile.getPath) = FileInfo(destfile)
  }
}

The deleteFile method deletes a file in a similar way.
It changes the file state to Deleted, deletes the file, and starts another transaction to remove the file entry: def deleteFile(srcpath: String): String = atomic { implicit txn =>val info = files(srcpath)info.state match {case Idle =>files(srcpath) = info.copy(state = Deleted)Txn.afterCommit { _ =>FileUtils.forceDelete(info.toFile)files.single.remove(srcpath)}srcpath}} Modeling the server data model with the STM allows seamlessly adding different concurrent computations to the server program. In the next section, we will implement a server actor that uses the server API to execute filesystem operations. Use STM to model concurrently accessible data, as an STM works transparently with most concurrency frameworks. Having completed the filesystem API, we now proceed to the server interface layer of the ScalaFTP browser. The Server interface The server interface comprises of a single actor called FTPServerActor. This actor will receive client requests and respond to them serially. If it turns out that the server actor is the sequential bottleneck of the system, we can simply add additional server interface actors to improve horizontal scalability. We start by defining the different types of messages that the server actor can receive. We follow the convention of defining them inside the companion object of the FTPServerActor class: object FTPServerActor {sealed trait Commandcase class GetFileList(dir: String) extends Commandcase class CopyFile(src: String, dest: String) extends Commandcase class DeleteFile(path: String) extends Commanddef apply(fs: FileSystem) = Props(classOf[FTPServerActor], fs)} The actor template of the server actor takes a FileSystem object as a parameter. It reacts to the GetFileList, CopyFile, and DeleteFile messages by calling the appropriate methods from the filesystem API: class FTPServerActor(fileSystem: FileSystem) extends Actor {val log = Logging(context.system, this)def receive = {case GetFileList(dir) =>val filesMap = fileSystem.getFileList(dir)val files = filesMap.map(_._2).to[Seq]sender ! filescase CopyFile(srcpath, destpath) =>Future {Try(fileSystem.copyFile(srcpath, destpath))} pipeTo sendercase DeleteFile(path) =>Future {Try(fileSystem.deleteFile(path))} pipeTo sender}} When the server receives a GetFileList message, it calls the getFileList method with the specified dir directory, and sends a sequence collection with the FileInfo objects back to the client. Since FileInfo is a case class, it extends the Serializable interface, and its instances can be sent over the network. When the server receives a CopyFile or DeleteFile message, it calls the appropriate filesystem method asynchronously. The methods in the filesystem API throw exceptions when something goes wrong, so we need to wrap calls to them in Try objects. After the asynchronous file operations complete, the resulting Try objects are piped back as messages to the sender actor, using the Akka pipeTo method. To start the ScalaFTP server, we need to instantiate and initialize a FileSystem object, and start the server actor. We parse the network port command-line argument, and use it to create an actor system that is capable of remote communication. For this, we use the remotingSystem factory method that we introduced. The remoting actor system then creates an instance of the FTPServerActor. 
This is shown in the following program: object FTPServer extends App {val fileSystem = new FileSystem(".")fileSystem.init()val port = args(0).toIntval actorSystem = ch8.remotingSystem("FTPServerSystem", port)actorSystem.actorOf(FTPServerActor(fileSystem), "server")} The ScalaFTP server actor can run inside the same process as the client application, in another process in the same machine, or on a different machine connected with a network. The advantage of the actor model is that we usually need not worry about where the actor runs until we integrate it into the entire application. When you need to implement a distributed application that runs on different machines, use an actor framework. Our server program is now complete, and we can run it with the run command from SBT. We set the actor system to use the port 12345: run 12345 In the next section, we will implement the file navigation API for the ScalaFTP client, which will communicate with the server interface over the network. Client navigation API The client API exposes the server interfaces to the client program through asynchronous methods that return future objects. Unlike the server's filesystem API, which runs locally, the client API methods execute remote network requests. Futures are a natural way to model latency in the client API methods, and to avoid blocking during the network requests. Internally, the client API maintains an actor instance that communicates with the server actor. The client actor does not know the actor reference of the server actor when it is created. For this reason, the client actor starts in an unconnected state. When it receives the Start message with the URL of the server actor system, the client constructs an actor path to the server actor, sends out an Identify message, and switches to the connecting state. If the actor system is able to find the server actor, the client actor eventually receives the ActorIdentity message with the server actor reference. In this case, the client actor switches to the connected state, and is able to forward commands to the server. Otherwise, the connection fails and the client actor reverts to the unconnected state. The state diagram of the client actor is shown in the following figure: We define the Start message in the client actor's companion object: object FTPClientActor {case class Start(host: String)} We then define the FTPClientActor class and give it an implicit Timeout parameter. The Timeout parameter will be used later in the Akka ask pattern, when forwarding client requests to the server actor. The stub of the FTPClientActor class is as follows: class FTPClientActor(implicit val timeout: Timeout)extends Actor Before defining the receive method, we define behaviors corresponding to different actor states. Once the client actor in the unconnected state receives the Start message with the host string, it constructs an actor path to the server, and creates an actor selection object. The client actor then sends the Identify message to the actor selection, and switches its behavior to connecting. This is shown in the following behavior method, named unconnected: def unconnected: Actor.Receive = {case Start(host) =>val serverActorPath =s"akka.tcp://FTPServerSystem@$host/user/server"val serverActorSel = context.actorSelection(serverActorPath)serverActorSel ! Identify(())context.become(connecting(sender))} The connecting method creates a behavior given an actor reference to the sender of the Start message. 
We call this actor reference clientApp, because the ScalaFTP client application will send the Start message to the client actor. Once the client actor receives an ActorIdentity message with the ref reference to the server actor, it can send true back to the clientApp reference, indicating that the connection was successful. In this case, the client actor switches to the connected behavior. Otherwise, if the client actor receives an ActorIdentity message without the server reference, the client actor sends false back to the application, and reverts to the unconnected state: def connecting(clientApp: ActorRef): Actor.Receive = {case ActorIdentity(_, Some(ref)) =>clientApp ! truecontext.become(connected(ref))case ActorIdentity(_, None) =>clientApp ! falsecontext.become(unconnected)} The connected state uses the serverActor server actor reference to forward the Command messages. To do so, the client actor uses the Akka ask pattern, which returns a future object with the server's response. The contents of the future are piped back to the original sender of the Command message. In this way, the client actor serves as an intermediary between the application, which is the sender, and the server actor. The connected method is shown in the following code snippet: def connected(serverActor: ActorRef): Actor.Receive = {case command: Command =>(serverActor ? command).pipeTo(sender)} Finally, the receive method returns the unconnected behavior, in which the client actor is created: def receive = unconnected Having implemented the client actor, we can proceed to the client API layer. We model it as a trait with a connected value, the concrete methods getFileList, copyFile, and deleteFile, and an abstract host method. The client API creates a private remoting actor system and a client actor. It then instantiates the connected future that computes the connection status by sending a Start message to the client actor. The methods getFileList, copyFile, and deleteFile are similar. They use the ask pattern on the client actor to obtain a future with the response. Recall that the actor messages are not typed, and the ask pattern returns a Future[Any] object. For this reason, each method in the client API uses the mapTo future combinator to restore the type of the message: trait FTPClientApi {implicit val timeout: Timeout = Timeout(4 seconds)private val props = Props(classOf[FTPClientActor], timeout)private val system = ch8.remotingSystem("FTPClientSystem", 0)private val clientActor = system.actorOf(props)def host: Stringval connected: Future[Boolean] = {val f = clientActor ? FTPClientActor.Startf.mapTo[Boolean]}def getFileList(d: String): Future[(String, Seq[FileInfo])] = {val f = clientActor ? FTPServerActor.GetFileList(d)f.mapTo[Seq[FileInfo]].map(fs => (d, fs))}def copyFile(src: String, dest: String): Future[String] = {val f = clientActor ? FTPServerActor.CopyFile(src, dest)f.mapTo[Try[String]].map(_.get)}def deleteFile(srcpath: String): Future[String] = {val f = clientActor ? FTPServerActor.DeleteFile(srcpath)f.mapTo[Try[String]].map(_.get)}} Note that the client API does not expose the fact that it uses actors for remote communication. Moreover, the client API is similar to the server API, but the return types of the methods are futures instead of normal values. Futures encode the latency of a method without exposing the cause for the latency, so we often find them at the boundaries between different APIs. 
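As a rough illustration of how a caller might consume this future-based API, the following sketch of ours connects to the server and lists a remote directory. The FTPClientApi trait, connected, and getFileList come from the code above; the ClientDemo name, the host value, and the for-comprehension are our own assumptions, and the sketch presumes that the earlier definitions are on the classpath and that the server from the previous section is running locally on port 12345:

import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Success, Failure}

object ClientDemo extends App with FTPClientApi {
  def host = "127.0.0.1:12345"

  // Wait for a successful connection, then ask for the root listing.
  val listing = for {
    ok <- connected if ok
    (dir, files) <- getFileList(".")
  } yield s"$dir contains ${files.map(_.name).mkString(", ")}"

  listing.onComplete {
    case Success(line) => println(line)
    case Failure(t)    => println(s"Could not list directory: $t")
  }
}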
We can internally replace the actor communication between the client and the server with the remote Observable objects, but that would not change the client API. In a concurrent application, use futures at the boundaries of the layers to express latency. Now that we can programmatically communicate with the remote ScalaFTP server, we turn our attention to the user interface of the client program. Summary This article summarized the different concurrency libraries introduced to us. In this article, you learned how to choose the correct concurrent abstraction to solve a given problem. We learned to combine different concurrency abstractions together when designing larger concurrent applications. Resources for Article: Further resources on this subject: Creating Java EE Applications [Article] Differences in style between Java and Scala code [Article] Integrating Scala, Groovy, and Flex Development with Apache Maven [Article]

Creating an Apache JMeter™ test workbench

Packt
25 Nov 2014
7 min read
This article is written by Colin Henderson, the author of Mastering GeoServer. This article will give you a brief introduction about how to create an Apache JMeter™ test workbench. (For more resources related to this topic, see here.) Before we can get into the nitty-gritty of creating a test workbench for Apache JMeter™, we must download and install it. Apache JMeter™ is a 100 percent Java application, which means that it will run on any platform provided there is a Java 6 or higher runtime environment present. The binaries can be downloaded from http://jmeter.apache.org/download_jmeter.cgi, and at the time of writing, the latest version is 2.11. No installation is required; just download the ZIP file and decompress it to a location you can access from a command-line prompt or shell environment. To launch JMeter on Linux, simply open shell and enter the following command: $ cd <path_to_jmeter>/bin$ ./jmeter To launch JMeter on Windows, simply open a command prompt and enter the following command: C:> cd <path_to_jmeter>\binC:> jmeter After a short time, JMeter GUI should appear, where we can construct our test plan. For ease and convenience, consider setting your system's PATH environment variable to the location of the JMeter bin directory. In future, you will be able to launch JMeter from the command line without having to CD first. The JMeter workbench will open with an empty configuration ready for us to construct our test strategy: The first thing we need to do is give our test plan a name; for now, let's call it GeoServer Stress Test. We can also provide some comments, which is good practice as it will help us remember for what reason we devised the test plan in future. To demonstrate the use of JMeter, we will create a very simple test plan. In this test plan, we will simulate a certain number of users hitting our GeoServer concurrently and requesting maps. To set this up, we first need to add Thread Group to our test plan. In a JMeter test, a thread is equivalent to a user: In the left-hand side menu, we need to right-click on the GeoServer Stress Test node and choose the Add | Threads (Users) | Thread Group menu option. This will add a child node to the test plan that we right-clicked on. The right-hand side panel provides options that we can set for the thread group to control how the user requests are executed. For example, we can name it something meaningful, such as Web Map Requests. In this test, we will simulate 30 users, making map requests over a total duration of 10 minutes, with a 10-second delay between each user starting. The number of users is set by entering a value for Number of Threads; in this case, 30. The Ramp-Up Period option controls the delay in starting each user by specifying the duration in which all the threads must start. So, in our case, we enter a duration of 300 seconds, which means all 30 users will be started by the end of 300 seconds. This equates to a 10-second delay between starting threads (300 / 30 = 10). Finally, we will set a duration for the test to run over by ticking the box for Scheduler, and then specifying a value of 600 seconds for Duration. By specifying a duration value, we override the End Time setting. Next, we need to provide some basic configuration elements for our test. First, we need to set the default parameters for all web requests. Right-click on the Web Map Requests thread group node that we just created, and then navigate to Add | Config Element | User Defined Variables. 
This will add a new node in which we can specify the default HTTP request parameters for our test: In the right-hand side panel, we can specify any number of variables. We can use these as replacement tokens later when we configure the web requests that will be sent during our test run. In this panel, we specify all the standard WMS query parameters that we don't anticipate changing across requests. Taking this approach is a good practice as it means that we can create a mix of tests using the same values, so if we change one, we don't have to change all the different test elements. To execute requests, we need to add Logic Controller. JMeter contains a lot of different logic controllers, but in this instance, we will use Simple Controller to execute a request. To add the controller, right-click on the Web Map Requests node and navigate to Add | Logic Controller | Simple Controller. A simple controller does not require any configuration; it is merely a container for activities we want to execute. In our case, we want the controller to read some data from our CSV file, and then execute an HTTP request to WMS. To do this, we need to add a CSV dataset configuration. Right-click on the Simple Controller node and navigate to Add | Config Element | CSV Data Set Config. The settings for the CSV data are pretty straightforward. The filename is set to the file that we generated previously, containing the random WMS request properties. The path can be specified as relative or absolute. The Variable Names property is where we specify the structure of the CSV file. The Recycle on EOF option is important as it means that the CSV file will be re-read when the end of the file is reached. Finally, we need to set Sharing mode to All threads to ensure the data can be used across threads. Next, we need to add a delay to our requests to simulate user activity; in this case, we will introduce a small delay of 5 seconds to simulate a user performing a map-pan operation. Right-click on the Simple Controller node, and then navigate to Add | Timer | Constant Timer: Simply specify the value we want the thread to be paused for in milliseconds. Finally, we need to add a JMeter sampler, which is the unit that will actually perform the HTTP request. Right-click on the Simple Controller node and navigate to Add | Sampler | HTTP Request. This will add an HTTP Request sampler to the test plan: There is a lot of information that goes into this panel; however, all it does is construct an HTTP request that the thread will execute. We specify the server name or IP address along with the HTTP method to use. The important part of this panel is the Parameters tab, which is where we need to specify all the WMS request parameters. Notice that we used the tokens that we specified in the CSV Data Set Config and WMS Request Defaults configuration components. We use the ${token_name} token, and JMeter replaces the token with the appropriate value of the referenced variable. We configured our test plan, but before we execute it, we need to add some listeners to the plan. A JMeter listener is the component that will gather the information from all of the test runs that occur. We add listeners by right-clicking on the thread group node and then navigating to the Add | Listeners menu option. A list of available listeners is displayed, and we can select the one we want to add. For our purposes, we will add the Graph Results, Generate Summary Results, Summary Report, and Response Time Graph listeners. 
Each listener can have its output saved to a datafile for later review. When completed, our test plan structure should look like the following: Before executing the plan, we should save it for use later. Summary In this article, we looked at how Apache JMeter™ can be used to construct and execute test plans to place loads on our servers so that we can analyze the results and gain an understanding of how well our servers perform. Resources for Article: Further resources on this subject: Geo-Spatial Data in Python: Working with Geometry [article] Working with Geo-Spatial Data in Python [article] Getting Started with GeoServer [article]

Decoupling Units with unittest.mock

Packt
24 Nov 2014
27 min read
In this article by Daniel Arbuckle, author of the book Learning Python Testing, you'll learn how by using the unittest.mock package, you can easily perform the following: Replace functions and objects in your own code or in external packages. Control how replacement objects behave. You can control what return values they provide, whether they raise an exception, even whether they make any calls to other functions, or create instances of other objects. Check whether the replacement objects were used as you expected: whether functions or methods were called the correct number of times, whether the calls occurred in the correct order, and whether the passed parameters were correct. (For more resources related to this topic, see here.) Mock objects in general All right, before we get down to the nuts and bolts of unittest.mock, let's spend a few moments talking about mock objects overall. Broadly speaking, mock objects are any objects that you can use as substitutes in your test code, to keep your tests from overlapping and your tested code from infiltrating the wrong tests. However, like most things in programming, the idea works better when it has been formalized into a well-designed library that you can call on when you need it. There are many such libraries available for most programming languages. Over time, the authors of mock object libraries have developed two major design patterns for mock objects: in one pattern, you can create a mock object and perform all of the expected operations on it. The object records these operations, and then you put the object into playback mode and pass it to your code. If your code fails to duplicate the expected operations, the mock object reports a failure. In the second pattern, you can create a mock object, do the minimal necessary configuration to allow it to mimic the real object it replaces, and pass it to your code. It records how the code uses it, and then you can perform assertions after the fact to check whether your code used the object as expected. The second pattern is slightly more capable in terms of the tests that you can write using it but, overall, either pattern works well. Mock objects according to unittest.mock Python has several mock object libraries; as of Python 3.3, however, one of them has been crowned as a member of the standard library. Naturally that's the one we're going to focus on. That library is, of course, unittest.mock. The unittest.mock library is of the second sort, a record-actual-use-and-then-assert library. The library contains several different kinds of mock objects that, between them, let you mock almost anything that exists in Python. Additionally, the library contains several useful helpers that simplify assorted tasks related to mock objects, such as temporarily replacing real objects with mocks. Standard mock objects The basic element of unittest.mock is the unittest.mock.Mock class. Even without being configured at all, Mock instances can do a pretty good job of pretending to be some other object, method, or function. There are many mock object libraries for Python; so, strictly speaking, the phrase "mock object" could mean any object that was created by any of these libraries. Mock objects can pull off this impersonation because of a clever, somewhat recursive trick. When you access an unknown attribute of a mock object, instead of raising an AttributeError exception, the mock object creates a child mock object and returns that. 
Since mock objects are pretty good at impersonating other objects, returning a mock object instead of the real value works at least in the common case. Similarly, mock objects are callable; when you call a mock object as a function or method, it records the parameters of the call and then, by default, returns a child mock object. A child mock object is a mock object in its own right, but it knows that it's connected to the mock object it came from—its parent. Anything you do to the child is also recorded in the parent's memory. When the time comes to check whether the mock objects were used correctly, you can use the parent object to check on all of its descendants. Example: Playing with mock objects in the interactive shell (try it for yourself!): $ python3.4 Python 3.4.0 (default, Apr 2 2014, 08:10:08) [GCC 4.8.2] on linux Type "help", "copyright", "credits" or "license" for more information. >>> from unittest.mock import Mock, call >>> mock = Mock() >>> mock.x <Mock name='mock.x' id='140145643647832'> >>> mock.x <Mock name='mock.x' id='140145643647832'> >>> mock.x('Foo', 3, 14) <Mock name='mock.x()' id='140145643690640'> >>> mock.x('Foo', 3, 14) <Mock name='mock.x()' id='140145643690640'> >>> mock.x('Foo', 99, 12) <Mock name='mock.x()' id='140145643690640'> >>> mock.y(mock.x('Foo', 1, 1)) <Mock name='mock.y()' id='140145643534320'> >>> mock.method_calls [call.x('Foo', 3, 14), call.x('Foo', 3, 14), call.x('Foo', 99, 12), call.x('Foo', 1, 1), call.y(<Mock name='mock.x()' id='140145643690640'>)] >>> mock.assert_has_calls([call.x('Foo', 1, 1)]) >>> mock.assert_has_calls([call.x('Foo', 1, 1), call.x('Foo', 99, 12)]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib64/python3.4/unittest/mock.py", line 792, in assert_has_ calls ) from cause AssertionError: Calls not found. Expected: [call.x('Foo', 1, 1), call.x('Foo', 99, 12)] Actual: [call.x('Foo', 3, 14), call.x('Foo', 3, 14), call.x('Foo', 99, 12), call.x('Foo', 1, 1), call.y(<Mock name='mock.x()' id='140145643690640'>)] >>> mock.assert_has_calls([call.x('Foo', 1, 1), ... call.x('Foo', 99, 12)], any_order = True) >>> mock.assert_has_calls([call.y(mock.x.return_value)]) There are several important things demonstrated in this interactive session. First, notice that the same mock object was returned each time that we accessed mock.x. This always holds true: if you access the same attribute of a mock object, you'll get the same mock object back as the result. The next thing to notice might seem more surprising. Whenever you call a mock object, you get the same mock object back as the return value. The returned mock isn't made new for every call, nor is it unique for each combination of parameters. We'll see how to override the return value shortly but, by default, you get the same mock object back every time you call a mock object. This mock object can be accessed using the return_value attribute name, as you might have noticed from the last statement of the example. The unittest.mock package contains a call object that helps to make it easier to check whether the correct calls have been made. The call object is callable, and takes note of its parameters in a way similar to mock objects, making it easy to compare it to a mock object's call history. However, the call object really shines when you have to check for calls to descendant mock objects. 
As you can see in the previous example, while call('Foo', 1, 1) will match a call to the parent mock object, if the call used these parameters, call.x('Foo', 1, 1), it matches a call to the child mock object named x. You can build up a long chain of lookups and invocations. For example: >>> mock.z.hello(23).stuff.howdy('a', 'b', 'c') <Mock name='mock.z.hello().stuff.howdy()' id='140145643535328'> >>> mock.assert_has_calls([ ... call.z.hello().stuff.howdy('a', 'b', 'c') ... ]) >>> Notice that the original invocation included hello(23), but the call specification wrote it simply as hello(). Each call specification is only concerned with the parameters of the object that was finally called after all of the lookups. The parameters of intermediate calls are not considered. That's okay because they always produce the same return value anyway unless you've overridden that behavior, in which case they probably don't produce a mock object at all. You might not have encountered an assertion before. Assertions have one job, and one job only: they raise an exception if something is not as expected. The assert_has_calls method, in particular, raises an exception if the mock object's history does not include the specified calls. In our example, the call history matches, so the assertion method doesn't do anything visible. You can check whether the intermediate calls were made with the correct parameters, though, because the mock object recorded a call immediately to mock.z.hello(23) before it recorded a call to mock.z.hello().stuff.howdy('a', 'b', 'c'): >>> mock.mock_calls.index(call.z.hello(23)) 6 >>> mock.mock_calls.index(call.z.hello().stuff.howdy('a', 'b', 'c')) 7 This also points out the mock_calls attribute that all mock objects carry. If the various assertion functions don't quite do the trick for you, you can always write your own functions that inspect the mock_calls list and check whether things are or are not as they should be. We'll discuss the mock object assertion methods shortly. Non-mock attributes What if you want a mock object to give back something other than a child mock object when you look up an attribute? It's easy; just assign a value to that attribute: >>> mock.q = 5 >>> mock.q 5 There's one other common case where mock objects' default behavior is wrong: what if accessing a particular attribute is supposed to raise an AttributeError? Fortunately, that's easy too: >>> del mock.w >>> mock.w Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib64/python3.4/unittest/mock.py", line 563, in __getattr__ raise AttributeError(name) AttributeError: w Non-mock return values and raising exceptions Sometimes, actually fairly often, you'll want mock objects posing as functions or methods to return a specific value, or a series of specific values, rather than returning another mock object. 
To make a mock object always return the same value, just change the return_value attribute: >>> mock.o.return_value = 'Hi' >>> mock.o() 'Hi' >>> mock.o('Howdy') 'Hi' If you want the mock object to return different value each time it's called, you need to assign an iterable of return values to the side_effect attribute instead, as follows: >>> mock.p.side_effect = [1, 2, 3] >>> mock.p() 1 >>> mock.p() 2 >>> mock.p() 3 >>> mock.p() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib64/python3.4/unittest/mock.py", line 885, in __call__ return _mock_self._mock_call(*args, **kwargs) File "/usr/lib64/python3.4/unittest/mock.py", line 944, in _mock_call result = next(effect) StopIteration If you don't want your mock object to raise a StopIteration exception, you need to make sure to give it enough return values for all of the invocations in your test. If you don't know how many times it will be invoked, an infinite iterator such as itertools.count might be what you need. This is easily done: >>> mock.p.side_effect = itertools.count() If you want your mock to raise an exception instead of returning a value, just assign the exception object to side_effect, or put it into the iterable that you assign to side_effect: >>> mock.e.side_effect = [1, ValueError('x')] >>> mock.e() 1 >>> mock.e() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib64/python3.4/unittest/mock.py", line 885, in __call__ return _mock_self._mock_call(*args, **kwargs) File "/usr/lib64/python3.4/unittest/mock.py", line 946, in _mock_call raise result ValueError: x The side_effect attribute has another use, as well that we'll talk about. Mocking class or function details Sometimes, the generic behavior of mock objects isn't a close enough emulation of the object being replaced. This is particularly the case when it's important that they raise exceptions when used improperly, since mock objects are usually happy to accept any usage. The unittest.mock package addresses this problem using a technique called speccing. If you pass an object into unittest.mock.create_autospec, the returned value will be a mock object, but it will do its best to pretend that it's the same object you passed into create_autospec. This means that it will: Raise an AttributeError if you attempt to access an attribute that the original object doesn't have, unless you first explicitly assign a value to that attribute Raise a TypeError if you attempt to call the mock object when the original object wasn't callable Raise a TypeError if you pass the wrong number of parameters or pass a keyword parameter that isn't viable if the original object was callable Trick isinstance into thinking that the mock object is of the original object's type Mock objects made by create_autospec share this trait with all of their children as well, which is usually what you want. If you really just want a specific mock to be specced, while its children are not, you can pass the template object into the Mock constructor using the spec keyword. 
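As a quick sketch of the spec keyword in the interactive shell (our own example, with object ids and traceback details abridged):

>>> from unittest.mock import Mock
>>> m = Mock(spec=Exception('Bad', 'Wolf'))
>>> isinstance(m, Exception)
True
>>> m.args
<Mock name='mock.args' id='...'>
>>> m.no_such_attribute
Traceback (most recent call last):
...
AttributeError: Mock object has no attribute 'no_such_attribute'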
Here's a short demonstration of using create_autospec: >>> from unittest.mock import create_autospec >>> x = Exception('Bad', 'Wolf') >>> y = create_autospec(x) >>> isinstance(y, Exception) True >>> y <NonCallableMagicMock spec='Exception' id='140440961099088'> Mocking function or method side effects Sometimes, for a mock object to successfully take the place of a function or method means that the mock object has to actually perform calls to other functions, or set variable values, or generally do whatever a function can do. This need is less common than you might think, and it's also somewhat dangerous for testing purposes because, when your mock objects can execute arbitrary code, there's a possibility that they stop being a simplifying tool for enforcing test isolation, and become a complex part of the problem instead. Having said that, there are still times when you need a mocked function to do something more complex than simply returning a value, and we can use the side_effect attribute of mock objects to achieve this. We've seen side_effect before, when we assigned an iterable of return values to it. If you assign a callable to side_effect, this callable will be called when the mock object is called and passed the same parameters. If the side_effect function raises an exception, this is what the mock object does as well; otherwise, the side_effect return value is returned by the mock object. In other words, if you assign a function to a mock object's side_effect attribute, this mock object in effect becomes that function with the only important difference being that the mock object still records the details of how it's used. The code in a side_effect function should be minimal, and should not try to actually do the job of the code the mock object is replacing. All it should do is perform any expected externally visible operations and then return the expected result.Mock object assertion methods As we saw in the Standard mock objects section, you can always write code that checks the mock_calls attribute of mock objects to see whether or not things are behaving as they should. However, there are some particularly common checks that have already been written for you, and are available as assertion methods of the mock objects themselves. As is normal for assertions, these assertion methods return None if they pass, and raise an AssertionError if they fail. The assert_called_with method accepts an arbitrary collection of arguments and keyword arguments, and raises an AssertionError unless these parameters were passed to the mock the last time it was called. The assert_called_once_with method behaves like assert_called_with, except that it also checks whether the mock was only called once and raises AssertionError if that is not true. The assert_any_call method accepts arbitrary arguments and keyword arguments, and raises an AssertionError if the mock object has never been called with these parameters. We've already seen the assert_has_calls method. This method accepts a list of call objects, checks whether they appear in the history in the same order, and raises an exception if they do not. Note that "in the same order" does not necessarily mean "next to each other." There can be other calls in between the listed calls as long as all of the listed calls appear in the proper sequence. This behavior changes if you assign a true value to the any_order argument. In that case, assert_has_calls doesn't care about the order of the calls, and only checks whether they all appear in the history. 
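Before moving on to the remaining helpers, here is a brief interactive session of our own showing several of these assertion methods in action (ids and the exact failure wording may differ slightly between Python versions):

>>> from unittest.mock import Mock, call
>>> mock = Mock()
>>> mock.greet('Alice')
<Mock name='mock.greet()' id='...'>
>>> mock.greet('Bob')
<Mock name='mock.greet()' id='...'>
>>> mock.greet.assert_called_with('Bob')
>>> mock.greet.assert_any_call('Alice')
>>> mock.assert_has_calls([call.greet('Alice'), call.greet('Bob')])
>>> mock.greet.assert_called_once_with('Bob')
Traceback (most recent call last):
...
AssertionError: Expected 'greet' to be called once. Called 2 times.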
The assert_not_called method raises an exception if the mock has ever been called. Mocking containers and objects with a special behavior One thing the Mock class does not handle is the so-called magic methods that underlie Python's special syntactic constructions: __getitem__, __add__, and so on. If you need your mock objects to record and respond to magic methods—in other words, if you want them to pretend to be container objects such as dictionaries or lists, or respond to mathematical operators, or act as context managers or any of the other things where syntactic sugar translates it into a method call underneath—you're going to use unittest.mock.MagicMock to create your mock objects. There are a few magic methods that are not supported even by MagicMock, due to details of how they (and mock objects) work: __getattr__, __setattr__, __init__ , __new__, __prepare__, __instancecheck__, __subclasscheck__, and __del__. Here's a simple example in which we use MagicMock to create a mock object supporting the in operator: >>> from unittest.mock import MagicMock >>> mock = MagicMock() >>> 7 in mock False >>> mock.mock_calls [call.__contains__(7)] >>> mock.__contains__.return_value = True >>> 8 in mock True >>> mock.mock_calls [call.__contains__(7), call.__contains__(8)] Things work similarly with the other magic methods. For example, addition: >>> mock + 5 <MagicMock name='mock.__add__()' id='140017311217816'> >>> mock.mock_calls [call.__contains__(7), call.__contains__(8), call.__add__(5)] Notice that the return value of the addition is a mock object, a child of the original mock object, but the in operator returned a Boolean value. Python ensures that some magic methods return a value of a particular type, and will raise an exception if that requirement is not fulfilled. In these cases, MagicMock's implementations of the methods return a best-guess value of the proper type, instead of a child mock object. There's something you need to be careful of when it comes to the in-place mathematical operators, such as += (__iadd__) and |= (__ior__), and that is the fact that MagicMock handles them somewhat strangely. What it does is still useful, but it might well catch you by surprise: >>> mock += 10 >>> mock.mock_calls [] What was that? Did it erase our call history? Fortunately, no, it didn't. What it did was assign the child mock created by the addition operation to the variable called mock. That is entirely in accordance with how the in-place math operators are supposed to work. Unfortunately, it has still cost us our ability to access the call history, since we no longer have a variable pointing at the parent mock object. Make sure that you have the parent mock object set aside in a variable that won't be reassigned, if you're going to be checking in-place math operators. Also, you should make sure that your mocked in-place operators return the result of the operation, even if that just means return self.return_value, because otherwise Python will assign None to the left-hand variable. There's another detailed way in which in-place operators work that you should keep in mind: >>> mock = MagicMock() >>> x = mock >>> x += 5 >>> x <MagicMock name='mock.__iadd__()' id='139845830142216'> >>> x += 10 >>> x <MagicMock name='mock.__iadd__().__iadd__()' id='139845830154168'> >>> mock.mock_calls [call.__iadd__(5), call.__iadd__().__iadd__(10)] Because the result of the operation is assigned to the original variable, a series of in-place math operations builds up a chain of child mock objects. 
If you think about it, that's the right thing to do, but it is rarely what people expect at first. Mock objects for properties and descriptors There's another category of things that basic Mock objects don't do a good job of emulating: descriptors. Descriptors are objects that allow you to interfere with the normal variable access mechanism. The most commonly used descriptors are created by Python's property built-in function, which simply allows you to write functions to control getting, setting, and deleting a variable. To mock a property (or other descriptor), create a unittest.mock.PropertyMock instance and assign it to the property name. The only complication is that you can't assign a descriptor to an object instance; you have to assign it to the object's type because descriptors are looked up in the type without first checking the instance. That's not hard to do with mock objects, fortunately: >>> from unittest.mock import PropertyMock >>> mock = Mock() >>> prop = PropertyMock() >>> type(mock).p = prop >>> mock.p <MagicMock name='mock()' id='139845830215328'> >>> mock.mock_calls [] >>> prop.mock_calls [call()] >>> mock.p = 6 >>> prop.mock_calls [call(), call(6)] The thing to be mindful of here is that the property is not a child of the object named mock. Because of this, we have to keep it around in its own variable because otherwise we'd have no way of accessing its history. The PropertyMock objects record variable lookup as a call with no parameters, and variable assignment as a call with the new value as a parameter. You can use a PropertyMock object if you actually need to record variable accesses in your mock object history. Usually you don't need to do that, but the option exists. Even though you set a property by assigning it to an attribute of a type, you don't have to worry about having your PropertyMock objects bleed over into other tests. Each Mock you create has its own type object, even though they all claim to be of the same class: >>> type(Mock()) is type(Mock()) False Thanks to this feature, any changes that you make to a mock object's type object are unique to that specific mock object. Mocking file objects It's likely that you'll occasionally need to replace a file object with a mock object. The unittest.mock library helps you with this by providing mock_open, which is a factory for fake open functions. These functions have the same interface as the real open function, but they return a mock object that's been configured to pretend that it's an open file object. This sounds more complicated than it is. See for yourself: >>> from unittest.mock import mock_open >>> open = mock_open(read_data = 'moose') >>> with open('/fake/file/path.txt', 'r') as f: ... print(f.read()) ... moose If you pass a string value to the read_data parameter, the mock file object that eventually gets created will use that value as the data source when its read methods get called. As of Python 3.4.0, read_data only supports string objects, not bytes. If you don't pass read_data, read method calls will return an empty string. The problem with the previous code is that it makes the real open function inaccessible, and leaves a mock object lying around where other tests might stumble over it. Read on to see how to fix these problems. Replacing real code with mock objects The unittest.mock library gives a very nice tool for temporarily replacing objects with mock objects, and then undoing the change when our test is done. This tool is unittest.mock.patch. 
There are a lot of different ways in which that patch can be used: it works as a context manager, a function decorator, and a class decorator; additionally, it can create a mock object to use for the replacement or it can use the replacement object that you specify. There are a number of other optional parameters that can further adjust the behavior of the patch. Basic usage is easy: >>> from unittest.mock import patch, mock_open >>> with patch('builtins.open', mock_open(read_data = 'moose')) as mock: ... with open('/fake/file.txt', 'r') as f: ... print(f.read()) ... moose >>> open <built-in function open> As you can see, patch dropped the mock open function created by mock_open over the top of the real open function; then, when we left the context, it replaced the original for us automatically. The first parameter of patch is the only one that is required. It is a string describing the absolute path to the object to be replaced. The path can have any number of package and subpackage names, but it must include the module name and the name of the object inside the module that is being replaced. If the path is incorrect, patch will raise an ImportError, TypeError, or AttributeError, depending on what exactly is wrong with the path. If you don't want to worry about making a mock object to be the replacement, you can just leave that parameter off: >>> import io >>> with patch('io.BytesIO'): ... x = io.BytesIO(b'ascii data') ... io.BytesIO.mock_calls [call(b'ascii data')] The patch function creates a new MagicMock for you if you don't tell it what to use for the replacement object. This usually works pretty well, but you can pass the new parameter (also the second parameter, as we used it in the first example of this section) to specify that the replacement should be a particular object; or you can pass the new_callable parameter to make patch use the value of that parameter to create the replacement object. We can also force the patch to use create_autospec to make the replacement object, by passing autospec=True: >>> with patch('io.BytesIO', autospec = True): ... io.BytesIO.melvin Traceback (most recent call last): File "<stdin>", line 2, in <module> File "/usr/lib64/python3.4/unittest/mock.py", line 557, in __getattr__ raise AttributeError("Mock object has no attribute %r" % name) AttributeError: Mock object has no attribute 'melvin' The patch function will normally refuse to replace an object that does not exist; however, if you pass it create=True, it will happily drop a mock object wherever you like. Naturally, this is not compatible with autospec=True. The patch function covers the most common cases. There are a few related functions that handle less common but still useful cases. The patch.object function does the same thing as patch, except that, instead of taking the path string, it accepts an object and an attribute name as its first two parameters. Sometimes this is more convenient than figuring out the path to an object. Many objects don't even have valid paths (for example, objects that exist only in a function local scope), although the need to patch them is rarer than you might think. The patch.dict function temporarily drops one or more objects into a dictionary under specific keys. The first parameter is the target dictionary; the second is a dictionary from which to get the key and value pairs to put into the target. If you pass clear=True, the target will be emptied before the new values are inserted. Notice that patch.dict doesn't create the replacement values for you. 
You'll need to make your own mock objects, if you want them. Mock objects in action That was a lot of theory interspersed with unrealistic examples. Let's take a look at what we've learned and apply it for a more realistic view of how these tools can help us. Better PID tests The PID tests suffered mostly from having to do a lot of extra work to patch and unpatch time.time, and had some difficulty breaking the dependence on the constructor. Patching time.time Using patch, we can remove a lot of the repetitiveness of dealing with time.time; this means that it's less likely that we'll make a mistake somewhere, and saves us from spending time on something that's kind of boring and annoying. All of the tests can benefit from similar changes: >>> from unittest.mock import Mock, patch >>> with patch('time.time', Mock(side_effect = [1.0, 2.0, 3.0, 4.0, 5.0])): ... import pid ... controller = pid.PID(P = 0.5, I = 0.5, D = 0.5, setpoint = 0, ... initial = 12) ... assert controller.gains == (0.5, 0.5, 0.5) ... assert controller.setpoint == [0.0] ... assert controller.previous_time == 1.0 ... assert controller.previous_error == -12.0 ... assert controller.integrated_error == 0.0 Apart from using patch to handle time.time, this test has been changed. We can now use assert to check whether things are correct instead of having doctest compare the values directly. There's hardly any difference between the two approaches, except that we can place the assert statements inside the context managed by patch. Decoupling from the constructor Using mock objects, we can finally separate the tests for the PID methods from the constructor, so that mistakes in the constructor cannot affect the outcome: >>> with patch('time.time', Mock(side_effect = [2.0, 3.0, 4.0, 5.0])): ... pid = imp.reload(pid) ... mock = Mock() ... mock.gains = (0.5, 0.5, 0.5) ... mock.setpoint = [0.0] ... mock.previous_time = 1.0 ... mock.previous_error = -12.0 ... mock.integrated_error = 0.0 ... assert pid.PID.calculate_response(mock, 6) == -3.0 ... assert pid.PID.calculate_response(mock, 3) == -4.5 ... assert pid.PID.calculate_response(mock, -1.5) == -0.75 ... assert pid.PID.calculate_response(mock, -2.25) == -1.125 What we've done here is set up a mock object with the proper attributes, and pass it into calculate_response as the self-parameter. We could do this because we didn't create a PID instance at all. Instead, we looked up the method's function inside the class and called it directly, allowing us to pass whatever we wanted as the self-parameter instead of having Python's automatic mechanisms handle it. Never invoking the constructor means that we're immune to any errors it might contain, and guarantees that the object state is exactly what we expect here in our calculate_response test. Summary In this article, we've learned about a family of objects that specialize in impersonating other classes, objects, methods, and functions. We've seen how to configure these objects to handle corner cases where their default behavior isn't sufficient, and we've learned how to examine the activity logs that these mock objects keep, so that we can decide whether the objects are being used properly or not. Resources for Article: Further resources on this subject: Installing NumPy, SciPy, matplotlib, and IPython [Article] Machine Learning in IPython with scikit-learn [Article] Python 3: Designing a Tasklist Application [Article]

Function passing

Packt
19 Nov 2014
6 min read
In this article by Simon Timms, the author of the book Mastering JavaScript Design Patterns, we will cover function passing. In functional programming languages, functions are first-class citizens. Functions can be assigned to variables and passed around just like you would with any other variable. This is not entirely a foreign concept. Even languages such as C had function pointers that could be treated just like other variables. C# has delegates and, in more recent versions, lambdas. The latest release of Java has also added support for lambdas, as they have proven to be so useful.

(For more resources related to this topic, see here.)

JavaScript allows for functions to be treated as variables and even as objects and strings. In this way, JavaScript is functional in nature. Because of JavaScript's single-threaded nature, callbacks are a common convention and you can find them pretty much everywhere. Consider calling a function at a later date on a web page. This is done by setting a timeout on the window object as follows:

setTimeout(function(){alert("Hello from the past")}, 5 * 1000);

The arguments for the setTimeout function are a function to call and a time to delay in milliseconds.

No matter the JavaScript environment in which you're working, it is almost impossible to avoid functions in the shape of callbacks. The asynchronous processing model of Node.js is highly dependent on being able to call a function and pass in something to be completed at a later date. Making calls to external resources in a browser is also dependent on a callback to notify the caller that some asynchronous operation has completed. In basic JavaScript, this looks like the following code:

var xmlhttp = new XMLHttpRequest();
xmlhttp.onreadystatechange = function() {
  if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
    //process returned data
  }
};
xmlhttp.open("GET", "http://some.external.resource", true);
xmlhttp.send();

You may notice that we assign onreadystatechange before we even send the request. This is because assigning it later may result in a race condition in which the server responds before the function is attached to the ready state change. In this case, we've used an inline function to process the returned data. Because functions are first-class citizens, we can change this to look like the following code:

var xmlhttp;

function requestData() {
  xmlhttp = new XMLHttpRequest();
  xmlhttp.onreadystatechange = processData;
  xmlhttp.open("GET", "http://some.external.resource", true);
  xmlhttp.send();
}

function processData() {
  if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
    //process returned data
  }
}

This is typically a cleaner approach and avoids performing complex processing inline within another function. However, you might be more familiar with the jQuery version of this, which looks something like this:

$.getJSON('http://some.external.resource', function(json) {
  //process returned data
});

In this case, the boilerplate of dealing with ready state changes is handled for you. There is even convenience provided for you should the request for data fail, with the following code:

$.ajax('http://some.external.resource', {
  success: function(json) {
    //process returned data
  },
  error: function() {
    //process failure
  },
  dataType: "json"
});

In this case, we've passed an object into the ajax call, which defines a number of properties. Amongst these properties are function callbacks for success and failure. This method of passing numerous functions into another suggests a great way of providing expansion points for classes.
Likely, you've seen this pattern in use before without even realizing it. Passing functions into constructors as part of an options object is a commonly used approach to providing extension hooks in JavaScript libraries.

Implementation

In Westeros, the tourism industry is almost nonexistent. There are great difficulties with bandits killing tourists and tourists becoming entangled in regional conflicts. Nonetheless, some enterprising folks have started to advertise a grand tour of Westeros in which they will take those with the means on a tour of all the major attractions. From King's Landing to Eyrie, to the great mountains of Dorne, the tour will cover it all. In fact, a rather mathematically inclined member of the tourism board has taken to calling it a Hamiltonian tour, as it visits everywhere once.

The HamiltonianTour class accepts an options object that contains the various places to which a callback can be attached. In our case, the interface for it would look something like the following code:

export class HamiltonianTourOptions {
  onTourStart: Function;
  onEntryToAttraction: Function;
  onExitFromAttraction: Function;
  onTourCompletion: Function;
}

The full HamiltonianTour class looks like the following code:

var HamiltonianTour = (function () {
  function HamiltonianTour(options) {
    this.options = options;
  }
  HamiltonianTour.prototype.StartTour = function () {
    if (this.options.onTourStart && typeof (this.options.onTourStart) === "function")
      this.options.onTourStart();
    this.VisitAttraction("King's Landing");
    this.VisitAttraction("Winterfell");
    this.VisitAttraction("Mountains of Dorne");
    this.VisitAttraction("Eyrie");
    if (this.options.onTourCompletion && typeof (this.options.onTourCompletion) === "function")
      this.options.onTourCompletion();
  };
  HamiltonianTour.prototype.VisitAttraction = function (AttractionName) {
    if (this.options.onEntryToAttraction && typeof (this.options.onEntryToAttraction) === "function")
      this.options.onEntryToAttraction(AttractionName);
    //do whatever one does in an Attraction
    if (this.options.onExitFromAttraction && typeof (this.options.onExitFromAttraction) === "function")
      this.options.onExitFromAttraction(AttractionName);
  };
  return HamiltonianTour;
})();

You can see in this code how we check the options and then execute the callback as needed. This can be done by simply using the following code:

var tour = new HamiltonianTour({
  onEntryToAttraction: function(cityname) {
    console.log("I'm delighted to be in " + cityname);
  }
});
tour.StartTour();

The output of the preceding code will be:

I'm delighted to be in King's Landing
I'm delighted to be in Winterfell
I'm delighted to be in Mountains of Dorne
I'm delighted to be in Eyrie

Summary

In this article, we have learned about function passing. Passing functions is a great approach to solving a number of problems in JavaScript and tends to be used extensively by libraries such as jQuery and frameworks such as Express. It is so commonly adopted that using it adds no barriers to your code's readability.

Resources for Article:

Further resources on this subject:

Creating Java EE Applications [article]
Meteor.js JavaScript Framework: Why Meteor Rocks! [article]
Dart with JavaScript [article]

Searching and Resolving Conflicts

Packt
18 Nov 2014
11 min read
This article, by Eric Pidoux, author of Git Best Practices Guide, covers a part of Git that you will definitely meet: conflicts. How can we resolve them?

(For more resources related to this topic, see here.)

While working together as a team on a project, you will work on the same files. The pull command won't work because there are conflicts, and you might have tried some Git commands and made things worse. In this article, we will find solutions to these conflicts and see how we can fix them. We will cover the following topics:

Finding content inside your Git repository
Stashing your changes
Fixing errors through practical examples

Finding content inside your repository

Sometimes, you will need to find something inside all your files. You can, of course, find it with the search feature of your OS, but Git already knows all your files.

Searching file content

To search for text inside your files, simply use the following command:

Erik@server:~$ git grep "Something to find"

Erik@server:~$ git grep -n body
Master:Website.Index.html:4:       <body>
Master:Website.Index.html:12:       </body>

It will display every match of the given keyword inside your code. All lines use the [commitref]:[filepath]:[linenumber]:[matchingcontent] pattern. Notice that [commitref] isn't displayed on all Git versions.

You can also specify the commit references that grep will use to search the keyword:

Erik@server:~$ git grep -n body d32lf56 p88e03d HEAD~3
Master:Website.Index.html:4:       <body>
Master:Website.Index.html:12:       </body>

In this case, grep will look into the d32lf56 and p88e03d commits, and the third commit before the HEAD pointer.

Your repository has to be encoded in UTF-8; otherwise, the grep command won't work.

Git allows you to use a regex inside the search feature by replacing the search string with a regex. You can use the logical operators (or and and), as shown in the following commands:

Erik@server:~$ git grep -e myRegex1 --or -e myRegex2
Erik@server:~$ git grep -e myRegex1 --and -e myRegex2

Let's see this with an example. We only have a test.html page inside our last commit, and we want to find whether or not there is a word containing both uppercase alphabetic and numeric characters:

Erik@server:~$ git grep -e [A-Z] --and -e [0-9] HEAD
Master:Website.Test.html:6:       TEST01

With the grep command, you can delve deeper, but it's not necessary to discuss this topic here because you won't use it every day!

Showing the current status

The git status command is helpful if you have to analyze your repository:

Erik@server:~$ git status
# On branch master
# Your branch is ahead of 'origin/master' by 2 commits
#   (use "git push" to publish your local commits)
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified:   myFile1
# modified:   myFile2
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
# newFile.txt
no changes added to commit (use "git add" and/or "git commit -a")

Git analyzes the local repository in comparison to the remote repository. In this case, you have to add newFile.txt, commit myFile1 and myFile2, and push them to the remote repository.

Exploring the repository history

The best way to explore past commits inside your repository is to use the git log command. For this part, we will assume that there are only two commits.
To display all commits, use the following command:

Erik@server:~$ git log --all
Commit xxxxxxxxxxx
Author: Jim <[email protected]>
Date: Sun Jul 20 15:10:12 2014 -0300

    Fix front bugs on banner

Commit xxxxxxxxxxx
Author: Erik <[email protected]>
Date: Sat Jul 19 07:06:14 2014 -0300

    Add the crop feature on website backend

This is probably not what you want. After several days of work, you will have plenty of these commits, so how will you filter them? The power of the git log command is that you can quickly find anything in all commits. Let's go for a quick overview of what Git is able to find. We will start by finding the last commit:

Erik@server:~$ git log -1
Commit xxxxxxxxxxx
Author: Jim <[email protected]>
Date: Sun Jul 20 15:10:12 2014 -0300

    Fix front bugs on banner

The number after the git log command indicates that it is the first commit from HEAD. Too easy! Let's try to find the last commit made by Erik:

Erik@server:~$ git log --author=Erik -1
Commit xxxxxxxxxxx
Author: Erik <[email protected]>
Date: Sat Jul 19 07:06:14 2014 -0300

    Add the crop feature on website backend

Now, let's find it between two dates:

Erik@server:~$ git log --author=Erik --before "2014-07-20" --after "2014-07-18"
Commit xxxxxxxxxxx
Author: Erik <[email protected]>
Date: Sat Jul 19 07:06:14 2014 -0300

    Add the crop feature on website backend

As I told you earlier, there are a lot of parameters to the git log command. You can see all of them using the git help log command. The stat parameter is really useful:

Erik@server:~$ git log --author=Jim --stat
Commit xxxxxxxxxxx
Author: Jim <[email protected]>
Date: Sun Jul 20 15:10:12 2014 -0300

    Fix front bugs on banner

 index.php | 1 +
 1 file changed, 1 insertion(+)

This parameter allows you to view a summary of the changes made in each commit. If you want to see the full changes, try the -p parameter. Remember that the git log command also accepts a file parameter (git log [file]) to restrict the search to a single file.

Viewing changes

There are two ways to see changes in a repository: git diff and git show. The git diff command lets you see the changes that are not committed. For example, we have an index.php file and replace its content with a single line. Just before the lines, you will see a plus (+) or minus (-) sign. The + sign means that content was added and the - sign denotes that it was removed:

Erik@server:~$ git diff
diff --git a/index.php b/index.php
index b4d22ea..748ebb2 100644
--- a/index.php
+++ b/index.php
@@ -1,11 +1 @@
-<html>
-
-<head>
-<title>Git is great!</title>
-</head>
-<body>
-<?php
- echo 'Git is great';
-?>
-</body>
-</html>
+<b> I added a line</b>

If you want to analyze a commit, I suggest you use the git show command. It will display the full list of changes of the commit:

Erik@server:~$ git show commitId

There is a way to do the opposite, that is, to display the commits for a file, with git blame:

Erik@server:~$ git blame index.php
e4bac680 (Erik 2014-07-20 19:00:47 +0200 1) <b> I added a line</b>

Cleaning your mistakes

The first thing to know is that you can always clean up your mistakes with Git. Sometimes this will be hard or painful for your code, but you can do it! Let's start this section with how to remove untracked files:

Erik@server:~$ git clean -n

The -n option will make a dry run (it's always important to see what will happen before you regret it).
If you also want to remove directories and ignored files, use this one:

Erik@server:~$ git clean -fdx

With these options, you will delete untracked directories (-d) and ignored files (-x), and force the deletion (-f).

The git reset command

The git reset command allows you to go back to a previous state (for example, a commit). The git reset command has three options (soft, hard, or mixed, by default). In general, the git reset command's aim is to take the current branch, reset it to point somewhere else, and possibly bring the index and work tree along. More concretely, if the master branch (currently checked out) looks like the first row (in the following figure) and you want it to point to B and not C, you will use this command:

Erik@server:~$ git reset B

The following diagram shows exactly what happened with the previous command. The HEAD pointer was reset from C to B:

The following table explains what the options really move:

Option   Head pointer   Working tree   Staging area
Soft     Yes            No             No
Mixed    Yes            No             Yes
Hard     Yes            Yes            Yes

The three options that you can provide on the reset command can be easily explained:

--hard: This option is the simplest. It will restore the content to the given commit, and all the local changes will be erased. The bare git reset --hard command means git reset --hard HEAD, which will reset your files to the previous version and erase your local changes.

--mixed: This option resets the index, but not the work tree. It will reset your local files, but the differences found during the process will be marked as local modifications if you analyze them using git status. It's very helpful if you made some bugs in previous commits and want to keep your local changes.

--soft: This option will keep all your files intact, like --mixed. If you use git status, the changes will appear as changes to commit. You can use this option when you have not committed files as expected, but your work is correct, so you just have to recommit it the way you want.

The git reset command doesn't remove untracked files; use git clean instead.

Canceling a commit

The git revert command allows you to "cancel" your last unpushed commit. I used quotes around cancel because Git doesn't drop the commit; it creates a new commit that executes the opposite of your commit. A pushed commit is irreversible, so you cannot change it. Firstly, let's have a look at the last commits:

Erik@server:~$ git log
commit e4bac680c5818c70ced1205cfc46545d48ae687e
Author: Eric Pidoux
Date:   Sun Jul 20 19:00:47 2014 +0200

    replace all

commit 0335a5f13b937e8367eff35d78c259cf2c4d10f7
Author: Eric Pidoux
Date:   Sun Jul 20 18:23:06 2014 +0200

    commit index.php

We want to cancel the 0335… commit:

Erik@server:~$ git revert 0335a5f13

To cancel this commit, it isn't necessary to enter the full commit ID, just its first characters. Git will find it, but you will have to enter at least six characters to be sure that there isn't another commit that starts with the same characters.

Solving merge conflicts

When you are working with several branches, a conflict will probably occur while merging them. It appears if two commits from different branches modify the same content and Git isn't able to merge them. If this occurs, Git will mark the conflict and you have to resolve it. For example, Jim modified the index.html file on a feature branch and Erik has to edit it on another branch. When Erik merges the two branches, the conflict occurs. Git will tell you to edit the file to resolve the conflict.
In this file, you will find the following:

<<<<<<< HEAD
Changes from Erik
=======
Changes from Jim
>>>>>>> b2919weg63bfd125627gre1911c8b08127c85f8

The <<<<<<< characters indicate the start of the merge conflict, the ======= characters indicate the break point used for comparison, and the >>>>>>> characters indicate the end of the conflict. To resolve a conflict, you have to analyze the differences between the two changes and merge them manually. Don't forget to delete the markers added by Git. After resolving it, simply commit the changes.

If your merge conflict is too complicated to resolve because you can't easily find the differences, Git provides a useful tool to help you. Git's diff helps you to find the differences:

diff --git erik/mergetest jim/mergetest
index 88h3d45..92f62w 130634
--- erik/mergetest
+++ jim/mergetest
@@ -1,3 +1,4 @@
 <body>
+I added this code between
 This is the file content
-I added a third line of code
+And this is the last one

So, what happened? The command displays the lines with the changes: the lines marked with + come from origin/master, those marked with - are from your local repository, and of course, the lines without a mark are common to both repositories.

Summary

In this article, we covered tips and commands that are useful to fix mistakes, resolve conflicts, search inside the commit history, and so on.

Resources for Article:

Further resources on this subject:

Configuration [Article]
Parallel Dimensions – Branching with Git [Article]
Issues and Wikis in GitLab [Article]

Dart with JavaScript

Packt
18 Nov 2014
12 min read
In this article by Sergey Akopkokhyants, author of Mastering Dart, we will combine the simplicity of jQuery and the power of Dart in a real example.

(For more resources related to this topic, see here.)

Integrating Dart with jQuery

For demonstration purposes, we have created the js_proxy package to help the Dart code communicate with jQuery. It is available on the pub manager at https://pub.dartlang.org/packages/js_proxy. This package is layered on dart:js and has a library of the same name and a sole class, JProxy. An instance of the JProxy class can be created via the generative constructor, where we can specify the optional reference to the proxied JsObject:

JProxy([this._object]);

We can create an instance of JProxy with a named constructor and provide the name of the JavaScript object accessible through the dart:js context as follows:

JProxy.fromContext(String name) {
  _object = js.context[name];
}

The JProxy instance keeps a reference to the proxied JsObject and performs all manipulations on it, as shown in the following code:

js.JsObject _object;

js.JsObject get object => _object;

How to create a shortcut to jQuery?

We can use JProxy to create a reference to jQuery via the context from the dart:js library as follows:

var jquery = new JProxy.fromContext('jQuery');

Another very popular way is to use the dollar sign as a shortcut to the jQuery variable, as shown in the following code:

var $ = new JProxy.fromContext('jQuery');

Bear in mind that the original jQuery and $ variables from JavaScript are functions, so our variables reference the JsFunction class. From now on, jQuery lovers who moved to Dart have a chance to use both syntaxes to work with selectors via parentheses.

Why does JProxy need a call method?

Usually, jQuery receives a request to select HTML elements based on IDs, classes, types, attributes, values of their attributes, or their combination, and then performs some action on the results. We can use the basic syntax to pass the search criteria to the jQuery or $ function to select the HTML elements:

$(selector)

Dart has the syntactic sugar method call that helps us emulate a function, so we can use the call method in the jQuery syntax. Dart knows nothing about the number of arguments passed to the function, so we use a fixed number of optional arguments in the call method. Through this method, we invoke the proxied function (because jquery and $ are functions) and return the results wrapped in JProxy:

dynamic call([arg0 = null, arg1 = null, arg2 = null,
    arg3 = null, arg4 = null, arg5 = null, arg6 = null,
    arg7 = null, arg8 = null, arg9 = null]) {
  var args = [];
  if (arg0 != null) args.add(arg0);
  if (arg1 != null) args.add(arg1);
  if (arg2 != null) args.add(arg2);
  if (arg3 != null) args.add(arg3);
  if (arg4 != null) args.add(arg4);
  if (arg5 != null) args.add(arg5);
  if (arg6 != null) args.add(arg6);
  if (arg7 != null) args.add(arg7);
  if (arg8 != null) args.add(arg8);
  if (arg9 != null) args.add(arg9);
  return _proxify((_object as js.JsFunction).apply(args));
}

How does JProxy invoke jQuery?

The JProxy class is a proxy to other classes, so it is marked with the @proxy annotation. We override noSuchMethod intentionally to call the proxied methods and properties of jQuery when the methods or properties of the proxy are invoked. The logic flow in noSuchMethod is pretty straightforward. It invokes callMethod of the proxied JsObject when we invoke a method on the proxy, or returns the value of a property of the proxied object if we access the corresponding property on the proxy.
The code is as follows:

@override
dynamic noSuchMethod(Invocation invocation) {
  if (invocation.isMethod) {
    return _proxify(_object.callMethod(
        symbolAsString(invocation.memberName),
        _jsify(invocation.positionalArguments)));
  } else if (invocation.isGetter) {
    return _proxify(_object[symbolAsString(invocation.memberName)]);
  } else if (invocation.isSetter) {
    throw new Exception('The setter feature was not implemented yet.');
  }
  return super.noSuchMethod(invocation);
}

As you might remember, all map or Iterable arguments must be converted to JsObject with the help of the jsify method. In our case, we call the _jsify method to check and convert the arguments passed to the called function, as shown in the following code:

List _jsify(List params) {
  List res = [];
  params.forEach((item) {
    if (item is Map || item is List) {
      res.add(new js.JsObject.jsify(item));
    } else {
      res.add(item);
    }
  });
  return res;
}

Before returning, the result must be passed through the _proxify function as follows:

dynamic _proxify(value) {
  return value is js.JsObject ? new JProxy(value) : value;
}

This function wraps every JsObject in a JProxy and passes other values through as they are.

An example project

Now create the jquery project, open the pubspec.yaml file, and add js_proxy to the dependencies. Open the jquery.html file and make the following changes:

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>jQuery</title>
  <link rel="stylesheet" href="jquery.css">
</head>
<body>
  <h1>Jquery</h1>
  <p>I'm a paragraph</p>
  <p>Click on me to hide</p>
  <button>Click me</button>
  <div class="container">
    <div class="box"></div>
  </div>
</body>
<script src="//code.jquery.com/jquery-1.11.0.min.js"></script>
<script type="application/dart" src="jquery.dart"></script>
<script src="packages/browser/dart.js"></script>
</html>

This project aims to demonstrate that:

Communication between Dart and JavaScript is easy
The syntax of the Dart code can be similar to the jQuery code

In general, you may copy the JavaScript code, paste it into the Dart code, and probably make only slight changes.

How to get the jQuery version?

It's time to add js_proxy to our code. Open jquery.dart and make the following changes:

import 'dart:html';
import 'package:js_proxy/js_proxy.dart';

/**
 * Shortcut for jQuery.
 */
var $ = new JProxy.fromContext('jQuery');

/**
 * Shortcut for browser console object.
 */
var console = window.console;

main() {
  printVersion();
}

/**
 * jQuery code:
 *
 *   var ver = $().jquery;
 *   console.log("jQuery version is " + ver);
 *
 * JS_Proxy based analog:
 */
printVersion() {
  var ver = $().jquery;
  console.log("jQuery version is " + ver);
}

You should already be familiar with the jQuery and console shortcuts. The call to jQuery with empty parentheses returns a JProxy that contains a JsObject referencing jQuery from JavaScript. The jQuery object has a jquery property that contains the current version number, so we reach it via noSuchMethod of JProxy. Run the application, and you will see the following result in the console:

jQuery version is 1.11.1

Let's move on and perform some actions on the selected HTML elements.

How to perform actions in jQuery?
The syntax of jQuery is based on selecting HTML elements and performing some actions on them:

$(selector).action();

Let's select a button on the HTML page and attach a click event listener to it, as shown in the following code:

/**
 * jQuery code:
 *
 *   $("button").click(function(){
 *     alert('You click on button');
 *   });
 *
 * JS_Proxy based analog:
 */
events() {
  // We remove 'function' and add 'event' here
  $("button").click((event) {
    // Call method 'alert' of 'window'
    window.alert('You click on button');
  });
}

All we need to do here is remove the function keyword, because anonymous functions in Dart do not use it, and add the event parameter, because this argument is required in the Dart version of the event listener. The code calls jQuery to find all the HTML button elements and add the click event listener to each of them. So when we click on any button and run the application, the specified alert message will be displayed.

How to use effects in jQuery?

jQuery supports animation out of the box, so it sounds very tempting to use it from Dart. Let's take the following code snippet as an example:

/**
 * jQuery code:
 *
 *   $("p").click(function() {
 *     this.hide("slow", function(){
 *       alert("The paragraph is now hidden");
 *     });
 *   });
 *   $(".box").click(function(){
 *     var box = this;
 *     startAnimation();
 *     function startAnimation(){
 *       box.animate({height:300},"slow");
 *       box.animate({width:300},"slow");
 *       box.css("background-color","blue");
 *       box.animate({height:100},"slow");
 *       box.animate({width:100},"slow",startAnimation);
 *     }
 *   });
 *
 * JS_Proxy based analog:
 */
effects() {
  $("p").click((event) {
    $(event['target']).hide("slow", () {
      window.alert("The paragraph is now hidden");
    });
  });
  $(".box").click((event) {
    var box = $(event['target']);
    startAnimation() {
      box.animate({'height': 300}, "slow");
      box.animate({'width': 300}, "slow");
      box.css("background-color", "blue");
      box.animate({'height': 100}, "slow");
      box.animate({'width': 100}, "slow", startAnimation);
    };
    startAnimation();
  });
}

This code finds all the paragraphs on the web page and adds a click event listener to each one. The JavaScript code uses the this keyword as a reference to the selected paragraph to start the hiding animation. The this keyword has a different notion in JavaScript and Dart, so we cannot use it directly in anonymous functions in Dart. The target property of the event keeps a reference to the clicked element and is presented as a JsObject in Dart. We wrap the clicked element in a JProxy instance and use it to call the hide method. jQuery is big enough that we have no space in this article to discover all its features, but you can find more examples at https://github.com/akserg/js_proxy.

What are the performance impacts?

Now, we should talk about the performance impacts of using these different approaches across several modern web browsers.
The algorithm must perform all the following actions:

It should create 10000 DIV elements
Each element should be added to the same DIV container
Each element should be updated with one style
All elements must be removed one by one

This algorithm must be implemented in the following solutions:

The pure jQuery solution in JavaScript
The jQuery solution called via JProxy and dart:js from Dart
The pure Dart solution based on dart:html

We implemented this algorithm in all of them, so we have a chance to compare the results and choose the champion. The following HTML code has three buttons to run the independent tests, three paragraph elements to show the results of the tests, and one DIV element used as a container. The code is as follows:

<div>
  <button id="run_js" onclick="run_js_test()">Run JS</button>
  <button id="run_jproxy">Run JProxy</button>
  <button id="run_dart">Run Dart</button>
</div>

<p id="result_js"></p>
<p id="result_jproxy"></p>
<p id="result_dart"></p>
<div id="container"></div>

The JavaScript code based on jQuery is as follows:

function run_js_test() {
  var startTime = new Date();
  process_js();
  var diff = new Date(new Date().getTime() -
      startTime.getTime()).getTime();
  $('#result_js').text('jQuery took ' + diff +
      ' ms to process 10000 HTML elements.');
}

function process_js() {
  var container = $('#container');
  // Create 10000 DIV elements
  for (var i = 0; i < 10000; i++) {
    $('<div>Test</div>').appendTo(container);
  }
  // Find and update classes of all DIV elements
  $('#container > div').css("color", "red");
  // Remove all DIV elements
  $('#container > div').remove();
}

The main code registers the click event listeners and calls the run_dart_js_test function. The first parameter of this function is the test function to run. The second and third parameters are used to pass the selector of the result element and the test title:

void main() {
  querySelector('#run_jproxy').onClick.listen((event) {
    run_dart_js_test(process_jproxy, '#result_jproxy', 'JProxy');
  });
  querySelector('#run_dart').onClick.listen((event) {
    run_dart_js_test(process_dart, '#result_dart', 'Dart');
  });
}

run_dart_js_test(Function fun, String el, String title) {
  var startTime = new DateTime.now();
  fun();
  var diff = new DateTime.now().difference(startTime);
  querySelector(el).text =
      '$title took ${diff.inMilliseconds} ms to process 10000 HTML elements.';
}

Here is the Dart solution based on JProxy and dart:js:

process_jproxy() {
  var container = $('#container');
  // Create 10000 DIV elements
  for (var i = 0; i < 10000; i++) {
    $('<div>Test</div>').appendTo(container.object);
  }
  // Find and update classes of all DIV elements
  $('#container > div').css("color", "red");
  // Remove all DIV elements
  $('#container > div').remove();
}

Finally, the pure Dart solution based on dart:html is as follows:

process_dart() {
  // Create 10000 DIV elements
  var container = querySelector('#container');
  for (var i = 0; i < 10000; i++) {
    container.appendHtml('<div>Test</div>');
  }
  // Find and update classes of all DIV elements
  querySelectorAll('#container > div').forEach((Element el) {
    el.style.color = 'red';
  });
  // Remove all DIV elements
  querySelectorAll('#container > div').forEach((Element el) {
    el.remove();
  });
}

All the results are in milliseconds. Run the application and wait until the web page is fully loaded. Run each test by clicking on the appropriate button.
My results of the tests on Dartium, Chrome, Firefox, and Internet Explorer are shown in the following table:

Web browser          jQuery framework   jQuery via JProxy   Library dart:html
Dartium              2173               3156                714
Chrome               2935               6512                795
Firefox              2485               5787                582
Internet Explorer    12262              17748               2956

Now, we have the absolute champion: the Dart-based solution. Even the Dart code compiled to JavaScript for execution in Chrome, Firefox, and Internet Explorer works quicker than jQuery (four to five times) and much quicker than the dart:js and JProxy class-based solution (four to ten times).

Summary

This article showed you how to use Dart and JavaScript together to build web applications. It listed the problems you may encounter when communicating between Dart and an existing JavaScript program, along with their solutions. We compared the jQuery, JProxy with dart:js, and pure dart:html solutions to identify which one is the quickest.

Resources for Article:

Further resources on this subject:

Handling the DOM in Dart [article]
Dart Server with Dartling and MongoDB [article]
Handle Web Applications [article]