Hypothesis testing with R

Richa Tripathi
13 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Learning Quantitative Finance with R written by Dr. Param Jeet and Prashant Vats. This book will help you understand the basics of R and how they can be applied in various Quantitative Finance scenarios.[/box] Hypothesis testing is used to reject or retain a hypothesis based upon the measurement of an observed sample. So in today’s tutorial we will discuss how to implement the various scenarios of hypothesis testing in R. Lower tail test of population mean with known variance The null hypothesis is given by where is the hypothesized lower bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $10. The average of 30 days' daily return sample is $9.9. Assume the population standard deviation is 0.011. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics z which can be computed by the following code in R: > xbar= 9.9 > mu0 = 10 > sig = 1.1 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z Here: xbar: Sample mean mu: Hypothesized value sig: Standard deviation of population n: Sample size z: Test statistics This gives the value of z the test statistics: [1] -0.4979296 Now let us find out the critical value at 0.05 significance level. It can be computed by the following code: > alpha = .05 > z.alpha = qnorm(1-alpha) > -z.alpha This gives the following output: [1] -1.644854 Since the value of the test statistics is greater than the critical value, we fail to reject the null hypothesis claim that the return is greater than $10. In place of using the critical value test, we can use the pnorm function to compute the lower tail of Pvalue test statistics. This can be computed by the following code: > pnorm(z) This gives the following output: [1] 0.3092668 Since the Pvalue is greater than 0.05, we fail to reject the null hypothesis. Upper tail test of population mean with known variance The null hypothesis is given by  where  is the hypothesized upper bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $5. The average of 30 days' daily return sample is $5.1. Assume the population standard deviation is 0.25. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics z, which can be computed by the following code in R: > xbar= 5.1 > mu0 = 5 > sig = .25 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z Here: xbar: Sample mean mu0: Hypothesized value sig: Standard deviation of population n: Sample size z: Test statistics It gives 2.19089 as the value of test statistics. Now let us calculate the critical value at .05 significance level, which is given by the following code: > alpha = .05 > z.alpha = qnorm(1-alpha) > z.alpha This gives 1.644854, which is less than the value computed for the test statistics. Hence we reject the null hypothesis claim. Also, the Pvalue of the test statistics is given as follows: >pnorm(z, lower.tail=FALSE) This gives 0.01422987, which is less than 0.05 and hence we reject the null hypothesis. Two-tailed test of population mean with known variance The null hypothesis is given by  where  is the hypothesized value of the population mean. Let us assume a scenario where the mean of daily returns of a stock last year is $2. The average of 30 days' daily return sample is $1.5 this year. 
Assume the population standard deviation is 0.1. Can we reject the null hypothesis that there is no significant difference between this year's returns and last year's, at the .05 significance level? Let us calculate the test statistic z, which can be computed by the following code in R:

> xbar = 1.5
> mu0 = 2
> sig = .1
> n = 30
> z = (xbar-mu0)/(sig/sqrt(n))
> z

This gives the value of the test statistic as -27.38613. Now let us find the critical values for comparing the test statistic at the .05 significance level, which are given by the following code:

> alpha = .05
> z.half.alpha = qnorm(1-alpha/2)
> c(-z.half.alpha, z.half.alpha)

This gives the values -1.959964 and 1.959964. Since the value of the test statistic is not within the range (-1.959964, 1.959964), we reject the null hypothesis claim that there is no significant difference between this year's returns and last year's at the .05 significance level. The two-tailed p-value is given as follows:

> 2*pnorm(z)

This gives a value less than .05, so we reject the null hypothesis.

In all the preceding scenarios, the variance of the population is known and we use the normal distribution for hypothesis testing. In the next scenarios, we will not be given the variance of the population, so we will use the t distribution for testing the hypothesis.

Lower tail test of population mean with unknown variance

The null hypothesis is given by H0: μ ≥ μ0, where μ0 is the hypothesized lower bound of the population mean. Let us assume a scenario where an investor assumes that the mean of the daily returns of a stock since inception is greater than $1. The average of a 30-day sample of daily returns is $0.9. Assume the sample standard deviation is 0.1. Can we reject the null hypothesis at the .05 significance level? In this scenario, we can compute the test statistic by executing the following code:

> xbar = .9
> mu0 = 1
> sig = .1
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

Here:

xbar: Sample mean
mu0: Hypothesized value
sig: Standard deviation of the sample
n: Sample size
t: Test statistic

This gives the value of the test statistic as -5.477226. Now let us compute the critical value at the .05 significance level, which is given by the following code:

> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> -t.alpha

We get the value -1.699127. Since the value of the test statistic is less than the critical value, we reject the null hypothesis claim. Instead of the critical value, we can use the p-value associated with the test statistic, which is given as follows:

> pt(t, df=n-1)

This results in a value less than .05, so we can reject the null hypothesis claim.

Upper tail test of population mean with unknown variance

The null hypothesis is given by H0: μ ≤ μ0, where μ0 is the hypothesized upper bound of the population mean. Let us assume a scenario where an investor assumes that the mean of the daily returns of a stock since inception is at most $3. The average of a 30-day sample of daily returns is $3.1. Assume the sample standard deviation is 0.2. Can we reject the null hypothesis at the .05 significance level? Let us calculate the test statistic t, which can be computed by the following code in R:

> xbar = 3.1
> mu0 = 3
> sig = .2
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

The variables are as in the previous example, with sig again being the standard deviation of the sample. This gives 2.738613 as the value of the test statistic. Now let us find the critical value associated with the .05 significance level for the test statistic.
It is given by the following code:

> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> t.alpha

Since the critical value 1.699127 is less than the value of the test statistic, we reject the null hypothesis claim. Also, the p-value associated with the test statistic is given as follows:

> pt(t, df=n-1, lower.tail=FALSE)

This is less than .05, hence the null hypothesis claim gets rejected.

Two-tailed test of population mean with unknown variance

The null hypothesis is given by H0: μ = μ0, where μ0 is the hypothesized value of the population mean. Let us assume a scenario where the mean of the daily returns of a stock last year was $2, and the average of a 30-day sample of daily returns this year is $1.9. Assume the sample standard deviation is 0.1. Can we reject the null hypothesis that there is no significant difference between this year's returns and last year's, at the .05 significance level? Let us calculate the test statistic t, which can be computed by the following code in R:

> xbar = 1.9
> mu0 = 2
> sig = .1
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

This gives -5.477226 as the value of the test statistic. Now let us find the critical value range for comparison, which is given by the following code:

> alpha = .05
> t.half.alpha = qt(1-alpha/2, df=n-1)
> c(-t.half.alpha, t.half.alpha)

This gives the range (-2.04523, 2.04523). Since the value of the test statistic falls outside this range, we reject the claim of the null hypothesis. The short sketch below wraps the z-test recipe used throughout this excerpt into a reusable function.
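The same four inputs (xbar, mu0, sig, n) recur in every known-variance example above, so it can be convenient to wrap the recipe in a helper. This is an editorial sketch, not code from the book; the z_test name and its interface are our own:

z_test <- function(xbar, mu0, sig, n,
                   alternative = c("less", "greater", "two.sided")) {
  alternative <- match.arg(alternative)
  z <- (xbar - mu0) / (sig / sqrt(n))  # test statistic
  p <- switch(alternative,
              less      = pnorm(z),                     # lower tail test
              greater   = pnorm(z, lower.tail = FALSE), # upper tail test
              two.sided = 2 * pnorm(-abs(z)))           # two-tailed test
  list(statistic = z, p.value = p)
}

# Reproduces the first example: z = -0.4979296, p = 0.3092668
z_test(xbar = 9.9, mu0 = 10, sig = 1.1, n = 30, alternative = "less")

Swapping qnorm/pnorm for qt/pt (with df = n - 1) gives the corresponding t-based variant for the unknown-variance cases.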
We learned how to practically perform one-tailed and two-tailed hypothesis testing with known as well as unknown variance using R. If you enjoyed this excerpt, check out the book Learning Quantitative Finance with R to explore different methods to manage risks and trading using machine learning with R.

Compression Formats in Linux Shell Script

Packt
31 Jan 2011
6 min read
This article is based on the book Linux Shell Scripting Cookbook:

Solve real-world shell scripting problems with over 110 simple but incredibly effective recipes
Master the art of crafting one-liner command sequences to perform tasks such as text processing, digging data from files, and a lot more
Practical problem-solving techniques adherent to the latest Linux platform
Packed with easy-to-follow examples to exercise all the features of the Linux shell scripting language
Part of Packt's Cookbook series: each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

Compressing with gzip

gzip is a commonly used compression format on GNU/Linux platforms. Utilities such as gzip, gunzip, and zcat are available to handle gzip compression file types. gzip can be applied to a single file only; it cannot archive directories and multiple files, hence we use a tar archive and compress it with gzip. When multiple files are given as input, it will produce several individually compressed (.gz) files. Let's see how to operate with gzip.

How to do it...

In order to compress a file with gzip, use the following command:

$ gzip filename
$ ls
filename.gz

It will remove the original file and produce a compressed file called filename.gz.

Extract a gzip compressed file as follows:

$ gunzip filename.gz

It will remove filename.gz and produce an uncompressed version of the file.

In order to list out the properties of a compressed file, use:

$ gzip -l test.txt.gz
compressed uncompressed ratio uncompressed_name
35 6 -33.3% test.txt

The gzip command can read a file from stdin and also write a compressed file to stdout.

Read from stdin and write to stdout as follows:

$ cat file | gzip -c > file.gz

The -c option is used to specify output to stdout. We can specify the compression level for gzip: use --fast or --best to provide low and high compression ratios, respectively.

There's more...

The gzip command is often used with other commands. It also has advanced options to specify the compression ratio. Let's see how to work with these features.

Gzip with tarball

We usually use gzip with tarballs. A tarball can be compressed by using the -z option passed to the tar command while archiving and extracting. You can create gzipped tarballs using the following methods:

Method 1:

$ tar -czvvf archive.tar.gz [FILES]

Or:

$ tar -cavvf archive.tar.gz [FILES]

The -a option specifies that the compression format should automatically be detected from the extension.

Method 2: First, create a tarball:

$ tar -cvvf archive.tar [FILES]

Then compress it after tarballing as follows:

$ gzip archive.tar

If many files (a few hundred) are to be archived in a tarball and compressed, we use Method 2 with a few changes. The issue with giving many files as command arguments to tar is that it can accept only a limited number of files from the command line. In order to solve this issue, we can create a tar file by adding files one by one using a loop with the append option (-r) as follows:

FILE_LIST="file1 file2 file3 file4 file5"
for f in $FILE_LIST;
do
tar -rvf archive.tar $f
done
gzip archive.tar

In order to extract a gzipped tarball, use -x for extraction and -z for gzip specification:

$ tar -xzvvf archive.tar.gz -C extract_directory

Or:

$ tar -xavvf archive.tar.gz -C extract_directory

In the above command, the -a option is used to detect the compression format automatically.

zcat – reading gzipped files without extracting

zcat is a command that can be used to dump an extracted file from a .gz file to stdout without manually extracting it.
The .gz file remains as before, but it will dump the extracted content to stdout as follows:

$ ls
test.gz
$ zcat test.gz
A test file
# file test contains a line "A test file"
$ ls
test.gz

Compression ratio

We can specify the compression ratio, which is available in the range 1 to 9, where 1 is the lowest but fastest, and 9 is the best but slowest. You can also specify the ratios in between, as follows:

$ gzip -9 test.img

This will compress the file to the maximum.

Compressing with bzip2

bzip2 is another compression technique which is very similar to gzip. bzip2 typically produces smaller (more compressed) files than gzip. It comes with all Linux distributions. Let's see how to use bzip2.

How to do it...

In order to compress with bzip2, use:

$ bzip2 filename
$ ls
filename.bz2

It will remove the original file and produce a compressed file called filename.bz2.

Extract a bzipped file as follows:

$ bunzip2 filename.bz2

It will remove filename.bz2 and produce an uncompressed version of filename.

bzip2 can read a file from stdin and also write a compressed file to stdout. In order to read from stdin and write to stdout, use:

$ cat file | bzip2 -c > file.tar.bz2

-c is used to specify output to stdout.

We usually use bzip2 with tarballs. A tarball can be compressed by using the -j option passed to the tar command while archiving and extracting. Creating a bzipped tarball can be done by using the following methods:

Method 1:

$ tar -cjvvf archive.tar.bz2 [FILES]

Or:

$ tar -cavvf archive.tar.bz2 [FILES]

The -a option specifies to automatically detect the compression format from the extension.

Method 2: First create the tarball:

$ tar -cvvf archive.tar [FILES]

Then compress it after tarballing:

$ bzip2 archive.tar

If we need to add hundreds of files to the archive, the above commands may fail. To fix that issue, use a loop to append files to the archive one by one using the -r option.

Extract a bzipped tarball as follows:

$ tar -xjvvf archive.tar.bz2 -C extract_directory

In this command:

-x is used for extraction
-j is for bzip2 specification
-C is for specifying the directory to which the files are to be extracted

Or, you can use the following command:

$ tar -xavvf archive.tar.bz2 -C extract_directory

-a will automatically detect the compression format.

There's more...

bzip2 has several additional options to carry out different functions. Let's go through a few of them.

Keeping input files without removing them

While using bzip2 or bunzip2, it will remove the input file and produce a compressed output file. We can prevent it from removing input files by using the -k option. For example:

$ bunzip2 test.bz2 -k
$ ls
test test.bz2

Compression ratio

We can specify the compression ratio, which is available in the range of 1 to 9 (where 1 is the least compression but fast, and 9 is the highest possible compression but much slower). For example:

$ bzip2 -9 test.img

This command provides maximum compression.
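To see the speed/size trade-off concretely, here is a small editorial sketch (not from the book; test.img is a placeholder for any reasonably large file). It compresses the same input at both ends of the 1 to 9 range with each tool, using -c so the input file is kept:

#!/bin/bash
# Compare gzip and bzip2 at the fastest (1) and best (9) levels.
for level in 1 9; do
    gzip  -"$level" -c test.img > "test-$level.gz"
    bzip2 -"$level" -c test.img > "test-$level.bz2"
done
ls -l test.img test-*.gz test-*.bz2  # compare the resulting sizes

On most inputs, the .bz2 files should come out smaller than the corresponding .gz files, at the cost of longer compression time.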

Type, Subtype, and Category Patterns in Logical Data Modeling

Packt
24 Oct 2009
4 min read
Before I cover the three logical data modeling patterns, let's review briefly how we typically model a type. Let's say you're in the car business. You can model a car as a Car entity, shown in the figure below; its sample data values are in the following table (I just use six digits for the VIN, instead of the 17-character VIN standard, in the sample). VIN (Vehicle Identification Number), the car serial number, is the unique key of the Car entity. The other attributes (Brand, Model, Year, and Manufacturer's Suggested Retail Price) can be thought of as the type of a specific car with a unique VIN. So, the type is in the entity itself. Note that you can have more than one car—each with a unique VIN—of the same type, such as the first three Honda Accords in the sample table. If you have many cars of the same type, or you have many car types and they're dynamic (they change: new, update, delete; for example, an update of the MSRP), you can easily recognize that this model is not suitable—a type model is a better solution.

VIN     Brand   Model    Year  MSRP
123987  Honda   Accord   2007  20,000
456321  Honda   Accord   2007  20,000
555666  Honda   Accord   2007  20,000
678345  Toyota  Corolla  2008  21,000
...

Type

The ER (Entity Relationship) diagram in the following figure shows the Car Type and Car entities and their relationship. Car Type defines each type of your cars—a type is a definition of something. The Car is the individual car, each with a serial number (Vehicle Identification Number) that has a specific type defined in Car Type. You can think of a Car Type entity as a template used (instantiated) by an individual car. Now you can have as many car types as you need, and type changes don't affect the cars. The two tables after the figure contain sample data values of the Car Type—Car data model. Note that a car can belong to one car type only. On the other hand, a car type can be the type of many cars.

Car Type Key  Brand   Model    Year  MSRP
1             Honda   Accord   2007  20,000
2             Toyota  Corolla  2008  21,000
...

VIN     Car Type Key  Owner
123987  1             Djoni Darmawikarta
456321  1             Kevin Peter
555666  1             Rao Ganipineni
678345  2             Sherman Chang
...

How do we deal with a product that doesn't have an individual identifier? Can we apply the same data modeling structure to, for example, commercial books? You certainly have inventory; each Inventory is an instance of the Book Type. The following figure shows the Book Type—Book data model and its sample data values, respectively. You can also apply the same data model to intangible things, such as Service; an individual service may be identified by, for example, a contract number. The following figure and the last table in the article show the Service Type—Service data model and its sample data values, respectively.

Subtype

What if you have cars that have different sets of attributes, meaning different types? You can model the different types as subtypes. The following figure shows two subtypes of the Car Type entity: Passenger Car Type and Truck Type. The Car supertype has the common attributes of its subtypes, while each of the subtypes has its own distinct attributes.

Category

While a Type is a definition of something, a Category is a way to categorize something. While a Service can be of only one type, it can be of more than one category—its relationship to the Category entity is many-to-many. An example of categories for Service is shown in the following figure, with its sample data values in the table after it.
Note that you need to resolve the many-to-many relationship at implementation time; an example resolution follows the table below.

Service Category Key  Service Category
1                     Bundled
2                     Outsourced
3                     Onsite
4                     Software
5                     Hardware
...
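As a hedged illustration (the key values here are invented for this example, not taken from the article), the resolution is typically an intersection (associative) entity holding one row per Service/Service Category pairing:

Service Key  Service Category Key
1001         1
1001         4
1002         2
...          ...

Each row links one service to one category, so service 1001 can be both Bundled and Software without repeating any service or category attributes.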
Summary

Type, Subtype, and Category are similar patterns for data modeling. This article introduced these three patterns and showed their differences. One or more of them exist in most data models. If your initial data model doesn't have any one of them, you should re-inspect the data model.

Polygon Construction - Make a 3D printed kite

Michael Ang
30 Oct 2014
9 min read
3D printers are incredible machines, but let's face it, it takes them a long time to work their magic! Printing small objects takes a relatively short amount of time, but larger objects can take hours and hours to print. Is there a way we can get the speed of printing small objects, while still making something big—even bigger than our printer can make in one piece? This tutorial shows a technique I'm calling "polygon construction", where you 3D print connectors and attach them with rods to make a larger structure. This technique is the basis for my Polygon Construction Kit (Polycon).

A Polycon object, with 3D printer for scale

I'm going to start simple, showing how even one connector can form the basis of a rather delightful object—a simple flying kite! The kite we're making today is a version of the Eddy diamond kite, originally invented in the 1890s by William Eddy. This classic diamond kite is easy to make, and flies well. We'll design and print the central connector in the kite, and use wooden rods to extend the shape. The total size of the kite is 50 centimeters (about 20 inches) tall and wide, which is bigger than most print beds. We'll design the connector so that it's parametric—we'll be able to change the important sizes of the connector just by changing a few numbers. Because the connector is a small object to print, and the design is parametric, we'll be able to iterate quickly if we want to make changes.

A finished kite

The connector we need is a cross that holds two of the "arms" up at an angle. This angle is called the dihedral angle, and is one of the secrets to why the Eddy kite flies well. We'll use the OpenSCAD modeling program to create the connector. OpenSCAD allows you to create solid objects for printing by combining simple shapes such as cylinders and boxes using a basic programming language. It's open source and multi-platform. You can download OpenSCAD from http://openscad.org.

Open OpenSCAD. You should have a blank document. Let's set up a few variables to represent the important dimensions in our connector, and make the first part of our connector, which is a cylinder that will be one of the four "arms" of the cross. Go to Design->Compile to see the result.

rod_diameter = 4; // in millimeters
wall_thickness = 2;
tube_length = 20;
angle = 15; // degrees

cylinder(r = rod_diameter + wall_thickness * 2, h = tube_length);

First part of the connector

Now let's add the same shape, but translated down. You can see the axis indicator in the lower-left corner of the output window. The blue axis pointing up is the Z-axis, so you want to move down (negative) in the Z-axis. Add this line to your file and recompile (Design->Compile).

translate([0,0,-tube_length]) cylinder(r = rod_diameter + wall_thickness * 2, h = tube_length);

Second part of the main tube

Now that we have the long straight part of our connector, let's add the angled arms. We want the angled part of the connector to be 90 degrees from the straight part, and then rotated by our dihedral angle (in this case 15 degrees). In OpenSCAD, rotations can be specified as a rotation around the X, Y, and then Z axes. If you look at the axis indicator, you can see that a rotation around the Y-axis of 90 degrees, followed by a rotation around the Z-axis of 15 degrees, will get us to the right place. Here's the code:

rotate([0,90,angle]) cylinder(r = rod_diameter + wall_thickness * 2, h = tube_length);

First angled part

Let's do the same thing, but for the other side.
Instead of rotating by 15 degrees, we'll rotate by 180 degrees and then subtract out the 15 degrees to put the new cylinder on the opposite side.

rotate([0,90,180-angle]) cylinder(r = rod_diameter + wall_thickness * 2, h = tube_length);

Opposite angled part

Awesome, we have the shape of our connector! There's only one problem: how do we make the holes for the rods to go in? To do this we'll make the same shape, but a little smaller, and then subtract it out of the shape we already made. OpenSCAD supports Boolean operations on shapes, and in this case the Boolean operation we want is difference. To make the Boolean operation easier, we'll group the different parts of the shape together by putting them into modules. Once we have the parts we want together in modules, we can take the difference of the two modules. Here's the complete new version:

rod_diameter = 4; // in millimeters
wall_thickness = 2;
tube_length = 20;
angle = 15; // degrees

// Connector as a solid object (no holes)
module solid() {
    cylinder(r = rod_diameter + wall_thickness * 2, h = tube_length);
    translate([0,0,-tube_length])
        cylinder(r = rod_diameter + wall_thickness * 2, h = tube_length);
    rotate([0,90,angle])
        cylinder(r = rod_diameter + wall_thickness * 2, h = tube_length);
    rotate([0,90,180-angle])
        cylinder(r = rod_diameter + wall_thickness * 2, h = tube_length);
}

// Object representing the space for the rods.
module hole_cutout() {
    cut_overlap = 0.2; // Extra length to make a clean cut out of the main shape
    cylinder(r = rod_diameter, h = tube_length + cut_overlap);
    translate([0,0,-tube_length-cut_overlap])
        cylinder(r = rod_diameter, h = tube_length + cut_overlap);
    rotate([0,90,angle])
        cylinder(r = rod_diameter, h = tube_length + cut_overlap);
    rotate([0,90,180-angle])
        cylinder(r = rod_diameter, h = tube_length + cut_overlap);
}

difference() {
    solid();
    hole_cutout();
}

Completed connector

We've finished modeling our kite connector! But what if our rod isn't 4mm in diameter? What if it's 1/8"? Since we've written a program to describe our kite connector, making the change is easy. We can change the parameters at the beginning of the file to change the shape of the connector. There are 25.4 millimeters in an inch, and OpenSCAD can do the math for us to convert from inches to millimeters. Let's change the rod diameter to 1/8" and also change the dihedral angle, so there's more of a visible change. Change the parameters at the top of the file and recompile (Design->Compile).

rod_diameter = 1/8 * 25.4; // inches to millimeters
wall_thickness = 2;
tube_length = 20;
angle = 20; // degrees

Different angle and rod diameter

Now you start to see the power of using a parametric model—making a new connector can be as simple as changing a few numbers and recompiling the design. (The repeated cylinder calls could also be factored into a single module; a sketch of that idea follows below.)
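As a hedged editorial aside (this refactor is ours, not part of the original tutorial, and the arm() module name is hypothetical), the four arms of both the solid and the cutout can be generated from one parameterized module, so the radius padding and rotations live in a single place:

// One arm of the cross; is_hole=true shrinks it to the rod cavity size.
module arm(rot, is_hole = false) {
    cut_overlap = 0.2;
    r = is_hole ? rod_diameter : rod_diameter + wall_thickness * 2;
    h = is_hole ? tube_length + cut_overlap : tube_length;
    rotate(rot) cylinder(r = r, h = h);
}

difference() {
    union() {
        arm([0, 0, 0]);
        arm([180, 0, 0]);          // points the cylinder downward
        arm([0, 90, angle]);
        arm([0, 90, 180 - angle]);
    }
    union() {
        arm([0, 0, 0], true);
        arm([180, 0, 0], true);
        arm([0, 90, angle], true);
        arm([0, 90, 180 - angle], true);
    }
}

Rotating 180 degrees about the X-axis produces the same downward cylinder as the translate([0,0,-tube_length]) used earlier, since a cylinder is symmetric about its own axis.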
To get the model ready for printing, change the rod diameter to the size of the rod you actually have, and change the angle back to 15 degrees. Now go to Design->Compile and Render so that the connector is fully rendered inside OpenSCAD. Go to File->Export->Export as STL and save the file. Open the .stl file in the software for your 3D printer. I have a Prusa i3 Berlin RepRap printer and use Cura as my printer software, but almost any printer/software combination should work. You may want to rotate the part so that it doesn't need as much support under the overhanging arms, but be aware that the orientation of the layers will affect the final strength (if the tube breaks, it's almost always from the layers splitting apart). It's worth experimenting with changing the orientation of the part to increase its strength. Orienting the layers slightly at an angle to the length of the tube seems to give the best strength.

Cura default part orientation

Part reoriented for printing, showing layers

Print your connector, and see if it fits on your rods. You may need to adjust the size a little to get a tight fit. Since the model is parametric, adjusting the size of the connector should just take a few minutes! To get the most strength in your print you can make multiple prints of the connector with different settings (temperature, wall thickness, and so on) and see how much force it takes to break them. This is a good technique in general for getting strong prints.

A printed connector

Kite dimensions

Now that we have the connector printed, we need to finish off the rest of the kite. You can see full instructions on making an Eddy kite, but here's the short version. I built this kite by taking a 1m long, 4mm diameter wooden rod from a kite store and cutting it into one 50cm piece and two 25cm pieces. The center connector goes 10cm from the top of the long rod. For the "sail", paper does fine (cut to fit the frame, making sure that the sail is symmetrical), and you can just tape the paper to the rods. Tie a piece of string about 80cm long between the center connector and a point 4cm from the tail to make a bridle. To find the right place to tie on the long flying line, take the kite out on a breezy day and hold it by the bridle, moving your hand up and down until you find a spot where the kite doesn't try to fly up too much, or fall back down. That's the spot to tie on your long flying line. If the kite is unstable while flying, you can add a long tail to the kite, but I haven't found it to be necessary (though it adds to the classic look).

Assembled kite

Back side of kite, showing printed connector

Being able to print your own kite parts makes it easy to experiment. If you want to try a different dihedral angle, just print a new center connector. It's quite a sight to see your kite flying high up in the sky, held together by a part you printed yourself. You can download a slightly different version of this code that includes additional bracing between the angled arms at http://www.thingiverse.com/thing:415345. For an idea of what's possible using the "polygon construction" technique, have a look at my Polygon Construction Kit for some examples of larger structures with multiple connectors. Happy flying!

Sky high

About the Author

Michael Ang is a Berlin-based artist and engineer working at the intersection of art, engineering, and the natural world. His latest project is the Polygon Construction Kit, a toolkit for bridging the virtual and physical worlds by translating simple 3D models into physical structures.

How greedy algorithms work

Richard Gall
10 Apr 2018
2 min read
What is a greedy algorithm?

Greedy algorithms are useful for optimization problems. They make the optimal choice at a localized and immediate level, with the aim of arriving at the overall optimal solution. It's important to note that they don't always find you the best solution for the data science problem you're trying to solve, so apply them wisely. In the video tutorial below, from Fundamental Algorithms in Scala, you'll learn when and how to apply a simple greedy algorithm, and see examples of both an iterative algorithm and a recursive algorithm in action.

The advantages and disadvantages of greedy algorithms

Greedy algorithms have a number of advantages and disadvantages. While on the one hand it's relatively easy to come up with them, it is actually pretty challenging to identify the issues around the 'correctness' of your algorithm. That means that ultimately the optimization problem you're trying to solve by using greedy algorithms isn't really a technical issue as such. Instead, it's more of an issue with the scope and definition of your data analysis project. It's a human problem, not a mechanical one.

Different ways to apply greedy algorithms

There are a number of areas where greedy algorithms can most successfully be applied. In fact, it's worth exploring some of these problems if you want to get to know them in more detail. They should give you a clearer indication of how they work, what makes them useful, and potential drawbacks. Some of the best examples are:

Huffman coding
Dijkstra's algorithm
Continuous knapsack problem

A short Scala sketch of the greedy approach follows.
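As a hedged illustration (this example is ours, not taken from the course), here is a minimal iterative greedy algorithm in Scala for making change with a canonical coin system. The correctness caveat from above applies: for non-canonical coin sets, this can return a suboptimal answer.

// Greedy change-making: always take the largest coin that still fits.
def makeChange(amount: Int, coins: List[Int]): List[Int] = {
  val sorted = coins.sorted(Ordering[Int].reverse)
  var remaining = amount
  val result = scala.collection.mutable.ListBuffer.empty[Int]
  for (coin <- sorted) {
    while (remaining >= coin) { // locally optimal choice at each step
      result += coin
      remaining -= coin
    }
  }
  result.toList
}

makeChange(63, List(1, 5, 10, 25)) // List(25, 25, 10, 1, 1, 1)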
To learn more about other algorithms, check out these articles:

4 popular algorithms for Distance-based outlier detection
10 machine learning algorithms every engineer needs to know
4 Clustering Algorithms every Data Scientist should know
Backpropagation Algorithm

To learn to implement specific algorithms, use these tutorials:

Creating a reference generator for a job portal using Breadth First Search (BFS) algorithm
Implementing gradient descent algorithm to solve optimization problems
Implementing face detection using the Haar Cascades and AdaBoost algorithm
Getting started with big data analysis using Google's PageRank algorithm
Implementing the k-nearest neighbors algorithm in Python
Machine Learning Algorithms: Implementing Naive Bayes with Spark MLlib

Understanding and Developing Node Modules

Packt
11 Aug 2011
5 min read
This article is based on the book Node Web Development, a practical introduction to Node, the exciting new server-side JavaScript web development stack.

What's a module?

Modules are the basic building block for constructing Node applications. We have already seen modules in action; every JavaScript file we use in Node is itself a module. It's time to see what they are and how they work. The following code, which pulls in the fs module, gives us access to its functions:

var fs = require('fs');

The require function searches for modules, and loads the module definition into the Node runtime, making its functions available. The fs object (in this case) contains the code (and data) exported by the fs module. Let's look at a brief example of this before we start diving into the details. Ponder over this module, simple.js:

var count = 0;
exports.next = function() { return count++; }

This defines an exported function and a local variable. Now let's use it:
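The REPL listing that demonstrated this was lost in the page conversion; a minimal reconstruction along these lines (the exact session is ours, not the book's) behaves as described in the next paragraph:

var s = require('./simple');
s.next(); // 0
s.next(); // 1
s.next(); // 2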
The object returned from require('./simple') is the same object, exports, we assigned a function to inside simple.js. Each call to s.next calls the function next in simple.js, which returns (and increments) the value of the count variable, explaining why s.next returns progressively bigger numbers. The rule is that anything (functions, objects) assigned as a field of exports is exported from the module, while objects inside the module that are not assigned to exports are not visible to any code outside the module. This is an example of encapsulation. Now that we've got a taste of modules, let's take a deeper look.

Node modules

Node's module implementation is strongly inspired by, but not identical to, the CommonJS module specification. The differences between them might only be important if you need to share code between Node and other CommonJS systems. A quick scan of the Modules/1.1.1 spec indicates that the differences are minor, and for our purposes it's enough to just get on with the task of learning to use Node without dwelling too long on the differences.

How does Node resolve require('module')?

In Node, modules are stored in files, one module per file. There are several ways to specify module names, and several ways to organize the deployment of modules in the file system. It's quite flexible, especially when used with npm, the de facto standard package manager for Node.

Module identifiers and path names

Generally speaking, the module name is a path name, but with the file extension removed. That is, when we write require('./simple'), Node knows to add .js to the file name and load in simple.js. Modules whose file names end in .js are of course expected to be written in JavaScript. Node also supports binary code native libraries as Node modules; in this case the file name extension to use is .node. It's outside our scope to discuss the implementation of a native code Node module, but this gives you enough knowledge to recognize them when you come across them. Some Node modules are not files in the file system, but are baked into the Node executable. These are the Core modules, the ones documented on nodejs.org. Their original existence is as files in the Node source tree, but the build process compiles them into the binary Node executable.

There are three types of module identifiers: relative, absolute, and top-level. Relative module identifiers begin with "./" or "../", and absolute identifiers begin with "/". These are identical to POSIX file system semantics, with path names being relative to the file being executed. Absolute module identifiers are obviously relative to the root of the file system. Top-level module identifiers do not begin with ".", "..", or "/"; instead, they are simply the module name. These modules are stored in one of several directories, such as a node_modules directory, or the directories listed in the require.paths array, designated by Node to hold these modules.

Local modules within your application

The universe of all possible modules splits neatly into two kinds: those modules that are part of a specific application, and those that aren't. Hopefully, the modules that aren't part of a specific application were written to serve a generalized purpose. Let's begin with the implementation of modules used within your application. Typically, your application will have a directory structure of module files sitting next to each other in the source control system, which is then deployed to servers. These modules will know the relative path to their sibling modules within the application, and should use that knowledge to refer to each other using relative module identifiers. For example, to help us understand this, let's look at the structure of an existing Node package, the Express web application framework. It includes several modules structured in a hierarchy that the Express developers found to be useful. You can imagine creating a similar hierarchy for applications reaching a certain level of complexity, subdividing the application into chunks larger than a module but smaller than an application. Unfortunately, there isn't a word to describe this in Node, so we're left with a clumsy phrase like "subdivide into chunks larger than a module". Each subdivided chunk would be implemented as a directory with a few modules in it. In this example, the most likely relative module reference is to utils.js. Depending on the source file which wants to use utils.js, it would use one of the following require statements:

var utils = require('./lib/utils');
var utils = require('./utils');
var utils = require('../utils');
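To make the three variants concrete, here is a hypothetical layout (invented for illustration; it is not the actual Express source tree) in which each of those require statements would be the right one:

app.js              // uses require('./lib/utils')
lib/
    utils.js
    router.js       // uses require('./utils')
    middleware/
        logger.js   // uses require('../utils')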

Contexts and Dependency Injection in NetBeans

Packt
06 Feb 2015
18 min read
In this article by David R. Heffelfinger, the author of Java EE 7 Development with NetBeans 8, we will introduce Contexts and Dependency Injection (CDI) and other aspects of it. CDI can be used to simplify integrating the different layers of a Java EE application. For example, CDI allows us to use a session bean as a managed bean, so that we can take advantage of EJB features, such as transactions, directly in our managed beans. In this article, we will cover the following topics:

Introduction to CDI
Qualifiers
Stereotypes
Interceptor binding types
Custom scopes

Introduction to CDI

JavaServer Faces (JSF) web applications employing CDI are very similar to JSF applications without CDI; the main difference is that instead of using JSF managed beans for our model and controllers, we use CDI named beans. What makes CDI applications easier to develop and maintain are the excellent dependency injection capabilities of the CDI API. Just as with other JSF applications, CDI applications use facelets as their view technology. The following example illustrates typical markup for a JSF page using CDI:

<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:h="http://xmlns.jcp.org/jsf/html">
    <h:head>
        <title>Create New Customer</title>
    </h:head>
    <h:body>
        <h:form>
            <h3>Create New Customer</h3>
            <h:panelGrid columns="3">
                <h:outputLabel for="firstName" value="First Name"/>
                <h:inputText id="firstName" value="#{customer.firstName}"/>
                <h:message for="firstName"/>
                <h:outputLabel for="middleName" value="Middle Name"/>
                <h:inputText id="middleName" value="#{customer.middleName}"/>
                <h:message for="middleName"/>
                <h:outputLabel for="lastName" value="Last Name"/>
                <h:inputText id="lastName" value="#{customer.lastName}"/>
                <h:message for="lastName"/>
                <h:outputLabel for="email" value="Email Address"/>
                <h:inputText id="email" value="#{customer.email}"/>
                <h:message for="email"/>
                <h:panelGroup/>
                <h:commandButton value="Submit"
                    action="#{customerController.navigateToConfirmation}"/>
            </h:panelGrid>
        </h:form>
    </h:body>
</html>

As we can see, the preceding markup doesn't look any different from the markup used for a JSF application that does not use CDI. The page renders as follows (shown after entering some data). In our page markup, we have JSF components that use Unified Expression Language expressions to bind themselves to CDI named bean properties and methods.
Let's take a look at the customer bean first:

package com.ensode.cdiintro.model;

import java.io.Serializable;
import javax.enterprise.context.RequestScoped;
import javax.inject.Named;

@Named
@RequestScoped
public class Customer implements Serializable {

    private String firstName;
    private String middleName;
    private String lastName;
    private String email;

    public Customer() {
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getMiddleName() {
        return middleName;
    }

    public void setMiddleName(String middleName) {
        this.middleName = middleName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    public String getEmail() {
        return email;
    }

    public void setEmail(String email) {
        this.email = email;
    }
}

The @Named annotation marks this class as a CDI named bean. By default, the bean's name will be the class name with its first character switched to lowercase (in our example, the name of the bean is "customer", since the class name is Customer). We can override this behavior if we wish by passing the desired name to the value attribute of the @Named annotation, as follows:

@Named(value="customerBean")

A CDI named bean's methods and properties are accessible via facelets, just like regular JSF managed beans. Just like JSF managed beans, CDI named beans can have one of several scopes, as listed in the following table. The preceding named bean has a scope of request, as denoted by the @RequestScoped annotation.

Request (@RequestScoped): Request scoped beans are shared through the duration of a single request. A single request could refer to an HTTP request, an invocation of a method in an EJB, a web service invocation, or sending a JMS message to a message-driven bean.
Session (@SessionScoped): Session scoped beans are shared across all requests in an HTTP session. Each user of an application gets their own instance of a session scoped bean.
Application (@ApplicationScoped): Application scoped beans live through the whole application lifetime. Beans in this scope are shared across user sessions.
Conversation (@ConversationScoped): The conversation scope can span multiple requests, and is typically shorter than the session scope.
Dependent (@Dependent): Dependent scoped beans are not shared. Any time a dependent scoped bean is injected, a new instance is created.

As we can see, CDI has equivalent scopes to all JSF scopes. Additionally, CDI adds two scopes. The first CDI-specific scope is the conversation scope, which allows us to have a scope that spans multiple requests but is shorter than the session scope. The second CDI-specific scope is the dependent scope, which is a pseudo scope. A CDI bean in the dependent scope is a dependent object of another object; beans in this scope are instantiated when the object they belong to is instantiated, and they are destroyed when the object they belong to is destroyed. (A brief sketch of starting and ending a conversation follows below.)
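Since the conversation scope is the one JSF developers are least likely to have seen, here is a hedged sketch of how a conversation is typically started and ended programmatically. This example is ours, not from the book; the CustomerWizard name is hypothetical, but Conversation.begin(), end(), and isTransient() are part of the standard CDI API:

package com.ensode.cdiintro.model;

import java.io.Serializable;
import javax.enterprise.context.Conversation;
import javax.enterprise.context.ConversationScoped;
import javax.inject.Inject;
import javax.inject.Named;

@Named
@ConversationScoped
public class CustomerWizard implements Serializable {

    @Inject
    private Conversation conversation;

    // Promote the conversation from transient to long-running;
    // the bean instance now survives across multiple requests.
    public void begin() {
        if (conversation.isTransient()) {
            conversation.begin();
        }
    }

    // End the long-running conversation; the bean is destroyed afterwards.
    public void end() {
        if (!conversation.isTransient()) {
            conversation.end();
        }
    }
}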
Our application has two CDI named beans. We already discussed the customer bean; the other CDI named bean in our application is the controller bean:

package com.ensode.cdiintro.controller;

import com.ensode.cdiintro.model.Customer;
import javax.enterprise.context.RequestScoped;
import javax.inject.Inject;
import javax.inject.Named;

@Named
@RequestScoped
public class CustomerController {

    @Inject
    private Customer customer;

    public Customer getCustomer() {
        return customer;
    }

    public void setCustomer(Customer customer) {
        this.customer = customer;
    }

    public String navigateToConfirmation() {
        //In a real application we would
        //save customer data to the database here.

        return "confirmation";
    }
}

In the preceding class, an instance of the Customer class is injected at runtime; this is accomplished via the @Inject annotation. This annotation allows us to easily use dependency injection in CDI applications. Since the Customer class is annotated with the @RequestScoped annotation, a new instance of Customer will be injected for every request.

The navigateToConfirmation() method in the preceding class is invoked when the user clicks on the Submit button on the page. The navigateToConfirmation() method works just like an equivalent method in a JSF managed bean would; that is, it returns a string and the application navigates to an appropriate page based on the value of that string. As with JSF, by default the target page's name, with an .xhtml extension, is the return value of this method. For example, if no exceptions are thrown in the navigateToConfirmation() method, the user is directed to a page named confirmation.xhtml:

<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:h="http://xmlns.jcp.org/jsf/html">
    <h:head>
        <title>Success</title>
    </h:head>
    <h:body>
        New Customer created successfully.
        <h:panelGrid columns="2" border="1" cellspacing="0">
            <h:outputLabel for="firstName" value="First Name"/>
            <h:outputText id="firstName" value="#{customer.firstName}"/>
            <h:outputLabel for="middleName" value="Middle Name"/>
            <h:outputText id="middleName" value="#{customer.middleName}"/>
            <h:outputLabel for="lastName" value="Last Name"/>
            <h:outputText id="lastName" value="#{customer.lastName}"/>
            <h:outputLabel for="email" value="Email Address"/>
            <h:outputText id="email" value="#{customer.email}"/>
        </h:panelGrid>
    </h:body>
</html>

Again, there is nothing special we need to do to access the named bean's properties from the preceding markup. It works just as if the bean were a JSF managed bean. As we can see, CDI applications work just like JSF applications. However, CDI applications have several advantages over JSF; for example (as we mentioned previously), CDI beans have additional scopes not found in JSF. Additionally, using CDI allows us to decouple our Java code from the JSF API. Also, as we mentioned previously, CDI allows us to use session beans as named beans.

Qualifiers

In some instances, the type of bean we wish to inject into our code may be an interface or a Java superclass, but we may be interested in injecting a subclass or a class implementing the interface. For cases like this, CDI provides qualifiers we can use to indicate the specific type we wish to inject into our code.
A CDI qualifier is an annotation that must be decorated with the @Qualifier annotation. This annotation can then be used to decorate the specific subclass or interface. In this section, we will develop a Premium qualifier for our customer bean; premium customers could get perks that are not available to regular customers, for example, discounts.

Creating a CDI qualifier with NetBeans is very easy; all we need to do is go to File | New File, select the Contexts and Dependency Injection category, and select the Qualifier Type file type. In the next step in the wizard, we need to enter a name and a package for our qualifier. After these two simple steps, NetBeans generates the code for our qualifier:

package com.ensode.cdiintro.qualifier;

import static java.lang.annotation.ElementType.TYPE;
import static java.lang.annotation.ElementType.FIELD;
import static java.lang.annotation.ElementType.PARAMETER;
import static java.lang.annotation.ElementType.METHOD;
import static java.lang.annotation.RetentionPolicy.RUNTIME;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;
import javax.inject.Qualifier;

@Qualifier
@Retention(RUNTIME)
@Target({METHOD, FIELD, PARAMETER, TYPE})
public @interface Premium {
}

Qualifiers are standard Java annotations. Typically, they have a retention of runtime and can target methods, fields, parameters, or types. The only difference between a qualifier and a standard annotation is that qualifiers are decorated with the @Qualifier annotation.

Once we have our qualifier in place, we need to use it to decorate the specific subclass or interface implementation, as shown in the following code:

package com.ensode.cdiintro.model;

import com.ensode.cdiintro.qualifier.Premium;
import javax.enterprise.context.RequestScoped;
import javax.inject.Named;

@Named
@RequestScoped
@Premium
public class PremiumCustomer extends Customer {

    private Integer discountCode;

    public Integer getDiscountCode() {
        return discountCode;
    }

    public void setDiscountCode(Integer discountCode) {
        this.discountCode = discountCode;
    }
}

Once we have decorated the specific instance we need to qualify, we can use our qualifier in the client code to specify the exact type of dependency we need:

package com.ensode.cdiintro.controller;

import com.ensode.cdiintro.model.Customer;
import com.ensode.cdiintro.model.PremiumCustomer;
import com.ensode.cdiintro.qualifier.Premium;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.enterprise.context.RequestScoped;
import javax.inject.Inject;
import javax.inject.Named;

@Named
@RequestScoped
public class PremiumCustomerController {

    private static final Logger logger = Logger.getLogger(
            PremiumCustomerController.class.getName());

    @Inject
    @Premium
    private Customer customer;

    public String saveCustomer() {

        PremiumCustomer premiumCustomer = (PremiumCustomer) customer;

        logger.log(Level.INFO, "Saving the following information \n"
                + "{0} {1}, discount code = {2}",
                new Object[]{premiumCustomer.getFirstName(),
                    premiumCustomer.getLastName(),
                    premiumCustomer.getDiscountCode()});

        //If this was a real application, we would have code to save
        //customer data to the database here.
return "premium_customer_confirmation";    } } Since we used our @Premium qualifier to decorate the customer field, an instance of the PremiumCustomer class is injected into that field. This is because this class is also decorated with the @Premium qualifier. As far as our JSF pages go, we simply access our named bean as usual using its name, as shown in the following code; <?xml version='1.0' encoding='UTF-8' ?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html      >    <h:head>        <title>Create New Premium Customer</title>    </h:head>    <h:body>        <h:form>            <h3>Create New Premium Customer</h3>            <h:panelGrid columns="3">                <h:outputLabel for="firstName" value="First Name"/>                 <h:inputText id="firstName"                    value="#{premiumCustomer.firstName}"/>                <h:message for="firstName"/>                  <h:outputLabel for="middleName" value="Middle Name"/>                <h:inputText id="middleName"                     value="#{premiumCustomer.middleName}"/>                <h:message for="middleName"/>                  <h:outputLabel for="lastName" value="Last Name"/>                <h:inputText id="lastName"                    value="#{premiumCustomer.lastName}"/>                <h:message for="lastName"/>                  <h:outputLabel for="email" value="Email Address"/>                <h:inputText id="email"                    value="#{premiumCustomer.email}"/>                <h:message for="email"/>                  <h:outputLabel for="discountCode" value="Discount Code"/>                <h:inputText id="discountCode"                    value="#{premiumCustomer.discountCode}"/>                <h:message for="discountCode"/>                   <h:panelGroup/>                <h:commandButton value="Submit"                      action="#{premiumCustomerController.saveCustomer}"/>            </h:panelGrid>        </h:form>    </h:body> </html> In this example, we are using the default name for our bean, which is the class name with the first letter switched to lowercase. Now, we are ready to test our application: After submitting the page, we can see the confirmation page. Stereotypes A CDI stereotype allows us to create new annotations that bundle up several CDI annotations. For example, if we need to create several CDI named beans with a scope of session, we would have to use two annotations in each of these beans, namely @Named and @SessionScoped. Instead of having to add two annotations to each of our beans, we could create a stereotype and annotate our beans with it. To create a CDI stereotype in NetBeans, we simply need to create a new file by selecting the Contexts and Dependency Injection category and the Stereotype file type. Then, we need to enter a name and package for our new stereotype. 
At this point, NetBeans generates the following code:

package com.ensode.cdiintro.stereotype;

import static java.lang.annotation.ElementType.TYPE;
import static java.lang.annotation.ElementType.FIELD;
import static java.lang.annotation.ElementType.METHOD;
import static java.lang.annotation.RetentionPolicy.RUNTIME;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;
import javax.enterprise.inject.Stereotype;

@Stereotype
@Retention(RUNTIME)
@Target({METHOD, FIELD, TYPE})
public @interface NamedSessionScoped {
}

Now, we simply need to add the CDI annotations that we want the classes annotated with our stereotype to use. In our case, we want them to be named beans and have a scope of session; therefore, we add the @Named and @SessionScoped annotations as shown in the following code:

package com.ensode.cdiintro.stereotype;

import static java.lang.annotation.ElementType.TYPE;
import static java.lang.annotation.ElementType.FIELD;
import static java.lang.annotation.ElementType.METHOD;
import static java.lang.annotation.RetentionPolicy.RUNTIME;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;
import javax.enterprise.context.SessionScoped;
import javax.enterprise.inject.Stereotype;
import javax.inject.Named;

@Named
@SessionScoped
@Stereotype
@Retention(RUNTIME)
@Target({METHOD, FIELD, TYPE})
public @interface NamedSessionScoped {
}

Now we can use our stereotype in our own code:

package com.ensode.cdiintro.beans;

import com.ensode.cdiintro.stereotype.NamedSessionScoped;
import java.io.Serializable;

@NamedSessionScoped
public class StereotypeClient implements Serializable {

    private String property1;
    private String property2;

    public String getProperty1() {
        return property1;
    }

    public void setProperty1(String property1) {
        this.property1 = property1;
    }

    public String getProperty2() {
        return property2;
    }

    public void setProperty2(String property2) {
        this.property2 = property2;
    }
}

We annotated the StereotypeClient class with our NamedSessionScoped stereotype, which is equivalent to using the @Named and @SessionScoped annotations.

Interceptor binding types

One of the advantages of EJBs is that they allow us to easily perform aspect-oriented programming (AOP) via interceptors. CDI allows us to write interceptor binding types; these let us bind interceptors to beans without the beans having to depend on the interceptor directly. Interceptor binding types are annotations that are themselves annotated with @InterceptorBinding. Creating an interceptor binding type in NetBeans involves creating a new file, selecting the Contexts and Dependency Injection category, and selecting the Interceptor Binding Type file type. Then, we need to enter a class name and select or enter a package for our new interceptor binding type. At this point, NetBeans generates the code for our interceptor binding type:

package com.ensode.cdiintro.interceptorbinding;

import static java.lang.annotation.ElementType.TYPE;
import static java.lang.annotation.ElementType.METHOD;
import static java.lang.annotation.RetentionPolicy.RUNTIME;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;
import javax.interceptor.InterceptorBinding;

@Inherited
@InterceptorBinding
@Retention(RUNTIME)
@Target({METHOD, TYPE})
public @interface LoggingInterceptorBinding {
}

The generated code is fully functional; we don't need to add anything to it.
In order to use our interceptor binding type, we need to write an interceptor and annotate it with our interceptor binding type, as shown in the following code:

package com.ensode.cdiintro.interceptor;

import com.ensode.cdiintro.interceptorbinding.LoggingInterceptorBinding;
import java.io.Serializable;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.interceptor.AroundInvoke;
import javax.interceptor.Interceptor;
import javax.interceptor.InvocationContext;

@LoggingInterceptorBinding
@Interceptor
public class LoggingInterceptor implements Serializable {

    private static final Logger logger = Logger.getLogger(
            LoggingInterceptor.class.getName());

    @AroundInvoke
    public Object logMethodCall(InvocationContext invocationContext)
            throws Exception {

        logger.log(Level.INFO, new StringBuilder("entering ").append(
                invocationContext.getMethod().getName()).append(
                " method").toString());

        Object retVal = invocationContext.proceed();

        logger.log(Level.INFO, new StringBuilder("leaving ").append(
                invocationContext.getMethod().getName()).append(
                " method").toString());

        return retVal;
    }
}

As we can see, other than being annotated with our interceptor binding type, the preceding class is a standard interceptor similar to the ones we use with EJB session beans. In order for our interceptor binding type to work properly, we need to add a CDI configuration file (beans.xml) to our project. Then, we need to register our interceptor in beans.xml as follows:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://xmlns.jcp.org/xml/ns/javaee"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee
                           http://xmlns.jcp.org/xml/ns/javaee/beans_1_1.xsd"
       bean-discovery-mode="all">
    <interceptors>
        <class>
            com.ensode.cdiintro.interceptor.LoggingInterceptor
        </class>
    </interceptors>
</beans>

To register our interceptor, we need to set bean-discovery-mode to all in the generated beans.xml and add the <interceptors> tag, with one or more nested <class> tags containing the fully qualified names of our interceptors. The final step before we can use our interceptor binding type is to annotate the class to be intercepted with our interceptor binding type:

package com.ensode.cdiintro.controller;

import com.ensode.cdiintro.interceptorbinding.LoggingInterceptorBinding;
import com.ensode.cdiintro.model.Customer;
import com.ensode.cdiintro.model.PremiumCustomer;
import com.ensode.cdiintro.qualifier.Premium;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.enterprise.context.RequestScoped;
import javax.inject.Inject;
import javax.inject.Named;

@LoggingInterceptorBinding
@Named
@RequestScoped
public class PremiumCustomerController {

    private static final Logger logger = Logger.getLogger(
            PremiumCustomerController.class.getName());

    @Inject
    @Premium
    private Customer customer;

    public String saveCustomer() {

        PremiumCustomer premiumCustomer = (PremiumCustomer) customer;

        logger.log(Level.INFO, "Saving the following information \n"
                + "{0} {1}, discount code = {2}",
                new Object[]{premiumCustomer.getFirstName(),
                    premiumCustomer.getLastName(),
                    premiumCustomer.getDiscountCode()});

        //If this was a real application, we would have code to save
        //customer data to the database here.

        return "premium_customer_confirmation";
    }
}

Now, we are ready to use our interceptor.
After executing the preceding code and examining the GlassFish log, we can see our interceptor binding type in action. The lines "entering saveCustomer method" and "leaving saveCustomer method" were added to the log by our interceptor, which was indirectly invoked via our interceptor binding type.

Custom scopes

In addition to providing several prebuilt scopes, CDI allows us to define our own custom scopes. This functionality is primarily meant for developers building frameworks on top of CDI, not for application developers. Nevertheless, NetBeans provides a wizard for us to create our own CDI custom scopes. To create a new CDI custom scope, we need to go to File | New File, select the Contexts and Dependency Injection category, and select the Scope Type file type. Then, we need to enter a package and a name for our custom scope. After clicking on Finish, our new custom scope is created, as shown in the following code:

package com.ensode.cdiintro.scopes;

import static java.lang.annotation.ElementType.TYPE;
import static java.lang.annotation.ElementType.FIELD;
import static java.lang.annotation.ElementType.METHOD;
import static java.lang.annotation.RetentionPolicy.RUNTIME;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;
import javax.inject.Scope;

@Inherited
@Scope // or @javax.enterprise.context.NormalScope
@Retention(RUNTIME)
@Target({METHOD, FIELD, TYPE})
public @interface CustomScope {
}

To actually use our scope in our CDI applications, we would need to create a custom context which, as mentioned previously, is primarily a concern for framework developers and not for Java EE application developers. Therefore, it is beyond the scope of this article. Interested readers can refer to JBoss Weld CDI for Java Platform, Ken Finnigan, Packt Publishing. (JBoss Weld is a popular CDI implementation and it is included with GlassFish.)

Summary

In this article, we covered NetBeans support for CDI, a Java EE API introduced in Java EE 6. We provided an introduction to CDI and explained the additional functionality that the CDI API provides over standard JSF. We also covered how to disambiguate CDI injected beans via CDI qualifiers, and how to group together CDI annotations via CDI stereotypes. We also saw how CDI can help us with AOP via interceptor binding types. Finally, we covered how NetBeans can help us create custom CDI scopes.

Resources for Article:
Further resources on this subject:
Java EE 7 Performance Tuning and Optimization [article]
Java EE 7 Developer Handbook [article]
Java EE 7 with GlassFish 4 Application Server [article]

Technical and hidden debts in machine learning - Google engineers give their perspective

Prasad Ramesh
06 Nov 2018
6 min read
In a paper, Google engineers have pointed out the various costs of maintaining a machine learning system. The paper, Hidden Technical Debt in Machine Learning Systems, talks about technical debt and other ML-specific debts that are hidden or hard to detect. They found that it is common to incur massive maintenance costs in real-world machine learning systems. They looked at several ML-specific risk factors to account for in system design. These factors include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a number of system-level anti-patterns.

Boundary erosion in complex models

In traditional software engineering, setting strict abstraction boundaries helps preserve logical consistency between the inputs and outputs of a given component. It is difficult to set these boundaries in machine learning systems. Yet, machine learning is needed precisely in areas where the desired behavior cannot be effectively expressed with traditional software logic without depending on data. This results in boundary erosion in a couple of areas.

Entanglement

Machine learning systems mix signals together and entangle them, making isolated improvements impossible. A change to one input changes the influence of all the other inputs, so no improvement can be made in isolation. This is referred to as the CACE principle: Change Anything Changes Everything. There are two possible ways to mitigate this:

Isolate models and serve ensembles. This is useful in situations where the sub-problems decompose naturally. In many cases, ensembles work well because the errors in the component models are not correlated. However, relying on this combination creates a strong entanglement of its own, and improving an individual model may make the overall system less accurate.
Another strategy is to focus on detecting changes in prediction behavior as they occur.

Correction cascades

There are cases where a problem is only slightly different from another that already has a solution. It can be tempting to reuse the existing model, learning a small correction on top of it as a fast way to solve the newer problem. However, this correction model creates a new system dependency on the original model, which makes it significantly more expensive to analyze improvements to either model in the future. The cost increases when correction models are cascaded, and a correction cascade can create an improvement deadlock.

Visibility debt caused by undeclared consumers

A model is often made widely accessible and may later be consumed by other systems. Without access controls, these consumers may be undeclared, silently using the output of a given model as an input to another system. These issues are referred to as visibility debt. Undeclared consumers may also create hidden feedback loops.

Data dependencies cost more than code dependencies

Data dependencies carry a similar capacity for building debt as code dependencies, but they are more difficult to detect. Without proper tooling to identify them, data dependencies can form large chains that are difficult to untangle. They come in two types.

Unstable data dependencies

To move quickly, it is often convenient to consume signals from other systems as input to your own. But some input signals are unstable: they can change behavior qualitatively or quantitatively over time. This can happen implicitly, as the other system updates over time, or through explicit changes. A mitigation strategy is to create versioned copies.
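To make the versioned-copies idea concrete, here is a minimal Python sketch. The feature-store API and names are hypothetical illustrations (the paper prescribes no particular interface): the point is that the consuming model pins the signal version it was validated against instead of reading whatever is currently deployed upstream.

# Hypothetical feature store: each upstream signal is published under an
# explicit, frozen version rather than a single mutable "latest" mapping.
SIGNAL_STORE = {
    ("topic_clusters", "v1"): lambda text: {"sports": 0.8, "news": 0.2},
    ("topic_clusters", "v2"): lambda text: {"sports": 0.5, "news": 0.5},
}

def get_signal(name, version):
    # Consumers must name the version they were validated against, so a
    # retrain that publishes v2 upstream cannot silently change this
    # model's inputs.
    return SIGNAL_STORE[(name, version)]

# The downstream model keeps using v1 until it is re-validated against v2.
topic_features = get_signal("topic_clusters", "v1")
print(topic_features("match report from last night"))

The cost of this approach, which the paper acknowledges, is that versioned copies bring their own staleness and maintenance burden, so pinned versions still need an owner and a retirement plan.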
Underutilized data dependencies

Underutilized data dependencies are input signals that provide little incremental modeling benefit. They make an ML system vulnerable to change where change is not necessary. Underutilized data dependencies can creep into a model in several ways: via legacy, bundled, epsilon, or correlated features.

Feedback loops

Live ML systems often end up influencing their own behavior as they are updated over time. This leads to analysis debt: it becomes difficult to predict the behavior of a given model before it is released. These feedback loops are difficult to detect and address if they occur gradually over time, which may be the case if the model is not updated frequently. A direct feedback loop is one in which a model directly influences the selection of its own future training data. In a hidden feedback loop, two systems influence each other indirectly.

Machine learning system anti-patterns

It is common for systems that incorporate machine learning methods to end up with high-debt design patterns:

Glue code: Relying on generic packages results in a glue code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages.
Pipeline jungles: Pipeline jungles often appear in data preparation as a special case of glue code. They can evolve organically as new sources are added, and the result can become a jungle of scrapes, joins, and sampling steps.
Dead experimental codepaths: Experimental codepaths are attractive in the short term because none of the surrounding structures need to be reworked. Over time, these accumulated codepaths create a growing debt due to the increasing difficulty of maintaining backward compatibility.
Abstraction debt: There is a lack of support for strong abstractions in ML systems.
Common smells: A smell may indicate an underlying problem in a component or system. These can be data smells, multiple-language smells, or prototype smells.

Configuration debt

Debt can also accumulate when configuring a machine learning system. A large system has a wide range of configuration options with respect to features, data selection, verification methods, and so on. It is common for configuration to be treated as an afterthought. In a mature system, the number of configuration lines can exceed the number of code lines, and each configuration line has potential for mistakes.

Dealing with external world changes

ML systems interact directly with the external world, and the external world is rarely stable. Some measures that can be taken to deal with this instability are:

Fixing thresholds in dynamic systems

It is necessary to pick a decision threshold for a given model to perform some action: to predict true or false, to mark an email as spam or not spam, to show or not show a given advertisement.

Monitoring and testing

Unit testing and end-to-end testing cannot by themselves ensure the proper functioning of an ML system. For long-term system reliability, comprehensive live monitoring and automated response are critical. That raises the question of what to monitor; the authors of the paper point out three areas as starting points: prediction bias, limits for actions, and upstream producers.

Other related areas in ML debt

In addition to the areas mentioned, an ML system may also face debt from other areas. These include data testing debt, reproducibility debt, process management debt, and cultural debt.

Conclusion

Moving quickly often introduces technical debt.
The most important insight from this paper, according to the authors, is that technical debt is an issue that both engineers and researchers need to be aware of. Paying down machine learning related technical debt requires commitment, which can often only be achieved by a shift in team culture. Prioritizing, recognizing, and rewarding this effort is important for the long-term health of successful machine learning teams. For more details, you can read the paper at the NIPS website.

Uses of Machine Learning in Gaming
Julia for machine learning. Will the new language pick up pace?
Machine learning APIs for Google Cloud Platform

Map Reduce

Packt
08 Aug 2013
10 min read
Map-reduce is a technique that is used to take large quantities of data and farm it out for processing. A somewhat trivial example might be: given 1TB of HTTP log data, count the number of hits that come from a given country, and report those numbers. For example, if you have the log entries:

204.12.226.2 - - [09/Jun/2013:09:12:24 -0700] "GET /who-we-are HTTP/1.0" 404 471 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.3; http://www.majestic12.co.uk/bot.php?+)"
174.129.187.73 - - [09/Jun/2013:10:58:22 -0700] "GET /robots.txt HTTP/1.1" 404 452 "-" "CybEye.com/2.0 (compatible; MSIE 9.0; Windows NT 5.1; Trident/4.0; GTB6.4)"
157.55.35.37 - - [02/Jun/2013:23:31:01 -0700] "GET / HTTP/1.1" 200 483 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
206.183.1.74 - - [02/Jun/2013:18:24:35 -0700] "GET / HTTP/1.1" 200 482 "-" "Mozilla/4.0 (compatible; http://search.thunderstone.com/texis/websearch/about.html)"
1.202.218.21 - - [02/Jun/2013:17:38:20 -0700] "GET /robots.txt HTTP/1.1" 404 471 "-" "Mozilla/5.0 (compatible; JikeSpider; +http://shoulu.jike.com/spider.html)"

Then the answer to the question would be as follows:

US: 4
China: 1

Clearly this example dataset does not warrant distributing the data processing among multiple machines, but imagine if instead of five rows of log data we had twenty-five billion rows. If your program took a single computer half a second to process five records, it would take a little short of eighty years to process twenty-five billion records. To solve this, we could break up the data into smaller chunks, process those smaller chunks, and rejoin the results when we were finished. To apply this to a slightly larger dataset, imagine you extrapolated these five records to one hundred records and then split those one hundred records into five groups, each containing twenty records. From those five groups we might compute the following results:

Group 1     Group 2      Group 3     Group 4       Group 5
US 5        Mexico 2     US 15       Italy 1       Finland 5
Greece 4    Scotland 6   China 2     Greece 4      China 5
Ireland 8   Canada 9     Finland 3   Scotland 10   US 10
Canada 3    Ireland 3                US 5

If we were to combine these data points by using the country name as a key and store them in a map, adding each value to any existing value, we would get the count per country across all one hundred records. Using Ruby, we can write a simple program to do this, first without using Gearman, and then with it. To demonstrate this, we will write the following:

A simple library that we can use in our non-distributed program and in our Gearman-enabled programs
An example program that demonstrates using the library
A client that uses the library to split up our data and submit jobs to our manager
A worker that uses the library to process the job requests and return the results

The shared library

First we will develop a library that we can reuse. This will demonstrate that you can reuse existing logic to quickly take advantage of Gearman because it ensures the following things:

The program, client, and worker are much simpler, so we can see what's going on in them
The behavior between our program, client, and worker is guaranteed to be consistent

The shared library will have two methods, map_data and reduce_data. The map_data method will be responsible for splitting up the data into chunks to be processed, and the reduce_data method will process those chunks of data and return something that can be merged together into an accurate answer.
Take the following example, and save it to a file named functions.rb for later use:

#!/bin/env ruby

# Generate sub-lists of the data
# each sub-list has size = blocksize
def map_data(lines, blocksize)
  blocks = []
  counter = 0
  block = []
  lines.each do |line|
    if (counter >= blocksize)
      blocks << block
      block = []
      counter = 0
    end
    block << line
    counter += 1
  end
  blocks << block if block.size > 0
  blocks
end

# Extract the number of times we see a unique line
# Result is a hash with key = line, value = count
def reduce_data(lines)
  results = {}
  lines.each do |line|
    results[line] ||= 0
    results[line] += 1
  end
  results
end

A simple program

To use this library, we can write a very simple program that demonstrates the functionality:

require './functions.rb'

countries = ["china", "us", "greece", "italy"]
lines = []
results = {}

(1..100).each { |i| lines << countries[i % 4] }

blocks = map_data(lines, 20)
blocks.each do |block|
  reduce_data(block).each do |k,v|
    results[k] ||= 0
    results[k] += v
  end
end

puts results.inspect

Put the contents of this example into a Ruby source file named mapreduce.rb, in the same directory as you placed your functions.rb file, and execute it with the following:

[user@host:$] ruby ./mapreduce.rb

This script will generate a list with one hundred elements in it. Since there are four distinct elements, each will appear 25 times, as the following output shows:

{"us"=>25, "greece"=>25, "italy"=>25, "china"=>25}

Following in this vein, we can add in Gearman to extend our example to operate using a client that submits jobs and a single worker that will process the jobs serially to generate the same results. The reason we wrote these methods in a separate module from the driver application was to make them reusable in this fashion.

The client

The following code for the client in this example will be responsible for the mapping phase: it will split apart the data and submit a job for each block of data it needs processed. In this example worker/client setup, we are using JSON as a simple way to serialize/deserialize the data being sent back and forth:

require 'rubygems'
require 'gearman'
require 'json'
require './functions.rb'

client = Gearman::Client.new('localhost:4730')
taskset = Gearman::TaskSet.new(client)

countries = ["china", "us", "greece", "italy"]
jobcount = 1
lines = []
results = {}

(1..100).each { |i| lines << countries[i % 4] }

blocks = map_data(lines, 20)
blocks.each do |block|
  # Generate a task with a unique id
  uniq = rand(36**8).to_s(36)
  task = Gearman::Task.new('count_countries',
    JSON.dump(block), :uniq => uniq)

  # When the task is complete, add its results into ours
  task.on_complete do |d|
    # We are passing data back and forth as JSON, so
    # decode it to a hash and then iterate over the
    # k=>v pairs
    JSON.parse(d).each do |k,v|
      results[k] ||= 0
      results[k] += v
    end
  end

  taskset.add_task(task)
  puts "Submitted job #{jobcount}"
  jobcount += 1
end

puts "Submitted all jobs, waiting for results."
start_time = Time.now
taskset.wait(100)
time_diff = (Time.now - start_time).to_i
puts "Took #{time_diff} seconds: #{results.inspect}"

This client uses a few new concepts that were not used in the introductory examples, namely task sets and unique identifiers. In the Ruby client, a task set is a group of tasks that are submitted together and can be waited upon collectively.
To generate a task set, you construct it by giving it the client that you want to submit the task set with:

taskset = Gearman::TaskSet.new(client)

Then you can create and add tasks to the task set:

task = Gearman::Task.new('count_countries',
  JSON.dump(block), :uniq => uniq)
taskset.add_task(task)

Finally, you tell the task set how long you want to wait for the results:

taskset.wait(100)

This will block the program until the timeout passes or all the tasks in the task set complete (again, complete does not necessarily mean that the worker succeeded at the task, but that it saw it through to completion). In this example, it will wait 100 seconds for all the tasks to complete before giving up on them. This doesn't mean that the jobs won't complete if the client disconnects, just that the client won't see the end results (which may or may not be acceptable).

The worker

To complete the distributed MapReduce example, we need to implement the worker that is responsible for performing the actual data processing. The worker will perform the following tasks:

Receive a list of countries serialized as JSON from the manager
Decode that JSON data into a Ruby structure
Perform the reduce operation on the data, converting the list of countries into a corresponding hash of counts
Serialize the hash of counts as a JSON string
Return the JSON string to the manager (to be passed on to the client)

require 'rubygems'
require 'gearman'
require 'json'
require './functions.rb'

Gearman::Util.logger.level = Logger::DEBUG

@servers = ['localhost:4730']

w = Gearman::Worker.new(@servers)
w.add_ability('count_countries') do |json_data,job|
  puts "Received: #{json_data}"
  data = JSON.parse(json_data)
  result = reduce_data(data)
  puts "Result: #{result.inspect}"
  returndata = JSON.dump(result)
  puts "Returning #{returndata}"
  sleep 4
  returndata
end
loop { w.work }

Notice that we have introduced a slight delay in returning the results by instructing our worker to sleep for four seconds before returning the data. This is here in order to simulate a job that takes a while to process. To run this example, we will repeat the exercise from the first section. Save the contents of the client to a file called mapreduce_client.rb, and the contents of the worker to a file named mapreduce_worker.rb, in the same directory as the functions.rb file. Then, start the worker first by running the following:

ruby mapreduce_worker.rb

And then start the client by running the following:

ruby mapreduce_client.rb

When you run these scripts, the worker will be waiting to pick up jobs, and then the client will generate five jobs, each with a block containing a list of countries to be counted, and submit them to the manager. These jobs will be picked up by the worker and then processed, one at a time, until they are all complete. As a result, there will be a twenty second difference between when the jobs are submitted and when they are completed.

Parallelizing the pipeline

Implementing the solution this way clearly doesn't gain us much performance over the original example. In fact, it is going to be slower (even ignoring the four second sleep inside each job execution) than the original because there is time involved in serializing and deserializing the data, as well as in transmitting the data and the results between the actors. The goal of this exercise is to demonstrate building a system that can increase the number of workers and parallelize the processing of data, which we will see in the following exercise.
To demonstrate the power of parallel processing, we can now run two copies of the worker. Simply open a new shell and execute the worker via ruby mapreduce_worker.rb; this will spin up a second copy of the worker, ready to process jobs. Now, run the client a second time and observe the behavior. You will see that the client completes in twelve seconds instead of twenty. Why not ten? Remember that we submitted five jobs, and each takes four seconds. Five jobs do not divide evenly between two workers, so one worker will acquire three jobs instead of two, which will take it an additional four seconds to complete:

[user@host]% ruby mapreduce_client.rb
Submitted job 1
Submitted job 2
Submitted job 3
Submitted job 4
Submitted job 5
Submitted all jobs, waiting for results.
Took 12 seconds: {"us"=>25, "greece"=>25, "italy"=>25, "china"=>25}

Feel free to experiment with the various parameters of the system, such as running more workers, increasing the number of records that are being processed, or adjusting the amount of time that the worker sleeps during a job. While this example does not involve processing enormous quantities of data, hopefully you can see how it can be expanded for future growth.

Summary

In this article, we discussed the MapReduce technique. We hope this article gives you a glimpse of how the book flows.

Resources for Article:
Further resources on this subject:
BPMN 2.0 Concepts and The Sales Quote Process [Article]
Simplifying Parallelism Complexity in C# [Article]
Oracle BPM Suite 11gR1: Creating a BPM Application [Article]

Ruby Strings

Packt
06 Jul 2017
9 min read
In this article by Jordan Hudgens, the author of the book Comprehensive Ruby Programming, you'll learn about the Ruby String data type and walk through how to integrate string data into a Ruby program. Working with words, sentences, and paragraphs is a common requirement in many applications. Additionally, you will learn how to:

Employ string manipulation techniques using core Ruby methods
Demonstrate how to work with the string data type in Ruby

Using strings in Ruby

A string is a data type in Ruby that contains a set of characters, typically normal English text (or whatever natural language you're building your program for). A key point for the syntax of strings is that they have to be enclosed in single or double quotes if you want to use them in a program. The program will throw an error if they are not wrapped inside quotation marks. Let's walk through three scenarios.

Missing quotation marks

In this code I tried to simply declare a string without wrapping it in quotation marks. As you can see, this results in an error. The error occurs because Ruby thinks that the values are classes and methods.

Printing strings

In this code snippet we're printing out a string that we have properly wrapped in quotation marks. Please note that both single and double quotation marks work properly. It's also important that you do not mix the quotation mark types. For example, if you attempted to run the code:

puts "Name an animal'

You would get an error, because you need to ensure that every quotation mark is matched with a closing (and matching) quotation mark. If you start a string with double quotation marks, the Ruby parser requires that you end the string with the matching double quotation marks.

Storing strings in variables

Lastly, in this code snippet we're storing a string inside of a variable and then printing the value out to the console. We'll talk more about strings and string interpolation in subsequent sections.

String interpolation guide for Ruby

In this section, we are going to talk about string interpolation in Ruby.

What is string interpolation?

So what exactly is string interpolation? Good question. String interpolation is the process of seamlessly integrating dynamic values into a string. Let's assume we want to slip dynamic words into a string. We can get input from the console, store that input in variables, and then call the variables inside of a pre-existing string. For example, let's give a sentence the ability to change based on a user's input:

puts "Name an animal"
animal = gets.chomp
puts "Name a noun"
noun = gets.chomp

p "The quick brown #{animal} jumped over the lazy #{noun}"

Note the way I insert variables inside the string? They are enclosed in curly brackets and are preceded by a # sign. If I run this code, the sentence is printed out with the user's words filled in. So, this is how you insert values dynamically in your sentences. If you look at sites like Twitter, they sometimes display personalized messages such as: Good morning Jordan or Good evening Tiffany. This type of behavior is made possible by inserting a dynamic value in a fixed part of a string, and it leverages string interpolation. Now, let's use single quotes instead of double quotes, to see what happens. As you'll see, the string is printed as it is, without inserting the values for animal and noun. This is exactly what happens when you use single quotes: Ruby prints the entire string as it is, without any interpolation.
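As a quick runnable illustration of the difference (you can paste this into irb), the same interpolation syntax behaves differently under the two quoting styles:

name = "Jordan"

puts "Good morning #{name}"  # double quotes interpolate: Good morning Jordan
puts 'Good morning #{name}'  # single quotes do not: Good morning #{name}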
Therefore, it's important to remember the difference. Another interesting aspect is that anything inside the curly brackets can be Ruby code. So, technically you could type an entire algorithm inside these curly brackets, and Ruby would run it perfectly for you. However, this is not recommended for practical programming purposes. For example, I can insert a math equation inside the brackets, and Ruby prints the computed value out.

String manipulation guide

In this section we are going to learn about string manipulation, along with a number of examples of how to integrate string manipulation methods in a Ruby program.

What is string manipulation?

So what exactly is string manipulation? It's the process of altering the format or value of a string, usually by leveraging string methods.

String manipulation code examples

Let's start with an example. Let's say I want my application to always display the word Astros in capital letters. To do that, I simply write:

"Astros".upcase

Now if I always want a string to be in lowercase letters, I can use the downcase method, like so:

"Astros".downcase

Those are both methods I use quite often. However, there are other string methods at our disposal. For the rare times when you want to literally swap the case of the letters, you can leverage the swapcase method:

"Astros".swapcase

And lastly, if you want to reverse the order of the letters in the string, you can call the reverse method:

"Astros".reverse

These methods are built into the String data class, and we can call them on any string value in Ruby.

Method chaining

Another neat thing we can do is join different methods together to get custom output. For example, I can run:

"Astros".reverse.upcase

The preceding code displays the value SORTSA. This practice of combining different methods with a dot is called method chaining.

Split, strip, and join guides for strings

In this section, we are going to walk through how to use the split and strip methods in Ruby. These methods will help us clean up strings and convert a string to an array, so we can access each word as its own value.

Using the strip method

Let's start off by analyzing the strip method. Imagine that the input you get from the user or from the database is poorly formatted and contains white space before and after the value. To clean the data up we can use the strip method. For example:

str = " The quick brown fox jumped over the quick dog "
p str.strip

When you run this code, the output is just the sentence, without the white space before and after the words.

Using the split method

Now let's walk through the split method. The split method is a powerful tool that allows you to split a sentence into an array of words or characters. For example, when you type the following code:

str = "The quick brown fox jumped over the quick dog"
p str.split

You'll see that it converts the sentence into an array of words. This method can be particularly useful for long paragraphs, especially when you want to know the number of words in the paragraph. Since the split method converts the string into an array, you can use all the array methods, such as size, to see how many words were in the string. We can leverage method chaining to find out how many words are in the string, like so:

str = "The quick brown fox jumped over the quick dog"
p str.split.size

This should return a value of 9, which is the number of words in the sentence.
To know the number of letters, we can pass an optional argument to the split method and use the format:

str = "The quick brown fox jumped over the quick dog"
p str.split(//).size

And if you want to see all of the individual letters, we can remove the size method call, like this:

p str.split(//)

Your output will be an array containing every individual character in the string. Notice that it also includes spaces as individual characters, which may or may not be what you want a program to return. This method can be quite handy while developing real-world applications. A good practical example of this method is Twitter. Since this social media site restricts users to 140 characters, a method like this is sure to be a part of the validation code that counts the number of characters in a Tweet.

Using the join method

We've walked through the split method, which allows you to convert a string into a collection of characters. Thankfully, Ruby also has a method that does the opposite, allowing you to convert an array of characters into a single string; that method is called join. Let's imagine a situation where we're asked to reverse the words in a string. This is a common Ruby coding interview question, so it's an important concept to understand, since it tests your knowledge of how strings work in Ruby. Let's imagine that we have a string, such as:

str = "backwards am I"

And we're asked to reverse the words in the string. The pseudocode for the algorithm would be:

Split the string into words
Reverse the order of the words
Merge all of the split words back into a single string

We can actually accomplish each of these requirements in a single line of Ruby code. The following code snippet will perform the task:

str.split.reverse.join(' ')

This code will convert the single string into an array of strings; for the example, it will equal ["backwards", "am", "I"]. From there it will reverse the order of the array elements, so the array will equal ["I", "am", "backwards"]. With the words reversed, we simply need to merge the words into a single string, which is where the join method comes in. Running the join method will convert all of the words in the array into one string.

Summary

In this article, we were introduced to the string data type and how it can be utilized in Ruby. We analyzed how to integrate dynamic values into strings by leveraging string interpolation. We also learned the methods of basic string manipulation and how to find and replace string data. We analyzed how to break strings into smaller components, along with how to clean up string-based data. We even introduced the Array class in this article.

Resources for Article:
Further resources on this subject:
Ruby and Metasploit Modules [article]
Find closest mashup plugin with Ruby on Rails [article]
Building tiny Web-applications in Ruby using Sinatra [article]

Chaos Engineering: managing complexity by breaking things

Richard Gall
20 Apr 2018
7 min read
Chaos Engineering is based on a fundamental assertion about software infrastructure today: that it is inherently chaotic. Or, to be more specific, it is chaotic because it is complex. Whereas software infrastructure used to be centralized, owned, and licensed by large enterprise vendors, today much of the software that comprises infrastructure is open source. This is where we get back to chaos: because software infrastructure is comprised of many different parts, the way these parts interact can be unpredictable. Chaos Engineering is an attempt to acknowledge that fact and develop software accordingly.

Who invented Chaos Engineering?

Chaos Engineering began at Netflix. That makes sense when you consider the complexity of the Netflix technology stack and the way the company has scaled over the last 5 years or so. Netflix built a number of tools to help adopt this chaos-first approach, the most prominent being Chaos Monkey. First launched in 2011 and open sourced in 2012, Chaos Monkey is a tool that randomly selects instances in production and pulls them down; a little bit like monkeys pulling off your windscreen wipers in a safari park. However, Chaos Monkey became part of a wider suite of tools - called the Simian Army - that were built by Netflix to cause chaos in different parts of its infrastructure. Here are two other components used to simulate chaos:

Chaos Gorilla causes big trouble by pulling down an entire AWS availability zone
Latency Monkey delays communication, essentially simulating poor network performance

From that point Chaos Engineering grew. A number of large Silicon Valley organizations have adopted similar approaches. For example, Facebook's Project Storm simulates data center failures on a huge scale, while Uber uses a tool called uDestroy. Slack has recently spoken in detail on the importance of stress testing its software too; the company is looking to build an engineering team simply to perform Chaos Engineering and improve Slack's reliability.

One of the most interesting figures in Chaos Engineering is a man called Kolton Andrus. Andrus used to work at Amazon and Netflix, but today he is the CEO and founder of Gremlin, a startup that "helps engineers build resilient systems". Essentially, Andrus helped to develop the concept of Chaos Engineering while he was working at Netflix. Gremlin is his vehicle for making it accessible to others.

Chaos Engineering in practice

Now the conceptual stuff is out of the way, here's how Chaos Engineering works. It's actually quite straightforward: Chaos Engineering simulates all sorts of unpredictable situations and scenarios in order to see how the system responds. It's effectively a form of stress testing. As we've seen, over the past few years companies have built their own tools to allow them to stress test their infrastructure. But Gremlin is taking the approach of offering this as a service. Its product is described as 'resiliency-as-a-service': a whole library of 'attacks' which can replicate different types of outages within a system. These are what it calls 'chaos experiments' that allow you to 'identify weak points in your system and fix them before they become a problem'. In this sense, Chaos Engineering is a bit like taking the principles of penetration testing and applying them to software testing more broadly. By simulating everything that could possibly go wrong, it allows you to make much better optimization decisions. The principles of Chaos Engineering are documented at principlesofchaos.org.
This is effectively its 'manifesto'. There's a lot in there worth reading, but here are the five principles that any sort of testing or experimentation should aspire to:

Base your testing hypothesis on steady-state behavior. Consider your infrastructure holistically; making individual parts work is important but not the priority.
Simulate a variety of real-world events. This could be hardware or software failures, or simply external changes like spikes in traffic. What's important is that they're all unpredictable.
Test in production. Your tests should be authentic.
Automate! Testing can be laborious and require a lot of manual work. Make use of automation tools to run lots of different tests without taking up too much of your time.
Don't cause unnecessary pain. While it's important that your stress tests are authentic, the impact must be contained and minimized by the engineer.

Why Chaos Engineering now?

Chaos Engineering isn't particularly new. As you've seen, Netflix has been doing it since 2011. But it does feel more urgent and relevant today. That's because the complexity of the software infrastructure behind many of the biggest Silicon Valley companies is now mainstream. It's normal. Cloud isn't an exotic buzzword any more - it's a reality (a reality that often has failures). Microservices are common - they're a commonsense way of building better applications and websites.

Alongside this increased complexity, there is also a growing awareness of how much software outages can cost businesses. In a white paper, Gremlin makes a big deal out of how much money is lost due to outages. Gremlin cites BA's system failure in summer 2017, which left passengers stranded all over the world. This outage was estimated to have cost BA $135 million. It also refers to the Amazon S3 outage in March 2017, which is believed to have cost Amazon's customers $150 million. So - outages cost money. Yes, it's marketing spiel from Gremlin, but it's also true. It doesn't take a genius to work out that if your eCommerce site is down for an hour, you're going to have lost a lot of money. Because software performance is so tied up with business performance, it feels incredibly fragile. That's why Chaos Engineering is perhaps more important and popular than ever. It's a way of countering that fragility.

The key challenges of Chaos Engineering

Chaos Engineering poses many challenges to software engineering teams. First and foremost, it requires a big cultural change. If you're intent on breaking everything, there are no rules about how things should work or what you're trying to build. Instead, you're looking for the best way to build software that performs for the user.

More practically, Chaos Engineering isn't that easy to do in a cost-effective manner. Everything Gremlin details in its white paper is very much true - of course outages cost a hell of a lot. But creative destruction and experimentation can feel like an expensive route through software projects. It's not hard to see how it might appear self-indulgent, especially to a company or organization where software isn't properly understood. And more to the point, how often do businesses actually do the smart thing when they're building software? Long-term projects are always difficult. So much software evolves pragmatically - often for the worse. Adding in an extra layer of experimentation and detailed testing is a weird mix of bacchanalian and hyper-organized, something that many organizations just couldn't process or properly understand.
Chaos engineering and the future of software development

Chaos Engineering certainly looks like the future of software development. The only question is whether services like those provided by Gremlin will take off. To understand the true value of stress testing your infrastructure, you do need at least a modicum of awareness of the complexity of that infrastructure. Indeed, you probably need to have a conversation about which services and dependencies are most business critical. Or rather, which ones most impact the user. That's something this TechCrunch piece addresses:

"Testing can... be very political. Finding the points of failure in a system might force deep conversations about a particular software architecture and its robustness in the face of tough situations. A particular company might be deeply invested in a specific technical roadmap (e.g. microservices) that chaos engineering tests show is not as resilient to failures as originally predicted."

This means there is going to be a question mark over the extent to which Chaos Engineering ever really enters the mainstream. How many businesses want to have these conversations? It's not just about the inclination - it's also about the time and money. Chaos Engineering is an innovative approach that really calls people's bluff when they talk about innovation. It asks difficult questions about how and why you innovate: do you do new things because you think you should? Is this new thing going to be good for the business? And how well will it work for users? Of course these questions are vital when you're building software. But they rarely make building software easier.

Using Node.js dependencies in NW.js

Max Gfeller
19 Nov 2015
6 min read
NW.js (formerly known as node-webkit) is a framework that makes it possible to write multi-platform desktop applications using the technologies you already know well: HTML, CSS, and JavaScript. It bundles a Chromium and a Node (or io.js) runtime and provides additional APIs to implement native-like features such as real menu bars or desktop notifications. A big advantage of having a Node/io.js runtime is being able to make use of all the modules that are available to Node developers. We can categorize the modules we can use into three different types.

Internal modules

Node comes with a solid set of internal modules like fs or http. It is built on the UNIX philosophy of doing only one thing and doing it very well, so you won't find too much functionality in Node core. The following modules are shipped with Node:

assert: used for writing unit tests
buffer: raw memory allocation used for dealing with binary data
child_process: spawn and use child processes
cluster: take advantage of multi-core systems
crypto: cryptographic functions
dgram: use datagram sockets
dns: perform DNS lookups
domain: handle multiple different IO operations as a single group
events: provides the EventEmitter
fs: operations on the file system
http: perform http queries and create http servers
https: perform https queries and create https servers
net: asynchronous network wrapper
os: basic operating-system related utility functions
path: handle and transform file paths
punycode: deal with punycode domain names
querystring: deal with query strings
stream: abstract interface implemented by various objects in Node
timers: setTimeout, setInterval etc.
tls: encrypted stream communication
url: URL resolution and parsing
util: various utility functions
vm: sandbox to run Node code in
zlib: bindings to Gzip/Gunzip, Deflate/Inflate, and DeflateRaw/InflateRaw

Those are documented in the official Node API documentation and can all be used within NW.js. Please take care that Chromium already defines a crypto global, so when using the crypto module in the webkit context you should assign it to a variable like crypt rather than crypto:

var crypt = require('crypto');

The following example shows how we would read a file and use its contents using Node's modules:

var fs = require('fs');

fs.readFile(__dirname + '/file.txt', function (error, contents) {
  if (error) return console.error(error);
  console.log(contents);
});

3rd party JavaScript modules

Soon after Node itself was started, Isaac Schlueter, who was a friend of creator Ryan Dahl, started working on a package manager for Node itself. While Node's popularity reached new highs, a lot of packages got added to the npm registry, and it soon became the fastest growing package registry. At the time of this writing there are over 169,000 packages on the registry and nearly two billion downloads each month. The npm registry is now also slowly evolving from being "only" a package manager for Node into a package manager for all things JavaScript. Most of these packages can also be used inside NW.js applications. Your application's dependencies are defined in your package.json file, in the dependencies (or devDependencies) section:

{
  "name": "my-cool-application",
  "version": "1.0.0",
  "dependencies": {
    "lodash": "^3.1.2"
  },
  "devDependencies": {
    "uglify-js": "^2.4.3"
  }
}

In the dependencies field you find all the modules that are required to run your application, while in the devDependencies field only the modules required while developing the application are found.
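As a quick note on those version strings (this is standard npm semver behavior, not something specific to NW.js): a caret range such as "^3.1.2" allows npm to install any compatible version from 3.1.2 up to, but not including, 4.0.0, while a tilde range such as "~3.1.2" would only allow patch releases below 3.2.0. Writing "3.1.2" with no prefix pins that exact version, which can be useful when reproducible installs matter more than automatically receiving fixes.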
Installing a module is fairly easy, and the best way to do this is with the npm install command:

npm install lodash --save

The install command directly downloads the latest version into your node_modules/ folder. The --save flag means that this dependency should also directly be written into your package.json file. You can also define a specific version to download by using the following notation:

npm install lodash@1.*

or you can even pin an exact version using the same module@version notation.

How does node's require() work?

You need to deal with two different contexts in NW.js, and it is really important to always know which context you are currently in, as it changes the way the require() function works. When you load a module using Node's require() function, that module runs in the Node context. That means you have the same globals as you would have in a pure Node script, but you can't access the globals from the browser, e.g. document or window. If you write JavaScript code inside of a <script> tag in your HTML, or if you include a script inside your HTML using <script src="">, then this code runs in the webkit context. There you have access to all browser globals.

In the webkit context

The require() function is a module loading system defined by the CommonJS Modules 1.0 standard and directly implemented in Node core. To offer the same smooth experience, you get a modified require() method that works in webkit, too. Whenever you want to include a certain module from the webkit context, e.g. directly from an inline script in your index.html file, you need to specify the path directly from the root of your project. Let's assume the following folder structure:

- app/
  - app.js
  - foo.js
  - bar.js
- index.html

If you want to include the app/app.js file directly in your index.html, you need to include it like this:

<script type="text/javascript">
  var app = require('./app/app.js');
</script>

If you need to use a module from npm, then you can simply require() it and NW.js will figure out where the corresponding node_modules/ folder is located.

In the node context

In Node, when you use relative paths, require() will always try to locate the module relative to the file you are requiring it from. If we take the example from above, we could require the foo.js module from app.js like this:

var foo = require('./foo');

About the Author

Max Gfeller is a passionate web developer and JavaScript enthusiast. He is making awesome things at Cylon and can be found on Twitter @mgefeller.

To Optimize Scans

Packt
23 Jun 2017
20 min read
In this article by Paulino Calderon Pale, author of the book Nmap Network Exploration and Security Auditing Cookbook, Second Edition, we will explore the following topics:

Skipping phases to speed up scans
Selecting the correct timing template
Adjusting timing parameters
Adjusting performance parameters

One of my favorite things about Nmap is how customizable it is. If configured properly, Nmap can be used to scan from single targets to millions of IP addresses in a single run. However, we need to understand the configuration options and scanning phases that can affect performance, and most importantly, really think about our scan objective beforehand. Do we need the information from the reverse DNS lookup? Do we know all targets are online? Is the network congested? Do targets respond fast enough? These and many more aspects can really add up to your scanning time. Therefore, optimizing scans is important and can save us hours if we are working with many targets. This article starts by introducing the different scanning phases, along with the timing and performance options. Unless we have a solid understanding of what goes on behind the curtains during a scan, we won't be able to completely optimize our scans. Timing templates are designed to work in common scenarios, but we want to go further and shave off those extra seconds per host during our scans. Remember that this can improve not only performance but accuracy as well. Maybe those targets marked as offline were simply too slow to respond to the probes sent after all.

Skipping phases to speed up scans

Nmap scans can be broken into phases. When we are working with many hosts, we can save time by skipping tests or phases that return information we don't need or that we already have. By carefully selecting our scan flags, we can significantly improve the performance of our scans. This recipe explains the process that takes place behind the curtains when scanning, and how to skip certain phases to speed up scans.

How to do it...

To perform a full port scan with the timing template set to aggressive, and without reverse DNS resolution (-n) or ping (-Pn), use the following command:

# nmap -T4 -n -Pn -p- 74.207.244.221

Note the scanning time at the end of the report:

Nmap scan report for 74.207.244.221
Host is up (0.11s latency).
Not shown: 65532 closed ports
PORT     STATE SERVICE
22/tcp   open ssh
80/tcp   open http
9929/tcp open nping-echo
Nmap done: 1 IP address (1 host up) scanned in 60.84 seconds

Now, compare the running time that we get if we don't skip any tests:

# nmap -p- scanme.nmap.org

Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.11s latency).
Not shown: 65532 closed ports
PORT     STATE SERVICE
22/tcp   open ssh
80/tcp   open http
9929/tcp open nping-echo
Nmap done: 1 IP address (1 host up) scanned in 77.45 seconds

Although the time difference isn't very drastic, it really adds up when you work with many hosts. I recommend that you think about your objectives and the information you need, and consider the possibility of skipping some of the scanning phases described next.

How it works...

Nmap scans are divided into several phases. Some of them require arguments to be set in order to run, but others, such as the reverse DNS resolution, are executed by default. Let's review the phases that can be skipped and their corresponding Nmap flags:

Target enumeration: In this phase, Nmap parses the target list.
This phase can't exactly be skipped, but you can save DNS forward lookups by using only IP addresses as targets.

Host discovery: This is the phase where Nmap establishes if the targets are online and in the network. By default, Nmap sends an ICMP echo request and some additional probes, but it supports several host discovery techniques that can even be combined. To skip the host discovery phase (no ping), use the flag -Pn. We can easily see what probes we skipped by comparing the packet traces of the two scans:

$ nmap -Pn -p80 -n --packet-trace scanme.nmap.org
SENT (0.0864s) TCP 106.187.53.215:62670 > 74.207.244.221:80 S ttl=46 id=4184 iplen=44 seq=3846739633 win=1024 <mss 1460>
RCVD (0.1957s) TCP 74.207.244.221:80 > 106.187.53.215:62670 SA ttl=56 id=0 iplen=44 seq=2588014713 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.11s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.22 seconds

For scanning without skipping host discovery, we use the command:

$ nmap -p80 -n --packet-trace scanme.nmap.org
SENT (0.1099s) ICMP 106.187.53.215 > 74.207.244.221 Echo request (type=8/code=0) ttl=59 id=12270 iplen=28
SENT (0.1101s) TCP 106.187.53.215:43199 > 74.207.244.221:443 S ttl=59 id=38710 iplen=44 seq=1913383349 win=1024 <mss 1460>
SENT (0.1101s) TCP 106.187.53.215:43199 > 74.207.244.221:80 A ttl=44 id=10665 iplen=40 seq=0 win=1024
SENT (0.1102s) ICMP 106.187.53.215 > 74.207.244.221 Timestamp request (type=13/code=0) ttl=51 id=42939 iplen=40
RCVD (0.2120s) ICMP 74.207.244.221 > 106.187.53.215 Echo reply (type=0/code=0) ttl=56 id=2147 iplen=28
SENT (0.2731s) TCP 106.187.53.215:43199 > 74.207.244.221:80 S ttl=51 id=34952 iplen=44 seq=2609466214 win=1024 <mss 1460>
RCVD (0.3822s) TCP 74.207.244.221:80 > 106.187.53.215:43199 SA ttl=56 id=0 iplen=44 seq=4191686720 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.41 seconds

Reverse DNS resolution: Host names often reveal additional information by themselves, and Nmap uses reverse DNS lookups to obtain them. This step can be skipped by adding the argument -n to your scan arguments. Let's see the traffic generated by the two scans, with and without reverse DNS resolution. First, let's skip reverse DNS resolution by adding -n to your command:

$ nmap -n -Pn -p80 --packet-trace scanme.nmap.org
SENT (0.1832s) TCP 106.187.53.215:45748 > 74.207.244.221:80 S ttl=37 id=33309 iplen=44 seq=2623325197 win=1024 <mss 1460>
RCVD (0.2877s) TCP 74.207.244.221:80 > 106.187.53.215:45748 SA ttl=56 id=0 iplen=44 seq=3220507551 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open  http
Nmap done: 1 IP address (1 host up) scanned in 0.32 seconds

And here is the same command without skipping reverse DNS resolution:

$ nmap -Pn -p80 --packet-trace scanme.nmap.org
NSOCK (0.0600s) UDP connection requested to 106.187.36.20:53 (IOD #1) EID 8
NSOCK (0.0600s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 18
NSOCK (0.0600s) UDP connection requested to 106.187.35.20:53 (IOD #2) EID 24
NSOCK (0.0600s) Read request from IOD #2 [106.187.35.20:53] (timeout: -1ms) EID 34
NSOCK (0.0600s) UDP connection requested to 106.187.34.20:53 (IOD #3) EID 40
NSOCK (0.0600s) Read request from IOD #3 [106.187.34.20:53] (timeout: -1ms) EID 50
NSOCK (0.0600s) Write request for 45 bytes to IOD #1 EID 59 [106.187.36.20:53]: =............221.244.207.74.in-addr.arpa.....
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 8 [106.187.36.20:53]
NSOCK (0.0600s) Callback: WRITE SUCCESS for EID 59 [106.187.36.20:53]
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 24 [106.187.35.20:53]
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 40 [106.187.34.20:53]
NSOCK (0.0620s) Callback: READ SUCCESS for EID 18 [106.187.36.20:53] (174 bytes)
NSOCK (0.0620s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 66
NSOCK (0.0620s) nsi_delete() (IOD #1)
NSOCK (0.0620s) msevent_cancel() on event #66 (type READ)
NSOCK (0.0620s) nsi_delete() (IOD #2)
NSOCK (0.0620s) msevent_cancel() on event #34 (type READ)
NSOCK (0.0620s) nsi_delete() (IOD #3)
NSOCK (0.0620s) msevent_cancel() on event #50 (type READ)
SENT (0.0910s) TCP 106.187.53.215:46089 > 74.207.244.221:80 S ttl=42 id=23960 iplen=44 seq=1992555555 win=1024 <mss 1460>
RCVD (0.1932s) TCP 74.207.244.221:80 > 106.187.53.215:46089 SA ttl=56 id=0 iplen=44 seq=4229796359 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open  http
Nmap done: 1 IP address (1 host up) scanned in 0.22 seconds

Port scanning: In this phase, Nmap determines the state of the ports. By default, it uses SYN or TCP Connect scanning, depending on the user's privileges, but several other port scanning techniques are supported. Although this may not be so obvious, Nmap can do a few different things with targets without port scanning them, such as resolving their DNS names or checking whether they are online.
For this reason, this phase can be skipped with the argument -sn:

$ nmap -sn -R --packet-trace 74.207.244.221
SENT (0.0363s) ICMP 106.187.53.215 > 74.207.244.221 Echo request (type=8/code=0) ttl=56 id=36390 iplen=28
SENT (0.0364s) TCP 106.187.53.215:53376 > 74.207.244.221:443 S ttl=39 id=22228 iplen=44 seq=155734416 win=1024 <mss 1460>
SENT (0.0365s) TCP 106.187.53.215:53376 > 74.207.244.221:80 A ttl=46 id=36835 iplen=40 seq=0 win=1024
SENT (0.0366s) ICMP 106.187.53.215 > 74.207.244.221 Timestamp request (type=13/code=0) ttl=50 id=2630 iplen=40
RCVD (0.1377s) TCP 74.207.244.221:443 > 106.187.53.215:53376 RA ttl=56 id=0 iplen=40 seq=0 win=0
NSOCK (0.1660s) UDP connection requested to 106.187.36.20:53 (IOD #1) EID 8
NSOCK (0.1660s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 18
NSOCK (0.1660s) UDP connection requested to 106.187.35.20:53 (IOD #2) EID 24
NSOCK (0.1660s) Read request from IOD #2 [106.187.35.20:53] (timeout: -1ms) EID 34
NSOCK (0.1660s) UDP connection requested to 106.187.34.20:53 (IOD #3) EID 40
NSOCK (0.1660s) Read request from IOD #3 [106.187.34.20:53] (timeout: -1ms) EID 50
NSOCK (0.1660s) Write request for 45 bytes to IOD #1 EID 59 [106.187.36.20:53]: [............221.244.207.74.in-addr.arpa.....
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 8 [106.187.36.20:53]
NSOCK (0.1660s) Callback: WRITE SUCCESS for EID 59 [106.187.36.20:53]
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 24 [106.187.35.20:53]
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 40 [106.187.34.20:53]
NSOCK (0.1660s) Callback: READ SUCCESS for EID 18 [106.187.36.20:53] (174 bytes)
NSOCK (0.1660s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 66
NSOCK (0.1660s) nsi_delete() (IOD #1)
NSOCK (0.1660s) msevent_cancel() on event #66 (type READ)
NSOCK (0.1660s) nsi_delete() (IOD #2)
NSOCK (0.1660s) msevent_cancel() on event #34 (type READ)
NSOCK (0.1660s) nsi_delete() (IOD #3)
NSOCK (0.1660s) msevent_cancel() on event #50 (type READ)
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
Nmap done: 1 IP address (1 host up) scanned in 0.17 seconds

In the previous example, we can see that an ICMP echo request and a reverse DNS lookup were performed (we forced DNS lookups with the option -R), but no port scanning was done.

There's more...

I recommend that you also run a couple of test scans to measure the speeds of the different DNS servers. I've found that ISPs tend to have the slowest DNS servers, but you can make Nmap use different DNS servers by specifying the argument --dns-servers. For example, to use Google's DNS servers, use the following command:

# nmap -R --dns-servers 8.8.8.8,8.8.4.4 -O scanme.nmap.org

You can test your DNS server speed by comparing the scan times. The following command tells Nmap not to ping or port scan, and to only perform a reverse DNS lookup:

$ nmap -R -Pn -sn 74.207.244.221
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up.
Nmap done: 1 IP address (1 host up) scanned in 1.01 seconds
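Putting the flags from this recipe together: when we already know all our targets are online and we don't need hostnames, both phases can be skipped in a single run. The network range below is only an illustrative placeholder:

# nmap -n -Pn -p- -T4 10.0.0.0/24

Nmap's list scan is also handy here; it prints the parsed target list without sending a single probe, so you can verify what will be scanned before committing to a long run:

$ nmap -sL -n 10.0.0.0/24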
To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix-Scanning Phases for more information.

Selecting the correct timing template

Nmap includes six templates that set different timing and performance arguments to optimize your scans based on network conditions. Even though Nmap automatically adjusts some of these values, it is recommended that you set the correct timing template to give Nmap a hint about the speed of your network connection and the target's response time. The following will teach you about Nmap's timing templates and how to choose the most appropriate one.

How to do it...

Open your terminal and type the following command to use the aggressive timing template (-T4). Let's also use debugging (-d) to see what the option -T4 sets:

# nmap -T4 -d 192.168.4.20
--------------- Timing report ---------------
  hostgroups: min 1, max 100000
  rtt-timeouts: init 500, min 100, max 1250
  max-scan-delay: TCP 10, UDP 1000, SCTP 10
  parallelism: min 0, max 0
  max-retries: 6, host-timeout: 0
  min-rate: 0, max-rate: 0
---------------------------------------------
<Scan output removed for clarity>

You may use the integers between 0 and 5, that is, -T[0-5].

How it works...

The option -T is used to set the timing template in Nmap. Nmap provides six timing templates to help users tune the timing and performance arguments. The available timing templates and their initial configuration values are as follows:

Paranoid (-T0): This template is useful for avoiding detection systems, but it is painfully slow because only one port is scanned at a time, and the timeout between probes is 5 minutes:

--------------- Timing report ---------------
  hostgroups: min 1, max 100000
  rtt-timeouts: init 300000, min 100, max 300000
  max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
  parallelism: min 0, max 1
  max-retries: 10, host-timeout: 0
  min-rate: 0, max-rate: 0
---------------------------------------------

Sneaky (-T1): This template is useful for avoiding detection systems, but it is still very slow:

--------------- Timing report ---------------
  hostgroups: min 1, max 100000
  rtt-timeouts: init 15000, min 100, max 15000
  max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
  parallelism: min 0, max 1
  max-retries: 10, host-timeout: 0
  min-rate: 0, max-rate: 0
---------------------------------------------

Polite (-T2): This template is used when scanning is not supposed to interfere with the target system; it is a very conservative and safe setting:

--------------- Timing report ---------------
  hostgroups: min 1, max 100000
  rtt-timeouts: init 1000, min 100, max 10000
  max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
  parallelism: min 0, max 1
  max-retries: 10, host-timeout: 0
  min-rate: 0, max-rate: 0
---------------------------------------------

Normal (-T3): This is Nmap's default timing template, which is used when the argument -T is not set:

--------------- Timing report ---------------
  hostgroups: min 1, max 100000
  rtt-timeouts: init 1000, min 100, max 10000
  max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
  parallelism: min 0, max 0
  max-retries: 10, host-timeout: 0
  min-rate: 0, max-rate: 0
---------------------------------------------

Aggressive (-T4): This is the recommended timing template for broadband and Ethernet connections:

--------------- Timing report ---------------
  hostgroups: min 1, max 100000
  rtt-timeouts: init 500, min 100, max 1250
  max-scan-delay: TCP 10, UDP 1000, SCTP 10
  parallelism: min 0, max 0
  max-retries: 6, host-timeout: 0
  min-rate: 0, max-rate: 0
---------------------------------------------

Insane (-T5): This timing template sacrifices accuracy for speed:

--------------- Timing report ---------------
  hostgroups: min 1, max 100000
  rtt-timeouts: init 250, min 50, max 300
  max-scan-delay: TCP 5, UDP 1000, SCTP 5
  parallelism: min 0, max 0
  max-retries: 2, host-timeout: 900000
  min-rate: 0, max-rate: 0
---------------------------------------------
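The templates can also be selected by name rather than by number. For instance, the following two commands are equivalent:

# nmap -T4 scanme.nmap.org
# nmap -T aggressive scanme.nmap.org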
There's more...

Interactive mode in Nmap allows users to press keys to dynamically change runtime variables, such as verbose, debugging, and packet tracing. Although the discussion of including timing and performance options in interactive mode has come up a few times on the development mailing list, this hasn't been implemented yet. However, there is an unofficial patch, submitted in June 2012, that allows you to change the minimum and maximum packet rate values (--max-rate and --min-rate) dynamically. If you would like to try it out, it's located at http://seclists.org/nmap-dev/2012/q2/883.

Adjusting timing parameters

Nmap not only adjusts itself to different network and target conditions while scanning, but it can also be fine-tuned using timing options to improve performance. Nmap automatically calculates packet round trip, timeout, and delay values, but these values can also be set manually through specific settings. The following describes the timing parameters supported by Nmap.

How to do it...

Enter the following command to adjust the initial round trip timeout, the delay between probes, and the timeout for each scanned host:

# nmap -T4 --scan-delay 1s --initial-rtt-timeout 150ms --host-timeout 15m -d scanme.nmap.org
--------------- Timing report ---------------
  hostgroups: min 1, max 100000
  rtt-timeouts: init 150, min 100, max 1250
  max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
  parallelism: min 0, max 0
  max-retries: 6, host-timeout: 900000
  min-rate: 0, max-rate: 0
---------------------------------------------

How it works...

Nmap supports different timing arguments that can be customized. However, setting these values incorrectly will most likely hurt performance rather than improve it. Let's examine each timing parameter more closely and learn its Nmap option name.

The Round Trip Time (RTT) value is used by Nmap to know when to give up on or retransmit a probe. Nmap estimates this value by analyzing previous responses, but you can set the initial RTT timeout with the argument --initial-rtt-timeout, as shown in the following command:

# nmap -A -p- --initial-rtt-timeout 150ms <target>

In addition, you can set the minimum and maximum RTT timeout values with --min-rtt-timeout and --max-rtt-timeout, respectively, as shown in the following command:

# nmap -A -p- --min-rtt-timeout 200ms --max-rtt-timeout 600ms <target>

Another very important setting we can control in Nmap is the waiting time between probes. Use the arguments --scan-delay and --max-scan-delay to set the waiting time and the maximum amount of time allowed between probes, respectively, as shown in the following commands:

# nmap -A --max-scan-delay 10s scanme.nmap.org
# nmap -A --scan-delay 1s scanme.nmap.org

Note that the arguments shown previously are very useful for avoiding detection mechanisms. Be careful not to set --max-scan-delay too low, because doing so will most likely cause open ports to be missed.

There's more...

If you would like Nmap to give up on a host after a certain amount of time, you can set the argument --host-timeout:

# nmap -sV -A -p- --host-timeout 5m <target>

Estimating round trip times with Nping

To use Nping to estimate the round trip time between you and the target, the following command can be used:

# nping -c30 <target>

This will make Nping send 30 ICMP echo request packets; after it finishes, it will show the average, minimum, and maximum RTT values obtained:

# nping -c30 scanme.nmap.org
...
SENT (29.3569s) ICMP 50.116.1.121 > 74.207.244.221 Echo request (type=8/code=0) ttl=64 id=27550 iplen=28
RCVD (29.3576s) ICMP 74.207.244.221 > 50.116.1.121 Echo reply (type=0/code=0) ttl=63 id=7572 iplen=28
Max rtt: 10.170ms | Min rtt: 0.316ms | Avg rtt: 0.851ms
Raw packets sent: 30 (840B) | Rcvd: 30 (840B) | Lost: 0 (0.00%)
Tx time: 29.09096s | Tx bytes/s: 28.87 | Tx pkts/s: 1.03
Rx time: 30.09258s | Rx bytes/s: 27.91 | Rx pkts/s: 1.00
Nping done: 1 IP address pinged in 30.47 seconds

Examine the round trip times, and use the maximum to set the correct --initial-rtt-timeout and --max-rtt-timeout values. The official documentation recommends using double the maximum RTT value for --initial-rtt-timeout, and as much as four times the maximum RTT value for --max-rtt-timeout.
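Applying that rule of thumb to the Nping results above (a maximum RTT of roughly 10 ms) would give values along these lines; the rounding here is our own:

# nmap -p- --initial-rtt-timeout 20ms --max-rtt-timeout 40ms scanme.nmap.org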
Displaying the timing settings

Enable debugging to make Nmap report the timing settings before scanning:

$ nmap -d <target>
--------------- Timing report ---------------
  hostgroups: min 1, max 100000
  rtt-timeouts: init 1000, min 100, max 10000
  max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
  parallelism: min 0, max 0
  max-retries: 10, host-timeout: 0
  min-rate: 0, max-rate: 0
---------------------------------------------

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix-Scanning Phases for more information.

Adjusting performance parameters

Nmap not only adjusts itself to different network and target conditions while scanning, but it also supports several parameters that affect its behavior, such as the number of hosts scanned concurrently, the number of retries, and the number of allowed probes. Learning how to adjust these parameters properly can considerably reduce your scanning time. The following explains the Nmap parameters that can be adjusted to improve performance.

How to do it...

Enter the following command, adjusting the values for your target's condition:

$ nmap --min-hostgroup 100 --max-hostgroup 500 --max-retries 2 <target>

How it works...

The command shown previously tells Nmap to scan and report in groups of no fewer than 100 (--min-hostgroup 100) and no more than 500 hosts (--max-hostgroup 500). It also tells Nmap to retry only twice before giving up on any port (--max-retries 2):

# nmap --min-hostgroup 100 --max-hostgroup 500 --max-retries 2 <target>

It is important to note that setting these values incorrectly will most likely hurt performance or accuracy rather than improve them.

Nmap sends many probes during its port scanning phase because of the ambiguity of what a lack of response means: either the packet got lost, the service is filtered, or the service is not open. By default, Nmap adjusts the number of retries based on network conditions, but you can set this value yourself with the argument --max-retries. Increasing the number of retries can improve Nmap's accuracy, but keep in mind that this sacrifices speed:

$ nmap --max-retries 10 <target>

The arguments --min-hostgroup and --max-hostgroup control the number of hosts that we probe concurrently. Keep in mind that reports are also generated based on this value, so adjust it depending on how often you would like to see the scan results. Larger groups are optimal for performance, but you may prefer smaller host groups on slow networks:

# nmap -A -p- --min-hostgroup 100 --max-hostgroup 500 <target>

There is also a very important pair of arguments that can be used to limit the number of packets sent per second by Nmap: --min-rate and --max-rate. These rates are set automatically by Nmap if the arguments are not present, and they need to be used carefully to avoid undesirable effects:

# nmap -A -p- --min-rate 50 --max-rate 100 <target>

Finally, the arguments --min-parallelism and --max-parallelism can be used to control the number of probes for a host group. By setting these arguments, Nmap will no longer adjust the values dynamically:

# nmap -A --max-parallelism 1 <target>
# nmap -A --min-parallelism 10 --max-parallelism 250 <target>

There's more...

If you would like Nmap to give up on a host after a certain amount of time, you can set the argument --host-timeout, as shown in the following command:

# nmap -sV -A -p- --host-timeout 5m <target>

Interactive mode in Nmap allows users to press keys to dynamically change runtime variables, such as verbose, debugging, and packet tracing. Although the discussion of including timing and performance options in interactive mode has come up a few times on the development mailing list, this hasn't been implemented yet. However, there is an unofficial patch, submitted in June 2012, that allows you to change the minimum and maximum packet rate values (--max-rate and --min-rate) dynamically. If you would like to try it out, it's located at http://seclists.org/nmap-dev/2012/q2/883.

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix-Scanning Phases for more information.

Summary

In this article, we learned how to optimize Nmap scans by skipping unneeded phases, selecting the correct timing template, and fine-tuning Nmap's timing and performance parameters. Applied to large target lists, these techniques can save hours of scanning time and make better use of the available bandwidth and CPU resources. The article is short but full of tips for optimizing your scans.
Data Access Methods in Flex 3

Packt
06 Oct 2009
4 min read
Flex provides a range of data access components to work with server-side remote data. These components are also commonly called Remote Procedure Call (RPC) services. A Remote Procedure Call allows you to execute a remote or server-side procedure or method, either locally or in another address space, without having to write the details of that call explicitly. (For more information on RPC, visit http://en.wikipedia.org/wiki/Remote_procedure_call.) The Flex data access components are based on Service Oriented Architecture (SOA). They allow you to call server-side business logic built in Java, PHP, ColdFusion, .NET, or any other server-side technology to send and receive remote data.

In this article by Satish Kore, we will learn how to interact with a server environment (specifically, one built with Java). We will look at the various data access components, including the HTTPService and WebService classes. This article focuses on providing in-depth information on the various data access methods available in Flex.

Flex data access components

Flex data access components provide a call-and-response model for accessing remote data. There are three methods for accessing remote data in Flex: via the HTTPService class, the WebService class (compliant with the Simple Object Access Protocol, or SOAP), or the RemoteObject class.

The HTTPService class

The HTTPService class is a basic method of sending and receiving remote data over the HTTP protocol. It allows you to perform traditional HTTP GET and POST requests to a URL. The HTTPService class can be used with any kind of server-side technology, such as JSP, PHP, ColdFusion, ASP, and so on. The HTTP service is also known as a REST-style web service. REST stands for Representational State Transfer. It is an approach for getting content from a web server via web pages. Web pages are written in many languages, such as JSP, PHP, ColdFusion, and ASP, that receive POST or GET requests. Output can be formatted in XML instead of HTML, which can be consumed by Flex applications for manipulating and displaying content. For example, RSS feeds output standard XML content when accessed via a URL. For more information about REST, see www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm.

Using the HTTPService tag in MXML

The HTTPService class can be used to load both static and dynamic content from remote URLs. For example, you can use HTTPService to load an XML file that is located on a server, or you can use HTTPService in conjunction with server-side technologies that return dynamic results to the Flex application. You can use both the HTTPS and the HTTP protocols to access secure and non-secure remote content.
The following is the basic syntax of an HTTPService tag:

<mx:HTTPService
  id="instanceName"
  method="GET|POST|HEAD|OPTIONS|PUT|TRACE|DELETE"
  resultFormat="object|array|xml|e4x|flashvars|text"
  url="http://www.mydomain.com/myFile.jsp"
  fault="faultHandler(event);"
  result="resultHandler(event)"/>

The following are the properties and events:

id: Specifies the instance identifier name
method: Specifies the HTTP method; the default value is GET
resultFormat: Specifies the format of the result data
url: Specifies the complete remote URL that will be called
fault: Specifies the event handler method called when a fault or error occurs while connecting to the URL
result: Specifies the event handler method called when the HTTPService call completes successfully and returns results

When you call the HTTPService.send() method, Flex makes an HTTP request to the specified URL. The HTTP response is returned in the result property of the ResultEvent object in the result handler of the HTTPService class. You can also pass an optional parameter to the send() method by passing in an object; any attributes of the object will be passed along as a URL-encoded query string with the URL request. For example:

var param:Object = {title:"Book Title", isbn:"1234567890"};
myHS.send(param);

In the preceding code example, we created an object called param and added two string attributes to it: title and isbn. When you pass the param object to the send() method, these two string attributes are converted to URL-encoded HTTP request parameters, so your HTTP request URL will look like this: http://www.domain.com/myjsp.jsp?title=Book%20Title&isbn=1234567890. This way, you can pass any number of HTTP parameters to the URL, and these parameters can be accessed in your server-side code. For example, in a JSP or servlet, you could use request.getParameter("title") to get the HTTP request parameter.
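To put these pieces together, the following is a minimal illustrative sketch (the service id, URL, and handler names here are our own, not from the original example) showing an HTTPService declared in MXML along with its result and fault handlers:

<mx:HTTPService id="bookService"
    url="http://www.mydomain.com/books.jsp"
    method="GET"
    resultFormat="e4x"
    result="resultHandler(event)"
    fault="faultHandler(event)"/>

<mx:Script>
  <![CDATA[
    import mx.rpc.events.ResultEvent;
    import mx.rpc.events.FaultEvent;

    // Called when the server responds successfully.
    private function resultHandler(event:ResultEvent):void {
      trace(event.result); // the XML payload returned by the server
    }

    // Called when the request fails.
    private function faultHandler(event:FaultEvent):void {
      trace(event.fault.faultString);
    }

    // Trigger the request, for example from creationComplete.
    private function loadBooks():void {
      bookService.send();
    }
  ]]>
</mx:Script>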

article-image-6-javascript-micro-optimizations-need-know
Savia Lobo
05 Apr 2018
18 min read
Save for later

6 JavaScript micro optimizations you need to know

Savia Lobo
05 Apr 2018
18 min read
JavaScript micro optimizations can improve the performance of your JavaScript code. This means you can get it to do more, which is essential when thinking about the scale of modern web applications, as greater efficiency in code can lead to much stronger overall performance. Let us have a look at these micro optimizations in detail.

Truthy/falsy comparisons

We have all, at some point, written if conditions or assigned default values by relying on the truthy or falsy nature of JavaScript variables. As helpful as this is most of the time, we need to consider the impact that such an operation has on our application. However, before we jump into the details, let's discuss how a condition is evaluated in JavaScript, specifically an if condition in this case. As developers, we tend to do the following:

if (objOrNumber) {
  // do something
}

This works for most cases, unless the number is 0, in which case it gets evaluated to false. That is a very common edge case, and most of us catch it anyway. However, what does the JavaScript engine have to do to evaluate this condition? How does it know whether objOrNumber evaluates to true or false? Let's return to our ECMA-262 specs and pull out the if condition spec (https://www.ecma-international.org/ecma-262/5.1/#sec-12.5). The following is an excerpt:

Semantics

The production IfStatement : if ( Expression ) Statement else Statement is evaluated as follows:

1. Let exprRef be the result of evaluating Expression.
2. If ToBoolean(GetValue(exprRef)) is true, return the result of evaluating the first Statement.
3. Else, return the result of evaluating the second Statement.

Now, we note that whatever expression we pass goes through the following three steps:

1. Getting the exprRef from Expression.
2. GetValue is called on exprRef.
3. ToBoolean is called on the result of step 2.

Step 1 does not concern us much at this stage; think of it this way: an expression can be something like a == b or something like the shouldIEvaluateTheIFCondition() method call, that is, something that evaluates your condition.

Step 2 extracts the value of the exprRef, that is, 10, true, undefined. In this step, how the value is extracted depends on the type of the exprRef. You can refer to the details of GetValue in the spec.

Step 3 then converts the value extracted in step 2 into a Boolean value based on the following table (taken from https://www.ecma-international.org/ecma-262/5.1/#sec-9.2):

Undefined: false
Null: false
Boolean: the result equals the input argument (no conversion)
Number: false if the argument is +0, -0, or NaN; otherwise true
String: false if the argument is the empty string; otherwise true
Object: true

At each step, you can see that it is always beneficial if we are able to provide a direct Boolean value instead of a truthy or falsy one.

Looping optimizations

We could do a deep dive into the for loop, similar to what we did with the if condition earlier (https://www.ecma-international.org/ecma-262/5.1/#sec-12.6.3), but there are easier and more obvious optimizations that can be applied when it comes to loops. Simple changes can drastically affect the quality and performance of code; consider this, for example:

for (var i = 0; i < arr.length; i++) {
  // logic
}

The preceding code can be changed so that the array length is looked up only once:

var len = arr.length;
for (var i = 0; i < len; i++) {
  // logic
}

What can be even better is to run the loop in reverse, which on some engines is faster still. Note that the counter has to start at len - 1, not len, to stay within the array bounds:

var len = arr.length;
for (var i = len - 1; i >= 0; i--) {
  // logic
}

You can time these variants yourself, as sketched below.
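The following is a minimal, illustrative micro-benchmark (the array size is arbitrary, and console.time is coarse, so treat the numbers as rough hints rather than proof):

var arr = new Array(10000000).fill(1);

console.time('length looked up every iteration');
for (var i = 0; i < arr.length; i++) { /* logic */ }
console.timeEnd('length looked up every iteration');

console.time('cached length');
var len = arr.length;
for (var j = 0; j < len; j++) { /* logic */ }
console.timeEnd('cached length');

console.time('reverse loop');
for (var k = len - 1; k >= 0; k--) { /* logic */ }
console.timeEnd('reverse loop');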
The conditional function call

Some of the features within our applications are conditional. For example, logging and analytics fall into this category. Some applications may have logging turned off for some time and then turned back on. The most obvious way of achieving this is to wrap the logging method within an if condition. However, since the method could be triggered many times, there is another way in which we can optimize this case:

function someUserAction() {
  // logic
  if (analyticsEnabled) {
    trackUserAnalytics();
  }
}

// in some other class
function trackUserAnalytics() {
  // save analytics
}

Instead of the preceding approach, we can try something only slightly different that allows V8-based engines to optimize the way the code is executed (note that here we pass the flag in as an argument, fixing the otherwise undefined enabled variable):

function someUserAction() {
  // logic
  trackUserAnalytics();
}

// in some other class
function toggleUserAnalytics(enabled) {
  if (enabled) {
    trackUserAnalytics = userAnalyticsMethod;
  } else {
    trackUserAnalytics = noOp;
  }
}

function userAnalyticsMethod() {
  // save analytics
}

// empty function
function noOp() {}

Now, the preceding implementation is a double-edged sword, and the reason for that is very simple. JavaScript engines employ a technique called inline caching (IC), which means that any previous lookup for a certain method performed by the JS engine will be cached and reused the next time it is triggered. For example, if we have an object with a nested method, a.b.c, the method a.b.c will be looked up only once and stored in the cache (IC); if a.b.c is called the next time, it will be picked up from the IC, and the JS engine will not parse the whole chain again. If there are any changes to the a.b.c chain, then the IC gets invalidated, and a new dynamic lookup is performed the next time instead of being retrieved from the IC.

So, in our previous example, when we have noOp assigned to trackUserAnalytics(), the method path gets tracked and saved within the IC, but the engine internally removes this function call because it is a call to an empty method. However, when it is applied to an actual function with some logic in it, the IC points directly to this new method. So, if we keep calling our toggleUserAnalytics() method multiple times, it keeps invalidating our IC, and our dynamic method lookup has to happen every time until the application state stabilizes (that is, until toggleUserAnalytics() is no longer called).
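To make the trade-off concrete, here is a small usage sketch (assuming, as in the version above, that the toggle receives the flag as an argument):

toggleUserAnalytics(true);  // trackUserAnalytics now points to userAnalyticsMethod
someUserAction();           // analytics are saved, with no if check on the hot path

toggleUserAnalytics(false); // trackUserAnalytics now points to noOp
someUserAction();           // the call is effectively free, but the IC was invalidated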
Image and font optimizations

When it comes to image and font optimizations, there are no limits to the types and the scale of optimization that we can perform. However, we need to keep in mind our target audience, and we need to tailor our approach based on the problem at hand.

With both images and fonts, the first and most important thing is that we do not overserve; that is, we request and send only the data that is necessary, by determining the dimensions of the device that our application is running on. The simplest way to do this is by adding a cookie for your device size and sending it to the server along with each request. Once the server receives a request for an image, it can then retrieve the image based on the dimensions that were sent in the cookie. Most of the time, these images are something like a user avatar or a list of people who commented on a certain post. We can agree that thumbnail images do not need to be the same size as those on the profile page, and we can save some bandwidth by transmitting a smaller image.

Since screens these days have very high Dots Per Inch (DPI), the media that we serve to them needs to be worthy of it. Otherwise, the application looks bad and the images look pixelated. This can be avoided by using vector images or SVGs, which can be gzipped over the wire, thus reducing the payload size.

Another not-so-obvious optimization is changing the image compression type. Have you ever loaded a page in which the image loads from top to bottom in small, incremental rectangles? By default, images are compressed using the baseline technique, which is a default method of compressing the image from top to bottom. We can change this to progressive compression using libraries such as imagemin. This loads the entire image first as blurred, then semi-blurred, and so on until the entire image is uncompressed and displayed on the screen. Uncompressing a progressive JPEG might take a little longer than a baseline one, so it is important to measure before making such optimizations.

Another extension based on this concept is a Chrome-only image format called WebP. This is a highly effective way of serving images, which serves a lot of companies in production and has saved almost 30% on bandwidth. Using WebP is almost as simple as the progressive compression discussed previously. We can use the imagemin-webp node module, which can convert a JPEG image into a webp image, thus reducing the image size to a great extent.

Web fonts are a little different from images. Images get downloaded and rendered onto the UI on demand, that is, when the browser encounters the image either in the HTML or in the CSS files. Fonts, on the other hand, are only requested when the Render Tree is completely constructed. That means that the CSSOM and DOM have to be ready by the time the request is dispatched for the fonts. Also, if the font files are being served from a server and not locally, then there are chances that we may see the text without the font applied first (or no text at all) and then see the font applied, which may cause a flashing effect of the text.

There are multiple simple techniques to avoid this problem:

Download, serve, and preload the font files locally:

<link rel="preload" href="fonts/my-font.woff2" as="font">

Specify the unicode-range in the font-face so that browsers can adapt and improvise on the character set and glyphs that are actually expected by the browser (note that inside CSS the comment syntax is /* */, not //):

@font-face {
  ...
  unicode-range: U+000-5FF; /* latin */
  ...
}

So far, we have seen how to get unstyled text loaded onto the UI and then styled as we expect; this can be improved on using the font loading API, which allows us to load and render the font using JavaScript:

var font = new FontFace("myFont", "url(/my-fonts/my-font.woff2)", {
  unicodeRange: 'U+000-5FF'
});

// initiate a fetch without the Render Tree
font.load().then(function() {
  // apply the font
  document.fonts.add(font);
  document.body.style.fontFamily = "myFont";
});
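Returning to the WebP conversion mentioned earlier, a minimal Node sketch using the imagemin and imagemin-webp modules might look like this (the paths and quality value are illustrative, and the promise-based API shown is the one from recent imagemin versions):

const imagemin = require('imagemin');
const imageminWebp = require('imagemin-webp');

(async () => {
  const files = await imagemin(['images/*.jpg'], {
    destination: 'build/images',               // converted files land here
    plugins: [imageminWebp({ quality: 75 })]   // lossy WebP at quality 75
  });
  console.log(`${files.length} images converted to WebP`);
})();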
Garbage collection in JavaScript

Let's take a quick look at what garbage collection (GC) is and how we can handle it in JavaScript. A lot of low-level languages provide explicit capabilities to developers to allocate and free memory in their code. However, unlike those languages, JavaScript automatically handles memory management, which is both a good and a bad thing. Good because we no longer have to worry about how much memory we need to allocate, when we need to do so, and how to free the assigned memory. The bad part about the whole process is that, to an uninformed developer, this can be a recipe for disaster: they can end up with an application that hangs and crashes.

Luckily for us, understanding the process of GC is quite easy and can be very easily incorporated into our coding style to make sure that we are writing optimal code when it comes to memory management. Memory management has three very obvious steps:

1. Assign memory to variables:

var a = 10; // we assign a number to a memory location referenced by variable a

2. Use the variables to read from or write to the memory:

a += 3; // we read the memory location referenced by a and write a new value to it

3. Free the memory when it's no longer needed.

Now, this is the part that is not explicit. How does the browser know when we are done with the variable a and it is ready to be garbage collected? Let's wrap this inside a function before we continue the discussion:

function test() {
  var a = 10;
  a += 3;
  return a;
}

We have a very simple function, which just adds to our variable a, returns the result, and finishes the execution. However, there is actually one more step, which happens after the execution of this method, called mark and sweep (not always immediately after; sometimes it happens after a batch of operations is completed on the main thread). When the browser performs mark and sweep depends on the total memory the application consumes and the speed at which that memory is being consumed.

Mark and sweep algorithm

Since there is no accurate way to determine whether the data at a particular memory location is going to be used in the future, we need to depend on alternatives that can help us make this decision. In JavaScript, we use the concept of a reference to determine whether a variable is still being used; if not, it can be garbage collected.

The concept of mark and sweep is very straightforward: which memory locations are reachable from the known active memory locations? If something is not reachable, collect it, that is, free the memory. That's it, but what are the known active memory locations? The process still needs a starting point, right? In most browsers, the GC algorithm keeps a list of roots from which the mark and sweep process can be started. All the roots and their children are marked as active, and any variable that can be reached from these roots is also marked as active. Anything that cannot be reached is marked as unreachable and thus collected. In most cases, the roots consist of the window object.

So, going back to our previous example:

function test() {
  var a = 10;
  a += 3;
  return a;
}

Our variable a is local to the test() method. As soon as the method has executed, there is no way to access that variable anymore, that is, no one holds any reference to it, and that is when it can be marked for garbage collection, so that the next time GC runs, the variable a will be swept and the memory allocated to it freed.
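For the curious, modern engines expose FinalizationRegistry (Node 14.6+, Chrome 84+), which makes this reachability rule observable; here is a small illustrative sketch (the names are our own) that registers an object and logs when the collector actually reclaims it:

const registry = new FinalizationRegistry(function (label) {
  console.log(label + ' was garbage collected');
});

function test() {
  var big = { data: new Array(1000000).join('*') };
  registry.register(big, 'big');   // watch this object, without keeping it alive
  return big.data.length;          // after this returns, nothing references `big`
}

test();
// Collection timing is entirely up to the engine; under `node --expose-gc`,
// calling global.gc() afterwards makes the callback fire reliably.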
Garbage collection and V8

When it comes to V8, the process of garbage collection is extremely complex (as it should be), so let's briefly discuss how V8 handles it. In V8, the memory (heap) is divided into two main generations: the new-space and the old-space. Both the new-space and the old-space are assigned some memory (between 1 MB and 20 MB). Most programs and their variables, when created, are assigned within the new-space. Whenever we create a new variable or perform an operation that consumes memory, it is by default assigned from the new-space, which is optimized for memory allocation. Once the total memory allocated to the new-space is almost completely consumed, the browser triggers a Minor GC, which removes the variables that are no longer being referenced and marks the variables that are still being referenced and cannot be removed yet. Once a variable survives two or more Minor GCs, it becomes a candidate for the old-space, where the GC cycle is not run as frequently as in the new-space. A Major GC is triggered when the old-space reaches a certain size; all of this is driven by the heuristics of the application, which is very important to the whole process. So, well-written programs move fewer objects into the old-space and thus trigger fewer Major GC events. Needless to say, this is a very high-level overview of what V8 does for garbage collection, and since this process keeps changing over time, we will switch gears and move on to the next topic.

Avoiding memory leaks

Now that we know, at a high level, what garbage collection in JavaScript is and how it works, let's take a look at some common pitfalls that prevent our variables from being marked for GC by the browser.

Assigning variables to the global scope

This should be pretty obvious by now; we discussed how the GC mechanism determines a root (which is the window object) and treats everything on the root and its children as active, never marking them for garbage collection. So, the next time you forget to add a var to your variable declaration, remember that the global variable you are creating will live forever and never get garbage collected:

function test() {
  a = 10; // created on the window object
  a += 3;
  return a;
}

Removing DOM elements and references

It's imperative that we keep our DOM references to a minimum, so a well-known step that we like to perform is caching DOM elements in our JavaScript so that we do not have to query them over and over. However, once the DOM elements are removed, we need to make sure that they are removed from our cache as well; otherwise, they will never get GC'd:

var cache = {
  row: document.getElementById('row')
};

function removeTable() {
  document.body.removeChild(document.getElementById('row'));
}

The code shown previously removes the row from the DOM, but the variable cache still refers to the DOM element, preventing it from being garbage collected. Another interesting thing to note here is that even if we removed the table containing the row, the entire table would remain in memory and not get GC'd, because the row in our cache internally refers to the table.
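A straightforward fix, sketched below, is to drop our own reference at the same time the node is removed, so that both the row and its parent table become unreachable:

function removeTable() {
  var row = cache.row;
  if (row && row.parentNode) {
    row.parentNode.removeChild(row);
  }
  cache.row = null; // release the last JavaScript reference
}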
Closures edge case

Closures are amazing; they help us deal with a lot of problematic scenarios and also provide us with ways to simulate the concept of private variables. All that is good, but sometimes we tend to overlook the potential downsides associated with closures. Here is what we know and use:

function myGoodFunc() {
  var a = new Array(10000000).join('*');
  // something big enough to cause a spike in memory usage
  function myGoodClosure() {
    return a + ' added from closure';
  }
  myGoodClosure();
}

setInterval(myGoodFunc, 1000);

When we run this script in the browser and then profile it, we see, as expected, that the method consumes a constant amount of memory and is then GC'd, restoring the baseline memory consumed by the script. Zooming into one of these spikes and looking at the call tree to determine which events are triggered around that time, we see that everything happens as per our expectation: first, our setInterval() is triggered, which calls myGoodFunc(), and once the execution is done, there is a GC, which collects the data, hence the spike.

Now, this was the expected flow, or the happy path, when dealing with closures. However, sometimes our code is not that simple, and we end up performing multiple things within one closure, sometimes even nesting closures:

function myComplexFunc() {
  var a = new Array(1000000).join('*');
  // something big enough to cause a spike in memory usage
  function closure1() {
    return a + ' added from closure';
  }
  closure1();
  function closure2() {
    console.log('closure2 called');
  }
  setInterval(closure2, 100);
}

setInterval(myComplexFunc, 1000);

We can note in the preceding code that we extended our method to contain two closures: closure1 and closure2. Although closure1 still performs the same operation as before, closure2 will run forever because we have it running at ten times the frequency of the parent function. Also, since both closure methods share the parent closure scope, in this case the variable a, that variable never gets GC'd and thus causes a huge memory leak. This shows up clearly in the profile: the GC is still being triggered, but because of the frequency at which the methods are being called, the memory slowly leaks (less memory is collected than is created).

Well, that was an extreme edge case, right? It's far more theoretical than practical; why would anyone nest two setInterval() methods with closures? Let's take a look at another example, in which we no longer nest multiple setInterval() calls, but which is driven by the same logic. Let's assume that we have a method that creates closures:

var something = null;

function replaceValue() {
  var previousValue = something;
  // the `unused` method loads `previousValue` into the closure scope
  function unused() {
    if (previousValue) console.log("hi");
  }
  // update something
  something = {
    str: new Array(1000000).join('*'),
    // All closures within replaceValue share the same closure scope, so
    // someMethod has access to previousValue, which is nothing but the
    // previous `something` object. Since someMethod keeps that scope alive,
    // the previous value never gets garbage collected, even when `something`
    // is replaced by a new (identical) object in the next setInterval
    // iteration, and so on for every iteration.
    someMethod: function () {}
  };
}

setInterval(replaceValue, 1000);

A simple fix follows from what we just said: the previous value of the object something doesn't get garbage collected because someMethod's scope still holds previousValue from the previous iteration. So, the solution is to clear out the value of previousValue at the end of each iteration, leaving nothing for the old something object to be retained by once it is replaced (see the sketch below); with this change, the memory profile returns to the healthy sawtooth pattern we saw earlier.
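The following is a minimal sketch of that fix; the structure mirrors the example above, and only the nulling line is new. Nulling the local variable works because all closures created in the same call share one scope object, so clearing the binding empties it for someMethod too:

var something = null;

function replaceValue() {
  var previousValue = something;
  function unused() {
    if (previousValue) console.log("hi");
  }
  something = {
    str: new Array(1000000).join('*'),
    someMethod: function () {}
  };
  previousValue = null; // nothing left in the shared scope to retain the old object
}

setInterval(replaceValue, 1000);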
So, the solution to this would be to clear out the value of the previousValue at the end of each iteration, thus leaving nothing for something to refer once it is unloaded, hence the memory profiling can be seen to change: The preceding image changes as follows: To summarize,  we introduced JavaScript micro-optimizations and memory optimizations that ultimately led to a high performance JavaScript. If you have found this post useful, do check out the book Hands-On Data Structures and Algorithms with JavaScript for solutions to implement complex data structures and algorithms in practical way.    