
How-To Tutorials - Data


The Scripting Capabilities of Elasticsearch

Packt
08 Jan 2016
19 min read
In this article by Rafał Kuć and Marek Rogozinski, authors of the book Elasticsearch Server - Third Edition, we look at the scripting functionalities of Elasticsearch. Even though scripts may seem to be a rather advanced topic, we will look at the possibilities offered by Elasticsearch, because scripts are priceless in certain situations.

Elasticsearch can use several languages for scripting. When not explicitly declared, it assumes that Groovy (http://www.groovy-lang.org/) is used. Other languages available out of the box are the Lucene expression language and Mustache (https://mustache.github.io/). Of course, we can use plugins that will make Elasticsearch understand additional scripting languages such as JavaScript, Mvel, or Python. One thing worth mentioning is this: independently of the scripting language that we choose, Elasticsearch exposes objects that we can use in our scripts. Let's start by briefly looking at what type of information we are allowed to use in our scripts.

(For more resources related to this topic, see here.)

Objects available during script execution

During different operations, Elasticsearch allows us to use different objects in our scripts. To develop a script that fits our use case, we should be familiar with those objects. For example, during a search operation, the following objects are available:

_doc (also available as doc): An instance of the org.elasticsearch.search.lookup.LeafDocLookup object. It gives us access to the current document found with the calculated score and field values.
_source: An instance of the org.elasticsearch.search.lookup.SourceLookup object. It provides access to the source of the current document and the values defined in the source.
_fields: An instance of the org.elasticsearch.search.lookup.LeafFieldsLookup object. It can be used to access the values of the document fields.

On the other hand, during a document update operation, the variables mentioned above are not accessible. Elasticsearch exposes only the ctx object with the _source property, which provides access to the document currently processed in the update request.

As we have previously seen, several methods are mentioned in the context of document fields and their values. Let's now look at examples of how to get the value for a particular field using the previously mentioned objects available during search operations. In the brackets, you can see what Elasticsearch will return for one of our example documents from the library index (we will use the document with identifier 4):

_doc.title.value (and)
_source.title (crime and punishment)
_fields.title.value (null)

A bit confusing, isn't it? During indexing, the original document is, by default, stored in the _source field. Of course, by default, all fields are present in that _source field. In addition to this, the document is parsed, and every field may be stored in the index if it is marked as stored (that is, if the store property is set to true; otherwise, by default, the fields are not stored). Finally, the field value may be configured as indexed. This means that the field value is analyzed and placed in the index. To sum up, one field may land in an Elasticsearch index in the following ways:

As part of the _source document
As a stored and unparsed original value
As an indexed value that is processed by an analyzer

In scripts, we have access to all of these field representations.
The only exception is the update operation, which—as we've mentioned before—gives us access to only the _source document as part of the ctx variable.

You may wonder which version you should use. Well, if we want access to the processed form, the answer would be simple—use the _doc object. What about _source and _fields? In most cases, _source is a good choice. It is usually fast and needs fewer disk operations than reading the original field values from the index. This is especially true when you need to read the values of multiple fields in your scripts—fetching a single _source field is faster than fetching multiple independent fields from the index.

Script types

Elasticsearch allows us to use scripts in three different ways:

Inline scripts: The source of the script is defined directly in the query
In-file scripts: The source is defined in an external file placed in the Elasticsearch config/scripts directory
As a document in a dedicated index: The source of the script is defined as a document in a special index available by using the /_scripts API endpoint

Choosing the way of defining scripts depends on several factors. If you have scripts that you will use in many different queries, the file or the dedicated index seems to be the best solution. Scripts in a file are probably less convenient, but they are preferred from the security point of view—they can't be overwritten and injected into your query, which could otherwise lead to a security breach.

In-file scripts

This is the only way that is turned on by default in Elasticsearch. The idea is that every script used by the queries is defined in its own file placed in the config/scripts directory. We will now look at this method of using scripts.

Let's create an example file called tag_sort.groovy and place it in the config/scripts directory of our Elasticsearch instance (or instances if we are running a cluster). The content of the mentioned file should look like this:

_doc.tags.values.size() > 0 ? _doc.tags.values[0] : '\u19999'

After a few seconds, Elasticsearch should automatically load the new file. You should see something like this in the Elasticsearch logs:

[2015-08-30 13:14:33,005][INFO ][script                   ] [Alex Wilder] compiling script file [/Users/negativ/Developer/ES/es-current/config/scripts/tag_sort.groovy]

If you have a multinode cluster, you have to make sure that the script is available on every node. Now we are ready to use this script in our queries. A modified query that uses our script stored in the file looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "file" : "tag_sort"
      },
      "type" : "string",
      "order" : "asc"
    }
  }
}'

Now let's look at the next possible way of defining a script: inline.

Inline scripts

Inline scripts are a more convenient way of using scripts, especially for constantly changing queries or ad hoc queries. The main drawback of such an approach is security. If we do this, we allow users to run any kind of query, including any kind of script that can be used by attackers. Such an attack can execute arbitrary code on the server running Elasticsearch with rights equal to the ones given to the user who is running Elasticsearch. In the worst-case scenario, an attacker could use security holes to gain superuser rights. This is why inline scripts are disabled by default.
After careful consideration, you can enable them by adding this to the elasticsearch.yml file:

script.inline: on

After allowing inline scripts to be executed, we can run a query that looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "inline" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : \"\u19999\""
      },
      "type" : "string",
      "order" : "asc"
    }
  }
}'

Indexed scripts

The last option for defining scripts is to store them in a dedicated Elasticsearch index. For the same security reasons, dynamic execution of indexed scripts is disabled by default. To enable indexed scripts, we have to add a configuration option similar to the one that we added to be able to use inline scripts. We need to add the following line to the elasticsearch.yml file:

script.indexed: on

After adding the above property to all the nodes and restarting the cluster, we will be ready to start using indexed scripts. Elasticsearch provides additional dedicated endpoints for this purpose. Let's store our script:

curl -XPOST 'localhost:9200/_scripts/groovy/tag_sort' -d '{
  "script" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : \"\u19999\""
}'

The script is ready, but let's discuss what we just did. We sent an HTTP POST request to the special _scripts REST endpoint. We also specified the language of the script (groovy in our case) and the name of the script (tag_sort). The body of the request is the script itself. We can now move on to the query, which looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "id" : "tag_sort"
      },
      "type" : "string",
      "order" : "asc"
    }
  }
}'

As we can see, this query is practically identical to the query used with the script defined in a file. The only difference is the id parameter instead of file.

Querying with scripts

If we look at any request made to Elasticsearch that uses scripts, we will notice some common properties, which are as follows:

script: The property that wraps the script definition.
inline: The property holding the code of the script itself.
id: The property that defines the identifier of the indexed script.
file: The filename (without extension) of the script definition when an in-file script is used.
lang: The property defining the script language. If it is omitted, Elasticsearch assumes groovy.
params: An object containing parameters and their values. Every defined parameter can be used inside the script by specifying that parameter name. Parameters allow us to write cleaner code that will be executed in a more efficient manner. Scripts that use parameters are executed faster than code with embedded constants because of caching.

Scripting with parameters

As our scripts become more and more complicated, the need for creating multiple, almost identical scripts can appear. Those scripts usually differ in the values used, with the logic behind them being exactly the same. In our simple example, we used a hardcoded value to mark documents with an empty tags list. Let's change this so that the value can be passed in as a parameter. Let's use the in-file script definition and create the tag_sort_with_param.groovy file with the following contents:
_doc.tags.values.size() > 0 ? _doc.tags.values[0] : tvalue

The only change we've made is the introduction of a parameter named tvalue, which can be set in the query in the following way:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "file" : "tag_sort_with_param",
        "params" : {
          "tvalue" : "000"
        }
      },
      "type" : "string",
      "order" : "asc"
    }
  }
}'

The params section defines all the script parameters. In our simple example, we've only used a single parameter, but of course, we can have multiple parameters in a single query.

Script languages

The default language for scripting is Groovy. However, you are not limited to only a single scripting language when using Elasticsearch. In fact, if you would like to, you can even use Java to write your scripts. In addition to that, the community behind Elasticsearch provides support for more languages as plugins. So, if you are willing to install plugins, you can extend the list of scripting languages that Elasticsearch supports even further.

You may wonder why you should even consider using a scripting language other than the default Groovy. The first reason is your own preferences. If you are a Python enthusiast, you are probably now thinking about how to use Python for your Elasticsearch scripts. The other reason could be security. When we talked about inline scripts, we told you that inline scripts are turned off by default. This is not exactly true for all the scripting languages available out of the box. Inline scripts are disabled by default when using Groovy, but you can use Lucene expressions and Mustache without any issues. This is because those languages are sandboxed, which means that security-sensitive functions are turned off. And of course, the last factor when choosing the language is performance. Theoretically, native scripts (in Java) should have better performance than others, but you should remember that the difference can be insignificant. You should always consider the cost of development and measure the performance.

Using something other than embedded languages

Using Groovy for scripting is a simple and sufficient solution for most use cases. However, you may have a different preference and would like to use something different, such as JavaScript, Python, or Mvel. For now, we'll just run the following command from the Elasticsearch directory:

bin/plugin install elasticsearch/elasticsearch-lang-javascript/2.7.0

The preceding command will install a plugin that will allow the use of JavaScript as the scripting language. The only change we should make in the request is putting in additional information about the language we are using for scripting. And of course, we have to modify the script itself to correctly use the new language. Look at the following example:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "inline" : "_doc.tags.values.length > 0 ? _doc.tags.values[0] : \"\u19999\";",
        "lang" : "javascript"
      },
      "type" : "string",
      "order" : "asc"
    }
  }
}'

As you can see, we've used JavaScript for scripting instead of the default Groovy. The lang parameter informs Elasticsearch about the language being used.

Using native code

If the scripts are too slow or if you don't like scripting languages, Elasticsearch allows you to write Java classes and use them instead of scripts.
There are two possible ways of adding native scripts: adding classes that define scripts to the Elasticsearch classpath, or adding a script as a functionality provided by plugin. We will describe the second solution as it is more elegant. The factory implementation We need to implement at least two classes to create a new native script. The first one is a factory for our script. For now, let's focus on it. The following sample code illustrates the factory for our script: package pl.solr.elasticsearch.examples.scripts; import java.util.Map; import org.elasticsearch.common.Nullable; import org.elasticsearch.script.ExecutableScript; import org.elasticsearch.script.NativeScriptFactory; public class HashCodeSortNativeScriptFactory implements NativeScriptFactory {     @Override     public ExecutableScript newScript(@Nullable Map<String, Object> params) {         return new HashCodeSortScript(params);     }   @Override   public boolean needsScores() {     return false;   } } This class should implement the org.elasticsearch.script.NativeScriptFactory class. The interface forces us to implement two methods. The newScript() method takes the parameters defined in the API call and returns an instance of our script. Finally, needsScores() informs Elasticsearch if we want to use scoring and that it should be calculated. Implementing the native script Now let's look at the implementation of our script. The idea is simple—our script will be used for sorting. The documents will be ordered by the hashCode() value of the chosen field. Documents without a value in the defined field will be first on the results list. We know that the logic doesn't make much sense, but it is good for presentation as it is simple. The source code for our native script looks as follows: package pl.solr.elasticsearch.examples.scripts; import java.util.Map; import org.elasticsearch.script.AbstractSearchScript; public class HashCodeSortScript extends AbstractSearchScript {   private String field = "name";   public HashCodeSortScript(Map<String, Object> params) {     if (params != null && params.containsKey("field")) {       this.field = params.get("field").toString();     }   }   @Override   public Object run() {     Object value = source().get(field);     if (value != null) {       return value.hashCode();     }     return 0;   } } First of all, our class inherits from the org.elasticsearch.script.AbstractSearchScript class and implements the run() method. This is where we get the appropriate values from the current document, process it according to our strange logic, and return the result. You may notice the source() call. Yes, it is exactly the same _source parameter that we met in the non-native scripts. The doc() and fields() methods are also available, and they follow the same logic that we described earlier. The thing worth looking at is how we've used the parameters. We assume that a user can put the field parameter, telling us which document field will be used for manipulation. We also provide a default value for this parameter. The plugin definition We said that we will install our script as a part of a plugin. This is why we need additional files. 
The first file is the plugin initialization class, where we can tell Elasticsearch about our new script: package pl.solr.elasticsearch.examples.scripts; import org.elasticsearch.plugins.Plugin; import org.elasticsearch.script.ScriptModule; public class ScriptPlugin extends Plugin {   @Override   public String description() {    return "The example of native sort script";   }   @Override   public String name() {     return "naive-sort-plugin";   }   public void onModule(final ScriptModule module) {     module.registerScript("native_sort",       HashCodeSortNativeScriptFactory.class);   } } The implementation is easy. The description() and name() methods are only for information purposes, so let's focus on the onModule() method. In our case, we need access to script module—the Elasticsearch service connected with scripts and scripting languages. This is why we define onModule() with one ScriptModule argument. Thanks to Elasticsearch magic, we can use this module and register our script so that it can be found by the engine. We have used the registerScript() method, which takes the script name and the previously defined factory class. The second file needed is a plugin descriptor file: plugin-descriptor.properties. It defines the constants used by the Elasticsearch plugin subsystem. Without thinking more, let's look at the contents of this file: jvm=true classname=pl.solr.elasticsearch.examples.scripts.ScriptPlugin elasticsearch.version=2.0.0-beta2-SNAPSHOT version=0.0.1-SNAPSHOT name=native_script description=Example Native Scripts java.version=1.7 The appropriate lines have the following meaning: jvm: This tells Elasticsearch that our file contains Java code classname: This describes the main class with the plugin definition elasticsearch.version and java.version: They tell about the Elasticsearch and Java versions needed for our plugin name and description: These are an informative name and a short description of our plugin And that's it! We have all the files needed to fire our script. Note that now it is quite convenient to add new scripts and pack them as a single plugin. Installing a plugin Now it's time to install our native script embedded in the plugin. After packing the compiled classes as a JAR archive, we should put it into the Elasticsearch plugins/native-script directory. The native-script part is a root directory for our plugin and you may name it as you wish. In this directory, you also need the prepared plugin-descriptor.properties file. This makes our plugin visible to Elasticsearch. Running the script After restarting Elasticsearch (or the entire cluster if you are running more than a single node), we can start sending the queries that use our native script. For example, we will send a query that uses our previously indexed data from the library index. This example query looks as follows: curl -XGET 'localhost:9200/library/_search?pretty' -d '{   "query" : {     "match_all" : { }   },   "sort" : {     "_script" : {       "script" : {         "script" : "native_sort",         "lang" : "native",         "params" : {           "field" : "otitle"         }       },       "type" : "string",       "order" : "asc"     }   } }' Note the params part of the query. In this call, we want to sort on the otitle field. We provide the script name as native_sort and the script language as native. This is required. If everything goes well, we should see our results sorted by our custom sort logic. 
If we look at the response from Elasticsearch, we will see that documents without the otitle field are at the first few positions of the results list and their sort value is 0. Summary In this article, we focused on querying, but not about the matching part of it—mostly about scoring. You learned how Apache Lucene TF/IDF scoring works. We saw the scripting capabilities of Elasticsearch and handled multilingual data. We also used boosting to influence how scores of returned documents were calculated and we used synonyms. Finally, we used explain information to see how document scores were calculated by query. Resources for Article:   Further resources on this subject: An Introduction to Kibana [article] Indexing the Data [article] Low-Level Index Control [article]


Advanced Shiny Functions

Packt
08 Jan 2016
14 min read
In this article by Chris Beeley, author of the book, Web Application Development with R using Shiny - Second Edition, we are going to extend our toolkit by learning about advanced Shiny functions. These allow you to take control of the fine details of your application, including the interface, reactivity, data, and graphics. We will cover the following topics: Learn how to show and hide parts of the interface Change the interface reactively Finely control reactivity, so functions and outputs run at the appropriate time Use URLs and reactive Shiny functions to populate and alter the selections within an interface Upload and download data to and from a Shiny application Produce beautiful tables with the DataTables jQuery library (For more resources related to this topic, see here.) Summary of the application We're going to add a lot of new functionality to the application, and it won't be possible to explain every piece of code before we encounter it. Several of the new functions depend on at least one other function, which means that you will see some of the functions for the first time, whereas a different function is being introduced. It's important, therefore, that you concentrate on whichever function is being explained and wait until later in the article to understand the whole piece of code. In order to help you understand what the code does as you go along it is worth quickly reviewing the actual functionality of the application now. In terms of the functionality, which has been added to the application, it is now possible to select not only the network domain from which browser hits originate but also the country of origin. The draw map function now features a button in the UI, which prevents the application from updating the map each time new data is selected, the map is redrawn only when the button is pushed. This is to prevent minor updates to the data from wasting processor time before the user has finished making their final data selection. A Download report button has been added, which sends some of the output as a static file to a new webpage for the user to print or download. An animated graph of trend has been added; this will be explained in detail in the relevant section. Finally, a table of data has been added, which summarizes mean values of each of the selectable data summaries across the different countries of origin. Downloading data from RGoogleAnalytics The code is given and briefly summarized to give you a feeling for how to use it in the following section. Note that my username and password have been replaced with XXXX; you can get your own user details from the Google Analytics website. Also, note that this code is not included on the GitHub because it requires the username and password to be present in order for it to work: library(RGoogleAnalytics) ### Generate the oauth_token object oauth_token <- Auth(client.id = "xxxx", client.secret = "xxxx") # Save the token object for future sessions save(oauth_token, file = "oauth_token") Once you have your client.id and client.secret from the Google Analytics website, the preceding code will direct you to a browser to authenticate the application and save the authorization within oauth_token. This can be loaded in future sessions to save from reauthenticating each time as follows: # Load the token object and validate for new run load("oauth_token") ValidateToken(oauth_token) The preceding code will load the token in subsequent sessions. 
The validateToken() function is necessary each time because the authorization will expire after a time this function will renew the authentication: ## list of metrics and dimensions query.list <- Init(start.date = "2013-01-01", end.date = as.character(Sys.Date()), dimensions = "ga:country,ga:latitude,ga:longitude, ga:networkDomain,ga:date", metrics = "ga:users,ga:newUsers,ga:sessions, ga:bounceRate,ga:sessionDuration", max.results = 10000, table.id = "ga:71364313") gadf = GetReportData(QueryBuilder(query.list), token = oauth_token, paginate_query = FALSE) Finally, the metrics and dimensions of interest (for more on metrics and dimensions, see the documentation of the Google Analytics API online) are placed within a list and downloaded with the GetReportData() function as follows: ...[data tidying functions]... save(gadf, file = "gadf.Rdata") The data tidying that is carried out at the end is omitted here for brevity, as you can see at the end the data is saved as gadf.Rdata ready to load within the application. Animation Animation is surprisingly easy. The sliderInput() function, which gives an HTML widget that allows the selection of a number along a line, has an optional animation function that will increment a variable by a set amount every time a specified unit of time elapses. This allows you to very easily produce a graphic that animates. In the following example, we are going to look at the monthly graph and plot a linear trend line through the first 20% of the data (0–20% of the data). Then, we are going to increment the percentage value that selects the portion of the data by 5% and plot a linear through that portion of data (5–25% of the data). Then, increment again from 10% to 30% and plot another line and so on. There is a static image in the following screenshot: The slider input is set up as follows, with an ID, label, minimum value, maximum value, initial value, step between values, and the animation options, giving the delay in milliseconds and whether the animation should loop: sliderInput("animation", "Trend over time", min = 0, max = 80, value = 0, step = 5, animate = animationOptions(interval = 1000, loop = TRUE) ) Having set this up, the animated graph code is pretty simple, looking very much like the monthly graph data except with the linear smooth based on a subset of the data instead of the whole dataset. The graph is set up as before and then a subset of the data is produced on which the linear smooth can be based: groupByDate <- group_by(passData(), YearMonth, networkDomain) %>% summarise(meanSession = mean(sessionDuration, na.rm = TRUE), users = sum(users), newUsers = sum(newUsers), sessions = sum(sessions)) groupByDate$Date <- as.Date(paste0(groupByDate$YearMonth, "01"), format = "%Y%m%d") smoothData <- groupByDate[groupByDate$Date %in% quantile(groupByDate$Date, input$animation / 100, type = 1): quantile(groupByDate$Date, (input$animation + 20) / 100, type = 1), ] We won't get too distracted by this code, but essentially, it tests to see which of the whole date range falls in a range defined by percentage quantiles based on the sliderInput() values. See ?quantile for more information. 
Finally, the linear smooth is drawn with an extra data argument to tell ggplot2 to base the line only on the smaller smoothData object and not the whole range: ggplot(groupByDate, aes_string(x = "Date", y = input$outputRequired, group = "networkDomain", colour = "networkDomain") ) + geom_line() + geom_smooth(data = smoothData, method = "lm", colour = "black" ) Not bad for a few lines of code. We have both ggplot2 and Shiny to thank for how easy this is. Streamline the UI by hiding elements This is a simple function that you are certainly going to need if you build even a moderately complex application. Those of you who have been doing extra credit exercises and/or experimenting with your own applications will probably have already wished for this or, indeed, have already found it. conditionalPanel() allows you to show/hide UI elements based on other selections within the UI. The function takes a condition (in JavaScript, but the form and syntax will be familiar from many languages) and a UI element and displays the UI only when the condition is true. This has actually used a couple of times in the advanced GA application, and indeed in all the applications, I've ever written of even moderate complexity. We're going to show the option to smooth the trend graph only when the trend graph tab is displayed, and we're going to show the controls for the animated graph only when the animated graph tab is displayed. Naming tabPanel elements In order to allow testing for which tab is currently selected, we're going to have to first give the tabs of the tabbed output names. This is done as follows (with the new code in bold): tabsetPanel(id = "theTabs", # give tabsetPanel a name tabPanel("Summary", textOutput("textDisplay"), value = "summary"), tabPanel("Trend", plotOutput("trend"), value = "trend"), tabPanel("Animated", plotOutput("animated"), value = "animated"), tabPanel("Map", plotOutput("ggplotMap"), value = "map"), tabPanel("Table", DT::dataTableOutput("countryTable"), value = "table") As you can see, the whole panel is given an ID (theTabs) and then each tabPanel is also given a name (summary, trend, animated, map, and table). They are referred to in the server.R file very simply as input$theTabs. Finally, we can make our changes to ui.R to remove parts of the UI based on tab selection: conditionalPanel( condition = "input.theTabs == 'trend'", checkboxInput("smooth", label = "Add smoother?", # add smoother value = FALSE) ), conditionalPanel( condition = "input.theTabs == 'animated'", sliderInput("animation", "Trend over time", min = 0, max = 80, value = 0, step = 5, animate = animationOptions(interval = 1000, loop = TRUE) ) ) As you can see, the condition appears very R/Shiny-like, except with the . operator familiar to JavaScript users in place of $. This is a very simple but powerful way of making sure that your UI is not cluttered with an irrelevant material. Beautiful tables with DataTable The latest version of Shiny has added support to draw tables using the wonderful DataTables jQuery library. This will enable your users to search and sort through large tables very easily. To see DataTable in action, visit the homepage at http://datatables.net/. The version in this application summarizes the values of different variables across the different countries from which browser hits originate and looks as follows: The package can be installed using install.packages("DT") and needs to be loaded in the preamble to the server.R file with library(DT). 
Once this is done, using the package is quite straightforward. There are two functions: one in server.R (renderDataTable) and the other in ui.R (dataTableOutput). They are used as follows:

### server.R

output$countryTable <- DT::renderDataTable ({
  groupCountry <- group_by(passData(), country)
  groupByCountry <- summarise(groupCountry,
    meanSession = mean(sessionDuration),
    users = log(sum(users)),
    sessions = log(sum(sessions))
  )
  datatable(groupByCountry)
})

### ui.R

tabPanel("Table", DT::dataTableOutput("countryTable"), value = "table")

Anything that returns a dataframe or a matrix can be used within renderDataTable(). Note that as of Shiny v0.12, the Shiny functions renderDataTable() and dataTableOutput() are deprecated: you should use the DT equivalents of the same name, as in the preceding code; adding DT:: before each function name specifies that the function should be drawn from that package.

Reactive user interfaces

Another trick you will definitely want up your sleeve at some point is a reactive user interface. This enables you to change your UI (for example, the number or content of radio buttons) based on reactive functions. For example, consider an application that I wrote related to survey responses across a broad range of health services in different areas. The services are related to each other in quite a complex hierarchy, and over time, different areas and services respond (or cease to exist, or merge, or change their name), which means that for each time period the user might be interested in, there would be a totally different set of areas and services. The only sensible solution to this problem is to have the user tell you which area and date range they are interested in and then give them back the correct list of services that have survey responses within that area and date range.

The example we're going to look at is a little simpler than this, just to keep from getting bogged down in too much detail, but the principle is exactly the same, and you should not find this idea too difficult to adapt to your own UI. We are going to allow users to constrain their data by the country of origin of the browser hit. Although we could design the UI by simply taking all the countries that exist in the entire dataset and placing them all in a combo box to be selected, it is a lot cleaner to only allow the user to select from the countries that are actually present within the particular date range they have selected. This has the added advantage of preventing the user from selecting any countries of origin that do not have any browser hits within the currently selected dataset. In order to do this, we are going to create a reactive user interface, that is, one that changes based on data values that come about from user input.

Reactive user interface example – server.R

When you are making a reactive user interface, the big difference is that instead of writing your UI definition in your ui.R file, you place it in server.R and wrap it in renderUI(). Then, point to it from your ui.R file. Let's have a look at the relevant bit of the server.R file:

output$reactCountries <- renderUI({
  countryList = unique(as.character(passData()$country))
  selectInput("theCountries", "Choose country", countryList)
})

The first line takes the reactive dataset that contains only the data between the dates selected by the user and gives all the unique values of countries within it. The second line is a widget type we have not used yet, which generates a combo box.
The usual id and label arguments are given, followed by the values that the combo box can take. These are taken from the variable defined in the first line.

Reactive user interface example – ui.R

The ui.R file merely needs to point to the reactive definition, as shown in the following line of code (just add it in to the list of widgets within sidebarPanel()):

uiOutput("reactCountries")

You can now point to the value of the widget in the usual way as input$theCountries. Note that you do not use the name as defined in the call to renderUI(), that is, reactCountries, but rather the name as defined within it, that is, theCountries.

Progress bars

It is quite common within Shiny applications, and in analytics generally, to have computations or data fetches that take a long time. However, even using all these tools, it will sometimes be necessary for the user to wait some time before their output is returned. In cases like this, it is good practice to do two things: first, to inform the user that the server is processing the request and has not simply crashed or otherwise failed, and second, to give the user some idea of how much time has elapsed since they requested the output and how much time they have remaining to wait. This is achieved very simply in Shiny using the withProgress() function. This function defaults to measuring progress on a scale from 0 to 1 and produces a loading bar at the top of the application with the information from the message and detail arguments of the loading function.

As you can see in the following code, the withProgress function is used to wrap a function (in this case, the function that draws the map), with message and detail arguments describing what is happening and an initial value of 0 (value = 0, that is, no progress yet):

withProgress(message = 'Please wait',
  detail = 'Drawing map...', value = 0, {
    ... function code...
  }
)

As the code is stepped through, the value of progress can steadily be increased from 0 to 1 (for example, in a for() loop) using the following code:

incProgress(1/3)

The third time this is called, the value of progress will be 1, which indicates that the function has completed (although other values of progress can be selected where necessary; see ?withProgress()). To summarize, the finished code looks as follows:

withProgress(message = 'Please wait',
  detail = 'Drawing map...', value = 0, {
    ... function code...
    incProgress(1/3)
    ... function code...
    incProgress(1/3)
    ... function code...
    incProgress(1/3)
  }
)

It's very simple. Again, have a look at the application to see it in action.

Summary

In this article, you have now seen most of the functionality within Shiny. It's a relatively small but powerful toolbox with which you can build a vast array of useful and intuitive applications with comparatively little effort. In this respect, ggplot2 is rather a good companion for Shiny because it too offers you a fairly limited selection of functions with which knowledgeable users can very quickly build many different graphical outputs.

Resources for Article:

Further resources on this subject:
Introducing R, RStudio, and Shiny [article]
Introducing Bayesian Inference [article]
R ─ Classification and Regression Trees [article]


Understanding the ELF specimen

Packt
07 Jan 2016
21 min read
In this article by Ryan O'Neill, author of the book Learning Linux Binary Analysis, ELF will be discussed. In order to reverse-engineer Linux binaries, we must understand the binary format itself. ELF has become the standard binary format for UNIX and UNIX-Flavor OS's. Binary formats such as ELF are not generally a quick study, and to learn ELF requires some degree of application of the different components that you learn as you go. Programming things that perform tasks such as binary parsing will require learning some ELF, and in the act of programming such things, you will in-turn learn ELF better and more proficiently as you go along. ELF is often thought to be a dry and complicated topic to learn, and if one were to simply read through the ELF specs without applying them through the spirit of creativity, then indeed it would be. ELF is really quite an incredible composition of computer science at work, with program layout, program loading, dynamic linking, symbol table lookups, and many other tightly orchestrated components. (For more resources related to this topic, see here.) ELF section headers Now that we've looked at what program headers are, it is time to look at section headers. I really want to point out here the distinction between the two; I often hear people calling sections "segments" and vice versa. A section is not a segment. Segments are necessary for program execution, and within segments are contained different types of code and data which are separated within sections and these sections always exist, and usually they are addressable through something called section-headers. Section-headers are what make sections accessible, but if the section-headers are stripped (Missing from the binary), it doesn't mean that the sections are not there. Sections are just data or code. This data or code is organized across the binary in different sections. The sections themselves exist within the boundaries of the text and data segment. Each section contains either code or data of some type. The data could range from program data, such as global variables, or dynamic linking information that is necessary for the linker. Now, as mentioned earlier, every ELF object has sections, but not all ELF objects have section headers. Usually this is because the executable has been tampered with (Such as the section headers having been stripped so that debugging is harder). All of GNU's binutils, such as objcopy, objdump, and other tools such as gdb, rely on the section-headers to locate symbol information that is stored in the sections specific to containing symbol data. Without section-headers, tools such as gdb and objdump are nearly useless. Section-headers are convenient to have for granular inspection over what parts or sections of an ELF object we are viewing. In fact, section-headers make reverse engineering a lot easier, since they provide us with the ability to use certain tools that require them. If, for instance, the section-header table is stripped, then we can't access a section such as .dynsym, which contains imported/exported symbols describing function names and offsets/addresses. Even if a section-header table has been stripped from an executable, a moderate reverse engineer can actually reconstruct a section-header table (and even part of a symbol table) by getting information from certain program headers, since these will always exist in a program or shared library. 
We discussed the dynamic segment earlier and the different DT_TAGs that contain information about the symbol table and relocation entries. This is what a 32 bit ELF section-header looks like: typedef struct {     uint32_t   sh_name; // offset into shdr string table for shdr name     uint32_t   sh_type; // shdr type I.E SHT_PROGBITS     uint32_t   sh_flags; // shdr flags I.E SHT_WRITE|SHT_ALLOC     Elf32_Addr sh_addr;  // address of where section begins     Elf32_Off  sh_offset; // offset of shdr from beginning of file     uint32_t   sh_size;   // size that section takes up on disk     uint32_t   sh_link;   // points to another section     uint32_t   sh_info;   // interpretation depends on section type     uint32_t   sh_addralign; // alignment for address of section     uint32_t   sh_entsize;  // size of each certain entries that may be in section } Elf32_Shdr; Let's take a look at some of the most important section types, once again allowing room to study the ELF(5) man pages and the official ELF specification for more detailed information about the sections. .text The .text section is a code section that contains program code instructions. In an executable program where there are also phdr, this section would be within the range of the text segment. Because it contains program code, it is of the section type SHT_PROGBITS. .rodata The rodata section contains read-only data, such as strings, from a line of C code: printf("Hello World!n"); These are stored in this section. This section is read-only, and therefore must exist in a read-only segment of an executable. So, you would find .rodata within the range of the text segment (Not the data segment). Because this section is read-only, it is of the type SHT_PROGBITS. .plt The Procedure linkage table (PLT) contains code that is necessary for the dynamic linker to call functions that are imported from shared libraries. It resides in the text segment and contains code, so it is marked as type SHT_PROGBITS. .data The data section, not to be confused with the data segment, will exist within the data segment and contain data such as initialized global variables. It contains program variable data, so it is marked as SHT_PROGBITS. .bss The bss section contains uninitialized global data as part of the data segment, and therefore takes up no space on the disk other than 4 bytes, which represents the section itself. The data is initialized to zero at program-load time, and the data can be assigned values during program execution. The bss section is marked as SHT_NOBITS, since it contains no actual data. .got The Global offset table (GOT) section contains the global offset table. This works together with the PLT to provide access to imported shared library functions, and is modified by the dynamic linker at runtime. This section has to do with program execution and is therefore marked as SHT_PROGBITS. .dynsym The dynsym section contains dynamic symbol information imported from shared libraries. It is contained within the text segment and is marked as type SHT_DYNSYM. .dynstr The dynstr section contains the string table for dynamic symbols; this has the name of each symbol in a series of null terminated strings. .rel.* Relocation sections contain information about how the parts of an ELF object or process image need to be fixed up or modified at linking or runtime. .hash The hash section, sometimes called .gnu.hash, contains a hash table for symbol lookup. 
The following hash algorithm is used for symbol name lookups in Linux ELF:

uint32_t dl_new_hash (const char *s)
{
        uint32_t h = 5381;

        for (unsigned char c = *s; c != '\0'; c = *++s)
                h = h * 33 + c;

        return h;
}

.symtab

The symtab section contains symbol information of the type ElfN_Sym. The symtab section is marked as type SHT_SYMTAB as it contains symbol information.

.strtab

This section contains the symbol string table that is referenced by the st_name entries within the ElfN_Sym structs of .symtab, and is marked as type SHT_STRTAB since it contains a string table.

.shstrtab

The shstrtab section contains the section header string table, which is a set of null terminated strings containing the names of each section, such as .text, .data, and so on. This section is pointed to by the ELF file header entry called e_shstrndx, which holds the offset of .shstrtab. This section is marked as SHT_STRTAB since it contains a string table.

.ctors and .dtors

The .ctors (constructors) and .dtors (destructors) sections contain code for initialization and finalization, which is executed before and after the actual main() body of program code. The __constructor__ function attribute is often used by hackers and virus writers to implement a function that performs an anti-debugging trick, such as calling PTRACE_TRACEME, so that the process traces itself and no debuggers can attach themselves to it. This way, the anti-debugging mechanism gets executed before the program enters main().

There are many other section names and types, but we have covered most of the primary ones found in a dynamically linked executable. One can now visualize how an executable is laid out with both phdrs and shdrs.
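To make the section-header table a little more concrete, the following is a minimal sketch (an illustration written for this article's 32-bit examples, not code taken from the book) that walks the section-header table of an ELF file using the Elf32_Shdr structure shown earlier and prints each section's name, type, and address, roughly a stripped-down readelf -S. It assumes a well-formed 32-bit ELF file and omits checks of the ELF magic and class for brevity.

/* shdr_walk.c — illustrative only: print name, type, and address of every
 * section in a 32-bit ELF file by walking its section-header table.
 * Build: gcc shdr_walk.c -o shdr_walk && ./shdr_walk ./some_32bit_binary */
#include <elf.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <32-bit ELF file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    struct stat st;
    fstat(fd, &st);
    uint8_t *mem = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    Elf32_Ehdr *ehdr = (Elf32_Ehdr *)mem;                   /* ELF file header      */
    Elf32_Shdr *shdr = (Elf32_Shdr *)(mem + ehdr->e_shoff); /* section-header table */

    /* e_shstrndx is the index of .shstrtab, which holds the section names */
    char *shstrtab = (char *)(mem + shdr[ehdr->e_shstrndx].sh_offset);

    for (int i = 0; i < ehdr->e_shnum; i++)
        printf("[%2d] %-20s type: %u addr: 0x%08x\n",
               i, &shstrtab[shdr[i].sh_name],
               shdr[i].sh_type, shdr[i].sh_addr);

    munmap(mem, st.st_size);
    close(fd);
    return 0;
}

Running it against a dynamically linked 32-bit binary would list sections such as .text, .rodata, .got, and .dynsym along with their addresses.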
ELF Relocations

From the ELF(5) man pages:

Relocation is the process of connecting symbolic references with symbolic definitions. Relocatable files must have information that describes how to modify their section contents, thus allowing executable and shared object files to hold the right information for a process's program image. Relocation entries are these data.

The process of relocation relies on symbols, which is why we covered symbols first. An example of relocation might be a couple of relocatable objects (ET_REL) being linked together to create an executable. obj1.o wants to call a function, foo(), located in obj2.o. Both obj1.o and obj2.o are being linked to create a fully working executable; they are currently position independent code (PIC), but once relocated to form an executable, they will no longer be position independent, since symbolic references will be resolved into symbolic definitions. The term "relocated" means exactly that: a piece of code or data is being relocated from a simple offset in an object file to some memory address location in an executable, and anything that references that relocated code or data must also be adjusted.

Let's take a quick look at a 32 bit relocation entry:

typedef struct {
    Elf32_Addr r_offset;
    uint32_t   r_info;
} Elf32_Rel;

And some relocation entries require an addend:

typedef struct {
    Elf32_Addr r_offset;
    uint32_t   r_info;
    int32_t    r_addend;
} Elf32_Rela;

Following is the description of the preceding snippet:

r_offset: This points to the location (offset or address) that requires the relocation action (which is going to be some type of modification)
r_info: This gives both the symbol table index with respect to which the relocation must be made, and the type of relocation to apply
r_addend: This specifies a constant addend used to compute the value stored in the relocatable field

Let's take a look at the source code for obj1.o:

_start()
{
  foo();
}

We see that it calls the function foo(); however, foo() is not located within the source code or the compiled object file, so a relocation entry will be necessary for the symbolic reference:

ryan@alchemy:~$ objdump -d obj1.o
obj1.o:     file format elf32-i386
Disassembly of section .text:
00000000 <func>:
   0:  55                     push   %ebp
   1:  89 e5                  mov    %esp,%ebp
   3:  83 ec 08               sub    $0x8,%esp
   6:  e8 fc ff ff ff         call   7 <func+0x7>
   b:  c9                     leave
   c:  c3                     ret

As we can see, the call to foo() is highlighted and simply calls to nowhere; 7 is the offset of itself. So, when obj1.o, which calls foo() (located in obj2.o), is linked with obj2.o to make an executable, a relocation entry is there to point at offset 7, which is the data that needs to be modified, changing it to the offset of the actual function, foo(), once the linker knows its location in the executable during link time:

ryan@alchemy:~$ readelf -r obj1.o
Relocation section '.rel.text' at offset 0x394 contains 1 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
00000007  00000902 R_386_PC32        00000000   foo

As we can see, a relocation field at offset 7 is specified by the relocation entry's r_offset field. R_386_PC32 is the relocation type; to understand all of these types, read the ELF specs, as we will only be covering some. Each relocation type requires a different computation on the relocation target being modified. R_386_PC32 says to modify the target with S + A – P.
The following list explains all these terms:

S is the value of the symbol whose index resides in the relocation entry
A is the addend found in the relocation entry
P is the place (section offset or address) where the storage unit is being relocated (computed using r_offset)

If we look at the final output of our executable after compiling obj1.o and obj2.o, as shown in the following code snippet:

ryan@alchemy:~$ gcc -nostdlib obj1.o obj2.o -o relocated
ryan@alchemy:~$ objdump -d relocated
test:     file format elf32-i386
Disassembly of section .text:
080480d8 <func>:
 80480d8:  55                     push   %ebp
 80480d9:  89 e5                  mov    %esp,%ebp
 80480db:  83 ec 08               sub    $0x8,%esp
 80480de:  e8 05 00 00 00         call   80480e8 <foo>
 80480e3:  c9                     leave
 80480e4:  c3                     ret
 80480e5:  90                     nop
 80480e6:  90                     nop
 80480e7:  90                     nop
080480e8 <foo>:
 80480e8:  55                     push   %ebp
 80480e9:  89 e5                  mov    %esp,%ebp
 80480eb:  5d                     pop    %ebp
 80480ec:  c3                     ret

We can see that the call instruction (the relocation target) at 0x80480de has been modified with the 32 bit offset value of 5, which points to foo(). The value 5 is the result of the R_386_PC32 relocation action:

S + A – P: 0x80480e8 + 0xfffffffc – 0x80480df = 5

0xfffffffc is the same as -4 as a signed integer, so the calculation can also be seen as:

0x80480e8 – (0x80480df + sizeof(uint32_t))

To calculate an offset into a virtual address, use the following computation:

address_of_call + offset + 5 (where 5 is the length of the call instruction)

Which in this case is 0x80480de + 5 + 5 = 0x80480e8. An address may also be computed into an offset with the following computation:

address – address_of_call – 4 (where 4 is the length of a call instruction – 1)
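As a quick sanity check of the arithmetic above (an illustration only, not part of the book's example), the same numbers can be plugged into a few lines of C:

/* reloc_check.c — verify the R_386_PC32 computation for the example above. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t S = 0x80480e8;           /* value (address) of the symbol foo    */
    int32_t  A = (int32_t)0xfffffffc; /* implicit addend stored at the target */
    uint32_t P = 0x80480df;           /* place: address of the call's operand */

    printf("S + A - P           = %d\n", (int32_t)(S + A - P));            /* prints 5         */
    printf("call target address = 0x%x\n", (unsigned)(0x80480de + 5 + 5)); /* prints 0x80480e8 */
    return 0;
}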
Relocatable code injection based binary patching

Relocatable code injection is a technique that hackers, virus writers, or anyone who wants to modify the code in a binary may utilize as a way to, in effect, re-link a binary after it has already been compiled. That is, you can inject an object file into an executable, update the executable's symbol table, and perform the necessary relocations on the injected object code so that it becomes a part of the executable. A complicated virus might use this rather than just appending code at the end of an executable or finding existing padding. This technique requires extending the text segment to create enough padding room to load the object file. The real trick, though, is handling the relocations and applying them properly.

I designed a custom reverse engineering tool for ELF that is named Quenya. Quenya has many features and capabilities, and one of them is to inject object code into an executable. Why do this? Well, one reason would be to inject a malicious function into an executable, and then hijack a legitimate function and replace it with the malicious one. From a security point of view, one could do hot-patching and apply a legitimate patch to a binary rather than doing something malicious. Let's pretend we are an attacker and we want to infect a program that calls puts() to print "Hello World", and our goal is to hijack puts() so that it calls evil_puts().

First, we would need to write a quick PIC object that can write a string to standard output:

#include <sys/syscall.h>

int _write (int fd, void *buf, int count)
{
  long ret;

  __asm__ __volatile__ ("pushl %%ebx\n\t"
                        "movl %%esi,%%ebx\n\t"
                        "int $0x80\n\t"
                        "popl %%ebx":"=a" (ret)
                        :"0" (SYS_write), "S" ((long) fd),
                        "c" ((long) buf), "d" ((long) count));
  if (ret >= 0) {
    return (int) ret;
  }
  return -1;
}

int evil_puts(void)
{
        _write(1, "HAHA puts() has been hijacked!\n", 31);
}

Now, we compile evil_puts.c into evil_puts.o and inject it into our program, hello_world:

ryan@alchemy:~/quenya$ ./hello_world
Hello World

This program calls the following:

puts("Hello World\n");

We now use Quenya to inject and relocate our evil_puts.o file into hello_world:

[Quenya v0.1@alchemy] reloc evil_puts.o hello_world
0x08048624  addr: 0x8048612
0x080485c4 _write addr: 0x804861e
0x080485c4  addr: 0x804868f
0x080485c4  addr: 0x80486b7
Injection/Relocation succeeded

As we can see, the function _write() from our evil_puts.o has been relocated and assigned an address at 0x804861e in the executable, hello_world. The next command, hijack, overwrites the global offset table entry for puts() with the address of evil_puts():

[Quenya v0.1@alchemy] hijack binary hello_world evil_puts puts
Attempting to hijack function: puts
Modifying GOT entry for puts
Succesfully hijacked function: puts
Commiting changes into executable file
[Quenya v0.1@alchemy] quit

And Whammi!

ryan@alchemy:~/quenya$ ./hello_world
HAHA puts() has been hijacked!

We have successfully relocated an object file into an executable and modified the executable's control flow so that it executes the code that we injected. If we use readelf -s on hello_world, we can actually now see a symbol called evil_puts(). For the reader's interest, I have included a small snippet of code that contains the ELF relocation mechanics in Quenya; it may be a little bit obscure without knowledge of the rest of the code base, but it is also somewhat straightforward if you've paid attention to what we learned about relocations. It is just a snippet and does not show any of the other important aspects, such as modifying the executable's symbol table:
It is just a snippet and does not show any of the other important aspects, such as modifying the executable's symbol table:
case SHT_RELA:
  /* relocation entries for the section being processed */
  rela = (Elf32_Rela *)(obj.mem + obj.shdr[i].sh_offset);
  for (j = 0; j < obj.shdr[i].sh_size / sizeof(Elf32_Rela); j++, rela++) {
      /* symbol table */
      symtab = (Elf32_Sym *)obj.section[obj.shdr[i].sh_link];

      /* symbol we are applying the relocation to */
      symbol = &symtab[ELF32_R_SYM(rela->r_info)];

      /* section to modify */
      TargetSection = &obj.shdr[obj.shdr[i].sh_info];
      TargetIndex = obj.shdr[i].sh_info;

      /* target location */
      TargetAddr = TargetSection->sh_addr + rela->r_offset;

      /* pointer to relocation target */
      RelocPtr = (Elf32_Addr *)(obj.section[TargetIndex] + rela->r_offset);

      /* relocation value */
      RelVal = symbol->st_value;
      RelVal += obj.shdr[symbol->st_shndx].sh_addr;

      switch (ELF32_R_TYPE(rela->r_info))
      {
        /* R_386_PC32      2    word32  S + A - P */
        case R_386_PC32:
              *RelocPtr += RelVal;
              *RelocPtr += rela->r_addend;
              *RelocPtr -= TargetAddr;
              break;
        /* R_386_32        1    word32  S + A */
        case R_386_32:
              *RelocPtr += RelVal;
              *RelocPtr += rela->r_addend;
              break;
      }
  }
As shown in the preceding code, the relocation target that RelocPtr points to is modified according to the relocation action requested by the relocation type (such as R_386_32). Although relocatable code binary injection is a good example of the idea behind relocations, it is not a perfect example of how a linker actually performs relocation across multiple object files. Nevertheless, it still conveys the general idea and application of a relocation action. Later on, we will talk about shared library (ET_DYN) injection, which brings us to the topic of dynamic linking.
Summary
In this article we discussed different types of ELF section headers and ELF relocations.
Resources for Article:
Further resources on this subject: Advanced User Management [article] Creating Code Snippets [article] Advanced Wireless Sniffing [article]

Different IR Algorithms

Packt
04 Jan 2016
23 min read
In this article written by Sudipta Mukherjee, author of the book F# for Machine Learning, we learn that information overload may be almost a passé term, but it is still a valid one. Information retrieval is a big arena, and most of it is far from being solved. That being said, we have come a long way, and the results produced by some of the state-of-the-art information retrieval algorithms are really impressive. You may not know that you are using information retrieval, but whenever you search for documents on your PC or on the Internet, you are actually using the product of some information retrieval algorithm in the background. So, as the metaphor goes, finding the needle (read: information/insight) in a haystack (read: your data archive on your PC or on the web) is the key to successful business. Broadly, two documents can be matched in one of two ways:
Distance based: Two documents are matched based on their proximity, calculated by one of several distance metrics on the vector representations of the documents
Set based: Two documents are matched based on their proximity, calculated by a set-based (or fuzzy set-based) metric on the bag of words (BoW) models of the documents
(For more resources related to this topic, see here.)
What are some interesting things that you can do? You will learn how the same algorithm can find similar biscuits and identify the author of digital documents from the words that authors use. You will also learn how IR distance metrics can be used to group color images.
Information retrieval using tf-idf
Whenever you type some search term in your "Windows" search box, some documents appear matching your search term. There is a common, well-known, easy-to-implement algorithm that makes it possible to rank the documents based on the search term. Basically, the algorithm allows the developers to assign some kind of score to each document in the result set. That score can be seen as a score of confidence that the system has in how much the user would like that result. The score that this algorithm attaches to each document is the product of two different scores. The first one is called term frequency (tf) and the other one is called inverse document frequency (idf). Their product is referred to as tf-idf, or "term frequency inverse document frequency". Tf is the number of times a term occurs in a given document. Idf is based on the ratio between the total number of documents scanned and the number of documents in which a given search term is found. However, this ratio is not used as is; the log of this ratio is used as idf, as shown next. For the word "example", which occurs three times in the second document below, the term frequency is tf("example", d2) = 3. Idf is normally calculated with the following formula:
idf(t, D) = log10(N / |{d in D : t in d}|)
Here, D denotes the set of all the documents, N is the total number of documents in D, the denominator is the number of documents that contain the term t, and d2 denotes the second document. The following code demonstrates how to find the tf-idf score for a given search term; in this case, "example". The sentences are fabricated so that the word "example" occurs the desired number of times in document 2; in this case, sentence2.
let sentence1 = "this is a a sample"
let sentence2 = "this is another another example example example"
let word = "example"
let numberOfDocs = 2.
let tf1 = sentence1.Split ' ' |> Array.filter (fun t -> t = word) |> Array.length
let tf2 = sentence2.Split ' ' |> Array.filter (fun t -> t = word) |> Array.length
let docs = [|sentence1;sentence2|]
let foundIn = docs |> Array.map (fun t -> t.Split ' '
                                          |> Array.filter (fun z -> z = word))
                   |> Array.filter (fun m -> m |> Array.length <> 0)
                   |> Array.length
let idf = Operators.log10 (numberOfDocs / float foundIn)
let pr1 = float tf1 * idf
let pr2 = float tf2 * idf
printfn "%f %f" pr1 pr2
This produces the following output:
0.0 0.903090
This means that the second document is more closely related to the word "example" than the first one. This is one of the extreme cases, where the word doesn't appear at all in one document and appears three times in the other. However, with the same word occurring multiple times in both documents, you will still get different scores for each document. You can think of these scores as confidence scores for the association between the word and the document. The higher the score, the higher the confidence that the document contains something related to that word.
Measures of similarity
In the following section, you will create a framework for computing several distance measures. A distance between two probability distribution functions (pdfs) is an important way to know how close two entities are. One way to generate a pdf from a histogram is to normalize the histogram.
Generating a PDF from a histogram
A histogram holds the number of times a value occurred. For example, a text can be represented as a histogram where the histogram values represent the number of times a word appears in the text. For gray images, it can be the number of times each gray scale appears in the image. In the following section, you will build a few modules that hold several distance metrics. A distance metric is a measure of how similar two objects are. It can also be called a measure of proximity. The following metrics use the notation Pi and Qi to denote the ith value of either PDF. Let's say we have a histogram denoted by H, and there are N elements in the histogram. Then a rough estimate of the ith probability, Pi, is Hi / N. The following function does this transformation from histogram to pdf:
let toPdf (histogram:int list)=
        histogram |> List.map (fun t -> float t / float histogram.Length )
There are a couple of important assumptions made here. First, we assume that the number of bins is equal to the number of elements, so the histogram and the pdf will have the same number of elements. That's not exactly correct in the mathematical sense: all the elements of a pdf should add up to 1. The following implementation of histToPdf guarantees that, but it is not the estimate used with the metrics in the rest of this article, so resist the temptation to use this one here.
let histToPdf (histogram:int list)=
        let sum = histogram |> List.sum
        histogram |> List.map (fun t -> float t / float sum)
Generating a histogram from a list of elements is simple. The following function takes a list and returns the histogram; F# already has the required function built in, called countBy.
let listToHist (l : int list)=
        l |> List.countBy (fun t -> t)
Next is an example of how, using these two functions, a list of integers can be transformed into a pdf.
The following method takes a list of integers and returns the associated probability distribution:
let listToPdf (aList : int list)=
        aList |> List.countBy (fun t -> t)
              |> List.map snd
              |> toPdf
Here is how you can use it:
let list = [1;1;1;2;3;4;5;1;2]
let pdf = list |> listToPdf
I ran the following in F# Interactive and got this histogram from the preceding list:
val it : (int * int) list = [(1, 4); (2, 2); (3, 1); (4, 1); (5, 1)]
If you project just the second element of each pair and store the results in an int list, you can represent the histogram as a plain list. So, for this example, the histogram can be represented as:
let hist = [4;2;1;1;1]
Distance metrics are classified into several families based on their structural similarity. The sections that follow show how to implement these metrics in F#, using histograms represented as int list values as input. P and Q denote the vectors that represent the two entities being compared. For example, in a document retrieval system, these numbers might indicate the number of times a given word occurred in each of the documents being compared. Pi denotes the ith element of the vector represented by P, and Qi denotes the ith element of the vector represented by Q. Some literature calls these vectors profile vectors.
Minkowski family
When two pdfs are almost the same, this family of distance metrics tends towards zero; when they are far apart, the values grow towards positive infinity. So, if the distance between two pdfs is close to zero, we can conclude that they are similar, and if the distance is large, we can conclude otherwise. All of these metrics are special cases of what's known as the Minkowski distance.
Euclidean distance
The following code implements the Euclidean distance (note the final square root, which distinguishes it from the squared Euclidean distance shown later):
//Euclidean distance
let euclidean (p:int list)(q:int list) =
      sqrt (List.zip p q |> List.sumBy (fun t -> float (fst t - snd t) ** 2.))
City block distance
The following code implements the City block (Manhattan) distance, which sums the absolute differences:
let cityBlock (p:int list)(q:int list) =
    List.zip p q |> List.sumBy (fun t -> float (abs (fst t - snd t)))
Chebyshev distance
The following code implements the Chebyshev distance:
let chebyshev(p:int list)(q:int list) =
        List.zip p q |> List.map (fun t -> abs (fst t - snd t)) |> List.max
L1 family
This family of distances relies on normalization to keep the values within limits. All these metrics are of the form A/B, where A is primarily a measure of proximity between the two pdfs P and Q. Most of the time, A is calculated from the absolute differences. For example, the numerator of the Sørensen distance is the City block distance, while the denominator is a normalization component obtained by adding the corresponding elements of the two participating pdfs.
Sørensen
The following code implements the Sørensen distance:
let sorensen (p:int list)(q:int list) =
        let zipped = List.zip (p |> toPdf) (q |> toPdf)
        let numerator = zipped |> List.sumBy (fun t -> abs (float (fst t - snd t)))
        let denominator = zipped |> List.sumBy (fun t -> float (fst t + snd t))
        numerator / denominator
Gower distance
The following code implements the Gower distance. Note that there will be a division by zero if the collection p is empty.
let gower(p:int list)(q:int list)=
        //I love this.
        //Free-flowing, fluid conversion rather than cramping
        //abs and fst t - snd t into a single line
     let numerator = List.zip (p|>toPdf) (q |> toPdf)
                         |> List.map (fun t -> fst t - snd t)
                         |> List.map (fun z -> abs z)
                         |> List.sum
     let denominator = float p.Length
     numerator / denominator
Soergel
The following code implements the Soergel distance:
let soergel (p:int list)(q:int list) =
        let zipped = List.zip (p|>toPdf) (q |> toPdf)
        let numerator = zipped |> List.sumBy (fun t -> abs (fst t - snd t))
        let denominator = zipped |> List.sumBy (fun t -> max (fst t) (snd t))
        numerator / denominator
Kulczynski d
The following code implements the Kulczynski d distance:
let kulczynski_d (p:int list)(q:int list) =
        let zipped = List.zip (p|>toPdf) (q |> toPdf)
        let numerator = zipped |> List.sumBy (fun t -> abs (fst t - snd t))
        let denominator = zipped |> List.sumBy (fun t -> min (fst t) (snd t))
        numerator / denominator
Kulczynski s
The following code implements the Kulczynski s measure, which is the reciprocal of Kulczynski d:
let kulczynski_s (p:int list)(q:int list) =
        1. / kulczynski_d p q
Canberra distance
The following code implements the Canberra distance:
let canberra (p:int list)(q:int list) =
    let zipped = List.zip (p|>toPdf) (q |> toPdf)
    let numerator = zipped |> List.sumBy (fun t -> abs (fst t - snd t))
    let denominator = zipped |> List.sumBy (fun t -> fst t + snd t)
    numerator / denominator
Intersection family
This family of distances tries to find the overlap between the two participating pdfs.
Intersection
The following code implements the Intersection measure:
let intersection(p:int list) (q: int list) =
        List.zip (p|>toPdf) (q |> toPdf)
            |> List.map (fun t -> min (fst t) (snd t)) |> List.sum
Wave Hedges
The following code implements the Wave Hedges distance:
let waveHedges (p:int list)(q:int list)=
        List.zip (p|>toPdf) (q |> toPdf)
            |> List.sumBy (fun t -> 1.0 - min (fst t) (snd t)
                                          / max (fst t) (snd t))
Czekanowski distance
The following code implements the Czekanowski distance:
let czekanowski(p:int list)(q:int list) =
        let zipped = List.zip (p|>toPdf) (q |> toPdf)
        let numerator = 2. * (zipped |> List.sumBy (fun t -> min (fst t) (snd t)))
        let denominator = zipped |> List.sumBy (fun t -> fst t + snd t)
        numerator / denominator
Motyka
The following code implements the Motyka distance:
let motyka(p:int list)(q:int list)=
       let zipped = List.zip (p|>toPdf) (q |> toPdf)
       let numerator = zipped |> List.sumBy (fun t -> min (fst t) (snd t))
       let denominator = zipped |> List.sumBy (fun t -> fst t + snd t)
       numerator / denominator
Ruzicka
The following code implements the Ruzicka distance:
let ruzicka (p:int list) (q:int list) =
        let zipped = List.zip (p|>toPdf) (q |> toPdf)
        let numerator = zipped |> List.sumBy (fun t -> min (fst t) (snd t))
        let denominator = zipped |> List.sumBy (fun t -> max (fst t) (snd t))
        numerator / denominator
Inner Product family
Distances belonging to this family are calculated from some product of the pairwise elements of the two participating pdfs. This product is then normalized by a value that is also calculated from the pairwise elements.
Innerproduct The following code implements Inner-productdistance. let innerProduct(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)             |> List.sumBy (fun t -> fst t * snd t) Harmonic mean The following code implements Harmonicdistance. let harmonicMean(p:int list)(q:int list)=        2. * (List.zip (p|>toPdf) (q |> toPdf)           |> List.sumBy (fun t -> ( fst t * snd t )/(fst t + snd t))) Cosine similarity The following code implements Cosine Similarity distance measure. let cosineSimilarity(p:int list)(q:int list)=     let zipped = List.zip p q     let prod  (x,y) = float x *  float y     let numerator = zipped |> List.sumBy prod     let denominator  =  sqrt ( p|> List.map sqr |> List.sum |> float) *                         sqrt ( q|> List.map sqr |> List.sum |> float)     numerator / denominator Kumar Hassebrook The following code implements Kumar Hassebrook distance measure.   let kumarHassebrook (p:int list) (q:int list) =         let sqr x = x * x         let zipped = List.zip (p|>toPdf) (q |> toPdf)         let numerator = zipped |> List.sumBy prod         let denominator =  (p |> List.sumBy sqr) +                            (q |> List.sumBy sqr) - numerator           numerator / denominator Dice coefficient The following code implements Dice coefficient.   let dicePoint(p:int list)(q:int list)=             let zipped = List.zip (p|>toPdf) (q |> toPdf)         let numerator = zipped |> List.sumBy (fun t -> fst t * snd t)         let denominator  =  (p |> List.sumBy sqr)  +                             (q |> List.sumBy sqr)           float numerator / float denominator Fidelity family or squared-chord family This family of distances uses square root as an instrument to keep the distance within limit. Sometimes other functions, such as log, are also used. Fidelity The following code implements FidelityDistance measure.   let fidelity(p:int list)(q:int list)=               List.zip (p|>toPdf) (q |> toPdf)|> List.map  prod                                         |> List.map sqrt                                         |> List.sum Bhattacharya The following code implements Bhattacharya distance measure. let bhattacharya(p:int list)(q:int list)=         -log (fidelity p q) Hellinger The following code implements Hellingerdistance measure. let hellinger(p:int list)(q:int list)=             let product = List.zip (p|>toPdf) (q |> toPdf)                          |> List.map prod |> List.sumBy sqrt        let right = 1. -  product        2. * right |> abs//taking this off will result in NaN                   |> float                   |> sqrt Matusita The following code implements Matusita distance measure. let matusita(p:int list)(q:int list)=         let value =2. - 2. *( List.zip (p|>toPdf) (q |> toPdf)                             |> List.map prod                             |> List.sumBy sqrt)         value |> abs |> sqrt Squared Chord The following code implements Squared Chord distance measure. let squarredChord(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)             |> List.sumBy (fun t -> sqrt (fst t ) - sqrt (snd t)) Squared L2 family This is almost the same as the L1 family, just that it got rid of the expensive square root operation and relies on the squares instead. However, that should not be an issue. Sometimes the squares can be quite large so a normalization scheme is provided by dividing the result of the squared sum with another square sum, as done in "Divergence". 
Squared Euclidean The following code implements Squared Euclidean distance measure. For most purpose this can be used instead of Euclidean distance as it is computationally cheap and performs as well. let squaredEuclidean (p:int list)(q:int list)=        List.zip (p|>toPdf) (q |> toPdf)            |> List.sumBy (fun t-> float (fst t - snd t) ** 2.0) Squared Chi The following code implements Squared Chi distance measure. let squaredChi(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)            |> List.sumBy (fun t -> (fst t - snd t ) ** 2.0 / (fst t + snd t)) Pearson's Chi The following code implements Pearson’s Chidistance measure. let pearsonsChi(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)            |> List.sumBy (fun t -> (fst t - snd t ) ** 2.0 / snd t) Neyman's Chi The following code implements Neyman’s Chi distance measure. let neymanChi(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)            |> List.sumBy (fun t -> (fst t - snd t ) ** 2.0 / fst t) Probabilistic Symmetric Chi The following code implements Probabilistic Symmetric Chidistance measure. let probabilisticSymmetricChi(p:int list)(q:int list)=        2.0 * squaredChi p q Divergence The following code implements Divergence measure. This metric is useful when the elements of the collections have elements in different order of magnitude. This normalization will make the distance properly adjusted for several kinds of usages. let divergence(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)             |> List.sumBy (fun t -> (fst t - snd t) ** 2. / (fst t + snd t) ** 2.) Clark The following code implements Clark’s distance measure. let clark(p:int list)(q:int list)=               sqrt( List.zip (p|>toPdf) (q |> toPdf)         |> List.map (fun t ->  float( abs( fst t - snd t))                                / (float (fst t + snd t)))           |> List.sumBy (fun t -> t * t)) Additive Symmetric Chi The following code implements Additive Symmetric Chi distance measure. let additiveSymmetricChi(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)             |> List.sumBy (fun t -> (fst t - snd t ) ** 2. * (fst t + snd t)/ (fst t * snd t))  Summary Congratulations!! you learnt how different similarity measures work and when to use which one, to find the closest match. Edmund Burke said that, "It's the nature of every greatness not to be exact", and I can't agree more. Most of the time the users aren't really sure what they are looking for. So providing a binary answer of yes or no, or found or not found is not that useful. Striking the middle ground by attaching a confidence score to each result is the key. The techniques that you learnt will prove to be useful when we deal with recommender system and anomaly detection, because both these fields rely heavily on the IR techniques. Resources for Article:   Further resources on this subject: Learning Option Pricing [article] Working with Windows Phone Controls [article] Simplifying Parallelism Complexity in C# [article]

Learning Xero Purchases

Packt
30 Dec 2015
14 min read
In this article written by Jon Jenkins, author of the book Learning Xero, the author wants us to learn all of the Xero core purchase processes from posting purchase bills and editing contacts to making supplier payments. You will learn how the purchase process works and how that impacts on inventory. By the end of this article,you will have a thorough understanding of the purchase dashboard and its component parts as well as the ability to easily navigate to the areas you need. These are the topics we'll cover in this article: Understanding the purchase dashboard layout Adding new contacts and posting bills Adding new inventory items to bills (For more resources related to this topic, see here.) Dashboard The purchases dashboard in Xero is your one-stop shop for everything you need to manage the purchases for the business. To get to the purchases dashboard,you can click on Bills you need to payon the main dashboard or navigate to Accounts|Purchases. We have highlighted the major elements that make up the following dashboard. On the right-hand side of the Dashboard, you will find a Search button. Use it to save time searching by bill number, reference, supplier, or amount, and drill down even further by selecting withina date range by due date or transaction date. On the left-hand side of the dashboard, you have the main controls for performing tasks such as adding a new bill, repeating bill, credit note, or purchase order. Under the main controls are the summary figures of the purchases, with the figure in brackets representing the number of transactions that make up the figure and the numerical value representing the total of those transactions. Clicking on any of these sections will drill down to the bills that make up that total. You can also see draft bills that need approval. Untilbills have been approved (so anything with a Draft or Awaiting Approval status), they will not be included in any reports you run within Xero, such as the profit and loss account.So, if you have an approval process within your business,ensure that people adhere to it to improve the level of accuracy of reports. Once you click on the summary, you will see the following table, which shows the bills making up that selection. You can also see tabs relating to the other criteria within Xero, making it easy to navigate between the various lists without having to perform searches or navigating away from this screen and breaking your workflow. This view also allows you to add new transactions and mark bills as Paid, providing you with an uninterrupted workflow across the process of making purchases. Addingcontacts You can add a contact in Xero when raising abill to avoid navigating away and coming back, but you cannot enter any of the main contact details at that point. If you need to enter a purchase order to a new supplier,we would recommend that you add the supplier first by navigating to Contacts|All Contacts|Add Contact, so you have all the correct details for issuing the document. Importing When you first start using Xero or even if you have been using it for a while, you may have a separate database elsewhere with your contacts. You can import these into Xero using the predetermined CSV file template. This will enable you to keep all of your records in sync. Navigate to the Contacts|All Contacts|Import as shown in the following screenshot: When you click on Import, this action will then take you to a screen where you can download the Xero contact import template. 
Download the file so that you can compile your file for importing. We would recommend doing a search in the Xero Help Guide at this point on Import Contact and take a look through the guide shown in the following screenshot before you attempt to import to ensure that you have configured the file correctly and there is a smaller chance of the file being rejected: Editing a contact Once a contact record has been set up, you can edit the details by navigating to Contacts|Suppliers. From here, you can find the supplier you need using the letters at the top or using the Search facility. When you click on the supplier, the Edit button will take you into the contact record so that you can make changes as required. Ensure that you save any changes you have made. A few of the options you may wish to complete here to make the processing of purchases a bit easier and quicker is to add defaults. The items that you can default are the pieces of account code where you would like to post the bills and whether the amounts are exclusive or inclusive of tax. These will help make it quicker to post bills and reduce the likelihood of mispostings. These can, of course, be overwritten when posting bills. You can also choose a default VAT tax rate, which again can resolve issues where people are not sure which tax rate to use. There are various other default settings you can choose in the contact record, and we do suggest that you take a look at these to see where you can both reduce the opportunity for mistakes being made while making it quicker to process bills. Purchasing Getting the payment of suppliers bills correct is fundamental to running a tight ship. Making mistakes when it comes to purchasing goods can lead to incorrect stock holding, overpayments to suppliers, or deliveries being put on hold, which can have a significant impact on the trading of the business. It is therefore crucial that you do things in the correct way. Standardbills There are many ways to create a bill in Xero. From the Bills you need to paysection on the right-hand side of the main dashboard when you first login, click on New bill, as shown in the following screenshot: Alternatively, you can do it by navigating to Accounts |Purchases | New. You may also notice that when you have drilled down into any of the summary items on the dashboard, above each tab, you will also see the option to add new bill, as shown in the following screenshot: As you can see, there are many ways to raise a new bill, but whichever way you do it, you will then be shown this screen: If you start typing the supplier name in the From field, you will be provided with a list of supplier names beginning with those letters to make it easier and quicker to make a selection. If there is no contact recordat this point, you can click on + add contact to add a new supplier name so that you can complete raising the bill. As you start typing,Xero will provide a list of possible matches, so be careful not to set up multiple supplier accounts. The Datefield of the bill will default to the day on which you raise the bill. You can change this if you wish to. If you tab through Due Date, it will default to the day the bill is raised. If you have selected business-wide default payment terms, then the date will default to what you have selected.Likewise, if you have set a supplier default credit term, then that will override the business default at this point. The Reference field is where you will enter the supplier invoice number. 
The paper icon allows you to upload items, suchas a copy of the bill, a purchase order, or a contract perhaps,and attach them to the bill. It is up to you, but attaching documents, such as bills, makes life a lot easier as clicking on the paper icon allows you to see the document on screen, so no more sifting through filing cabinets and folders. You can also attach more than one document if you need to. You can use the Total field as a double-check if you like, but you do not have to use it. If you prefer, you can enter the bill total in this field. However, if what you have entered does not total when you try to post the invoice, this action will tell you, and you can adjust accordingly. You can change the currencyof the bill at this point using the Currency field,but again, if you have suppliers that bill you a different currency, we would suggest setting this as a default in the contact record to avoid posting the supplier bill in the wrong currency. If you do not have multiple currencies set up in Xero, you will not see this option when posting a bill. The Amounts arefield can be used to toggle between tax-exclusive and tax-inclusive values, which makes it easier when posting bills.If you have the gross purchase figure, you would use the tax inclusive figure and allow Xero to automatically calculate the VAT for you without using a calculator. Inventory items Inventory items on a bill can save a lot of unnecessary data entry if you have standard services or products that you purchase within the business. You can add an inventory item when posting a bill, or you can add them by navigating to Settings|General Settings|Inventory Items. If you have decided to use inventory items, it would be sensible to do some from the start and use the Import file to set them up quickly and easily. Each time you post a bill, you can then select the inventory item;this will pull through the default description and amounts for that inventory item. You can then adjust the descriptions as necessary but it saves you from having to type your product descriptions over and over again. If you are using inventory items, the amount you have purchased will be recorded against the inventory item.If you useXero for inventory control, it will make the necessary adjustments between your inventory on hand and the cost of goods sold in the profit and loss account. If you do not use inventory items, as a minimum, you will need to enter a description, a quantity, the unit price, the purchases account you are posting it to, and the tax rate to be used. Along with making the data entry quicker, inventory items give you the ability to run purchase reports by items, meaning that you do not have to set up a different account code for each type of goods you wish to generate reportsfor. It is worth taking some time to think about what you want to report on before posting your first bills. Once you have finished entering your bill details, you can then save your billor approve your bill. The User Roles that have been assigned to the employees will drive the functionality they have available here. If you choose to save a bill this appears in the Draft section of the purchase summary on the dashboard. If you have a purchase workflow in your business where a member of staff can raise but not approve a bill, then Save would be the option for them. As you can see in the following screenshot,there are several options available. 
If you are posting several bills in one go, you would choose Save & add another as this saves you from being sent to the purchases dashboard each time and then having to navigate back to this section. In order for a bill to be posted the accounts ledgers, it will need to be approved. Once the bill has been approved, you will see that a few extra options have now appeared both above and below the invoice. Above the invoice, you now have the ability toprint the invoice as a PDF, attach a document, or edit the bill by clicking on Bill Options. Under the bill, you will now have a new box appear, which gives you the ability to mark the bill as paid if it has been paid. Under this, you now have the History & Notes section, which gives you a full audit trail of what has happened to the bill, includingif anyone has edited the bill. Repeatingbills The process for raising a repeating bill is exactly the same as for a standard bill except for completing the additional fields shown in the following screenshot. You can choose the frequency of the bill to repeat, the date it is to start from and the end date which is optional. The Due Datefield is set in the bill settings as default to your business default or the supplier default if you have a default set up in the Contact record. Before you can save the bill,you will have to select how you want the bill to be posted. If you select Save as Draft,someone will have to approve each bill. If you select Approve,the bill will get posted to the accounting records with no manual intervention. Here are a couple of points to note if using repeating bills: You would only want to use this for invoices where there are regular items to be posted to the same account code and for the same amount. It is easy to forget that you have set these up and end up posting the bill as well and duplicating transactions. You can set a reference for the repeating bill and this will be the same for all bills posted. If you were to select the Save as Draft option rather than the Approve option, it will give you a chance to amend the reference to the correct invoice number before posting. You can enter placeholders to add the Week, Month, Year or a combination of those to ensure that the correct narrative is used in the description.However, use this for reference and avoid the point raised previously about using Save as Draft instead of Approve. Xeronetwork key The Xero Network Keyallows you to receive your bill's data directly into your Xero draft purchases section from someone else's salesledger if they use Xero. This can be a massive time saver if you are doing lots of transactions with other Xero users. Each Xero subscription has its own unique Xero Network Key, which you can share with the other Xero business by entering it into the relevant field in the contact record. If you opt to do this, your bill data will be passed through to the Draft Purchases section in Xero and you will need to check the details, enter the account code to post it to, and then approve. Here, minimal data entry is required, and this is a much quicker process. To locate your Xero Network Key to be provided to the other Xero user,navigate to Settings|General Settings|Xero to Xero. Batch payments If you pay multiple invoices from a single supplier or your bank gives you the ability to produce a BACS file to import to your online banking, then you may wish to use batch payments. Batch Payments speed up the process of paying suppliers and reconciling the bank account. 
It can do this in three ways: You can mark lots of invoices from multiple suppliers as paid at the same time. You can create a file to be imported into your online banking. When the supplier payments appear on the bank feed, the amount paid from the online banking will match with what you have allocated in Xero, meaning that autosuggest comes into play and will make the correct suggestion rather than you having to use Find & Match to find all of the individual payments you allocated. This is time consuming and error prone. Summary We have successfully added a purchases bill and identified the different types of bills available in Xero. In this article, we have run through the major purchases functions and we setup Repeating Bills. We explored how to use inventory items to make the purchases invoicing process even easier and quicker. On top of that, we have also looked at how to navigate around the purchases dashboard, how to make changes, and also how to track exactly what has happened to a bill. Resources for Article: Further resources on this subject: Big Data Analytics [article] Why Big Data in the Financial Sector? [article] Big Data [article]

Video Surveillance, Background Modeling

Packt
30 Dec 2015
7 min read
In this article by David Millán Escrivá, Prateek Joshi, and Vinícius Godoy, the authors of the book OpenCV By Example, we will see that in order to detect moving objects, we first need to build a model of the background. This is not the same as direct frame differencing, because we are actually modeling the background and using this model to detect moving objects. When we say that we are modeling the background, we are basically building a mathematical formulation that can be used to represent the background, so this performs much better than the simple frame differencing technique. This technique tries to detect the static parts of the scene, includes them in the background model, and keeps updating the model. The background model is then used to detect background pixels. So, it's an adaptive technique that can adjust according to the scene.
(For more resources related to this topic, see here.)
Naive background subtraction
Let's start the discussion from the beginning. What does a background subtraction process look like? Consider the following image:
The preceding image represents the background scene. Now, let's introduce a new object into this scene:
As shown in the preceding image, there is a new object in the scene. So, if we compute the difference between this image and our background model, we should be able to identify the location of the TV remote:
The overall process looks like this:
Does it work well? There's a reason why we call it the naive approach. It works under ideal conditions, and as we know, nothing is ideal in the real world. It does a reasonably good job of computing the shape of the given object, but it does so under some constraints. One of the main requirements of this approach is that the color and intensity of the object should be sufficiently different from those of the background. Some of the factors that affect these kinds of algorithms are image noise, lighting conditions, autofocus in cameras, and so on.
Once a new object enters our scene and stays there, it will be difficult to detect new objects that are in front of it. This is because we don't update our background model, and the new object is now part of our background. Consider the following image:
Now, let's say a new object enters our scene:
We identify this to be a new object, which is fine. Let's say another object comes into the scene:
It will be difficult to identify the location of these two different objects because their locations overlap. Here's what we get after subtracting the background and applying the threshold:
In this approach, we assume that the background is static. If some parts of our background start moving, those parts will start getting detected as new objects. So, even if the movements are minor, say a waving flag, they will cause problems in our detection algorithm. This approach is also sensitive to changes in illumination, and it cannot handle any camera movement. Needless to say, it's a delicate approach! We need something that can handle all these things in the real world.
Frame differencing
We know that we cannot keep a static background image that can be used to detect objects. So, one of the ways to fix this would be to use frame differencing. It is one of the simplest techniques that we can use to see which parts of the video are moving. When we consider a live video stream, the difference between successive frames gives us a lot of information. The concept is fairly straightforward: we just take the difference between successive frames and display the difference.
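As a minimal sketch of just that idea, the following snippet computes the absolute difference between the current and previous grayscale frames and thresholds it; treat the threshold value of 30 as an illustrative assumption rather than a tuned setting. A single two-frame difference like this tends to be noisy, which is why the three-frame frameDiff() function shown a little later is more stable:
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap(0);
    if (!cap.isOpened())
        return -1;

    cv::Mat frame, prevGray, curGray, diff, mask;

    // Grab the first frame and convert it to grayscale
    cap >> frame;
    cv::cvtColor(frame, prevGray, cv::COLOR_BGR2GRAY);

    while (true) {
        cap >> frame;
        if (frame.empty())
            break;
        cv::cvtColor(frame, curGray, cv::COLOR_BGR2GRAY);

        // Absolute difference between successive frames
        cv::absdiff(curGray, prevGray, diff);

        // Keep only the pixels that changed noticeably
        cv::threshold(diff, mask, 30, 255, cv::THRESH_BINARY);

        cv::imshow("Frame difference", mask);
        prevGray = curGray.clone();

        // 27 -> ASCII value of the Esc key
        if (cv::waitKey(30) == 27)
            break;
    }
    return 0;
}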
If I move my laptop rapidly, we can see something like this: Instead of the laptop, let's move the object and see what happens. If I rapidly shake my head, it will look something like this: As you can see in the preceding images, only the moving parts of the video get highlighted. This gives us a good starting point to see the areas that are moving in the video. Let's take a look at the function to compute the frame difference: Mat frameDiff(Mat prevFrame, Mat curFrame, Mat nextFrame) { Mat diffFrames1, diffFrames2, output; // Compute absolute difference between current frame and the next frame absdiff(nextFrame, curFrame, diffFrames1); // Compute absolute difference between current frame and the previous frame absdiff(curFrame, prevFrame, diffFrames2); // Bitwise "AND" operation between the above two diff images bitwise_and(diffFrames1, diffFrames2, output); return output; } Frame differencing is fairly straightforward. You compute the absolute difference between the current frame and previous frame and between the current frame and next frame. We then take these frame differences and apply bitwise AND operator. This will highlight the moving parts in the image. If you just compute the difference between the current frame and previous frame, it tends to be noisy. Hence, we need to use the bitwise AND operator between successive frame differences to get some stability when we see the moving objects. Let's take a look at the function that can extract and return a frame from the webcam: Mat getFrame(VideoCapture cap, float scalingFactor) { //float scalingFactor = 0.5; Mat frame, output; // Capture the current frame cap >> frame; // Resize the frame resize(frame, frame, Size(), scalingFactor, scalingFactor, INTER_AREA); // Convert to grayscale cvtColor(frame, output, CV_BGR2GRAY); return output; } As we can see, it's pretty straightforward. We just need to resize the frame and convert it to grayscale. Now that we have the helper functions ready, let's take a look at the main function and see how it all comes together: int main(int argc, char* argv[]) { Mat frame, prevFrame, curFrame, nextFrame; char ch; // Create the capture object // 0 -> input arg that specifies it should take the input from the webcam VideoCapture cap(0); // If you cannot open the webcam, stop the execution! if( !cap.isOpened() ) return -1; //create GUI windows namedWindow("Frame"); // Scaling factor to resize the input frames from the webcam float scalingFactor = 0.75; prevFrame = getFrame(cap, scalingFactor); curFrame = getFrame(cap, scalingFactor); nextFrame = getFrame(cap, scalingFactor); // Iterate until the user presses the Esc key while(true) { // Show the object movement imshow("Object Movement", frameDiff(prevFrame, curFrame, nextFrame)); // Update the variables and grab the next frame prevFrame = curFrame; curFrame = nextFrame; nextFrame = getFrame(cap, scalingFactor); // Get the keyboard input and check if it's 'Esc' // 27 -> ASCII value of 'Esc' key ch = waitKey( 30 ); if (ch == 27) { break; } } // Release the video capture object cap.release(); // Close all windows destroyAllWindows(); return 1; } How well does it work? As we can see, frame differencing addresses a couple of important problems that we faced earlier. It can quickly adapt to lighting changes or camera movements. If an object comes in the frame and stays there, it will not be detected in the future frames. One of the main concerns of this approach is about detecting uniformly colored objects. 
It can only detect the edges of a uniformly colored object. This is because a large portion of this object will result in very low pixel differences, as shown in the following image: Let's say this object moved slightly. If we compare this with the previous frame, it will look like this: Hence, we have very few pixels that are labeled on that object. Another concern is that it is difficult to detect whether an object is moving toward the camera or away from it. Resources for Article: Further resources on this subject: Tracking Objects in Videos [article] Detecting Shapes Employing Hough Transform [article] Hand Gesture Recognition Using a Kinect Depth Sensor [article]

Transforming data with the Pivot transform in Data Services

Packt
05 Nov 2015
7 min read
Transforming data with the Pivot transform in Data Services In this article by Iwan Shomnikov, author of the book SAP Data Services 4.x Cookbook, you will learn that the Pivot transform belongs to the Data Integrator group of transform objects in Data Services, which are usually all about the generation or transformation (meaning change in the structure) of data. Simply inserting the Pivot transform allows you to convert columns into rows. Pivot transformation increases the amount of rows in a dataset as for every column converted into a row, an extra row is created for every key (the non-pivoted column) pair. Converted columns are called pivot columns. Pivoting rows to columns or columns to rows is quite a common transformation operation in data migration tasks, and traditionally, the simplest way to perform it with the standard SQL language is to use the decode() function inside your SELECT statements. Depending on the complexity of the source and target datasets before and after pivoting, the SELECT statement can be extremely heavy and difficult to understand. Data Services provide a simple and flexible way of pivoting data inside the Data Services ETL code using the Pivot and Reverse_Pivot dataflow object transforms. The following steps show how exactly you can create, configure, and use these transforms in Data Services in order to pivot your data. (For more resources related to this topic, see here.) Getting ready We will use the SQL Server database for the source and target objects, which will be used to demonstrate the Pivot transform available in Data Services. The steps in this section describe the preparation of a source table and the data required in it for a demonstration of the Pivot transform in the Data Services development environment: Create a new database or import the existing test database, AdventureWorks OLTP, available for download and free test use at https://msftdbprodsamples.codeplex.com/releases/view/55330. We will download the database file from the preceding link and deploy it to SQL Server, naming our database AdventureWorks_OLTP. Run the following SQL statements against the AdventureWorks_OLTP database to create a source table and populate it with data: create table Sales.AccountBalance ( [AccountID] integer, [AccountNumber] integer, [Year] integer, [Q1] decimal(10,2), [Q2] decimal(10,2), [Q3] decimal(10,2), [Q4] decimal(10,2)); -- Row 1 insert into Sales.AccountBalance ([AccountID],[AccountNumber],[Year],[Q1],[Q2],[Q3],[Q4]) values (1,100,2015,100.00,150.00,120.00,300.00); -- Row 2 insert into Sales.AccountBalance ([AccountID],[AccountNumber],[Year],[Q1],[Q2],[Q3],[Q4]) values (2,100,2015,50.00,350.00,620.00,180.00); -- Row 3 insert into Sales.AccountBalance ([AccountID],[AccountNumber],[Year],[Q1],[Q2],[Q3],[Q4]) values (3,200,2015,333.33,440.00,12.00,105.50); So, the source table would look similar to the one in the following figure: Create an OLTP datastore in Data Services, referencing the AdventureWorks_OLTP database and AccountBalance import table created in the previous step in it. Create the DS_STAGE datastore in Data Services pointing to the same OLTP database. We will use this datastore as a target to stage in our environment, where we insert the resulting pivoted dataset extracted from the OLTP system. How to do it… This section describes the ETL development process, which takes place in the Data Services Designer application. 
We will not create any workflow or script object in our test jobs; we will keep things simple and have only one batch job object with a dataflow object inside it performing the migration and pivoting of data from the ACCOUNTBALANCE source table of our OLTP database. Here are the steps to do this: Create a new batch job object and place the new dataflow in it, naming it DF_OLTP_Pivot_STAGE_AccountBalance. Open the dataflow in the workspace window to edit it, and place the ACCOUNTBALANCE source table from the OLTP datastore created in the preparation steps. Link the source table to the Extract query transform, and propagate all the source columns to the target schema. Place the new Pivot transform object in a dataflow and link the Extract query to it. The Pivot transform can be found by navigating to Local Object Library | Transforms | Data Integrator. Open the Pivot transform in the workspace to edit it, and configure its parameters according to the following screenshot:   Close the Pivot transform and link it to another query transform named Prepare_to_Load. Propagate all the source columns to the target schema of the Prepare_to_Load transform, and finally link it to the target ACCOUNTBALANCE template table created in the DS_STAGE datastore. Choose the dbo schema when creating the ACCOUNTBALANCE template table object in this datastore. Before executing the job, open the Prepare_to_Load query transform in a workspace window, double-click on the PIVOT_SEQ column, and select the Primary key checkbox to specify the additional column as being the primary key column for the migrated dataset. Save and run the job, selecting the default execution options. Open the dataflow again and import the target table, putting the Delete data from table before loading flag in the target table loading options. How it works… Pivot columns are columns whose values are merged in one column after the pivoting operation, thus producing an extra row for every pivoted column. Non-pivot columns are columns that are not affected by the pivot operation. As you can see, the pivoting operation denormalizes the dataset, generating more rows. This is why ACCOUNTID does not define the uniqueness of the record anymore, and we have to specify the extra key column, PIVOT_SEQ.   So, you may wonder, why pivot? Why don't we just use data as it is and perform the required operation on data from the columns Q1-Q4? The answer, in the given example, is very simple—it is much more difficult to perform an aggregation when the amounts are spread across different columns. Instead of summarizing using a single column with the sum(AMOUNT) function, we would have to write the sum(Q1 + Q2 + Q3 + Q4) expression every time. Quarters are not the worst part yet; imagine a situation where a table has huge amounts of data stored in columns defining month periods and you have to filter data by these time periods. Of course, the opposite case exists as well; storing data across multiple columns instead of just in one is justified. In this case, if your data structure is not similar to this, you can use the Reverse_Pivot transform, which does exactly the opposite—it converts rows into columns. Look at the following example of a Reverse_Pivot configuration: Reverse pivoting or transformation of rows into columns leads us to introduce another term—Pivot axis column. This is a column that holds categories defining different columns after a reverse pivot operation. It corresponds to the Header column option in the Pivot transform configuration. 
Summary As you noted in this article, the Pivot and Reverse_Pivot transform objects available in Data Services Designer are a simple and easily configurable way to pivot data of any complexity. The GUI of the Designer tool makes maintaining the ETL process developed in Data Services easy and keeps it readable. If you make any changes to the pivot configuration options, Data Services automatically regenerates the output schema in pivot transforms accordingly.   Resources for Article: Further resources on this subject: Sabermetrics with Apache Spark [article] Understanding Text Search and Hierarchies in SAP HANA [article] Meeting SAP Lumira [article]

Getting Started with Tableau Public

Packt
04 Nov 2015
12 min read
In this article by Ashley Ohmann and Matthew Floyd, the authors of Creating Data Stories with tableau Public. Making sense of data is a valued service in today's world. It may be a cliché, but it's true that we are drowning in data and yet, we are thirsting for knowledge. The ability to make sense of data and the skill of using data to tell a compelling story is becoming one of the most valued capabilities in almost every field—business, journalism, retail, manufacturing, medicine, and public service. Tableau Public (for more information, visit www.tableaupublic.com), which is Tableau 's free Cloud-based data visualization client, is a powerfully transformative tool that you can use to create rich, interactive, and compelling data stories. It's a great platform if you wish to explore data through visualization. It enables your consumers to ask and answer questions that are interesting to them. This article is written for people who are new to Tableau Public and would like to learn how to create rich, interactive data visualizations from publicly available data sources that they can easily share with others. Once you publish visualizations and data to Tableau Public, they are accessible to everyone, and they can be viewed and downloaded. A typical Tableau Public data visualization contains public data sets such as sports, politics, public works, crime, census, socioeconomic metrics, and social media sentiment data (you also can create and use your own data). Many of these data sets either are readily available on the Internet, or can accessed via a public records request or search (if they are harder to find, they can be scraped from the Internet). You can now control who can download your visualizations and data sets, which is a feature that was previously available only to the paid subscribers. Tableau Public has a current maximum data set size of 10 million rows and/or 10 GB of data. (For more resources related to this topic, see here.) In this article, we will walk through an introduction to Tableau, which includes the following topics: A discussion on how you can use Tableau Public to tell your data story Examples of organizations that use Tableau Public Downloading and installing the Tableau Public software Logging in to Tableau Public Creating your very own Tableau Public profile Discovering the Tableau Public features and resources Taking a look at the author profiles and galleries on the Tableau website to browse other authors' data visualizations (this is a great way to learn and gather ideas on how to best present our data) An Tableau Public overview Tableau Public allows everyone to tell their data story and create compelling and interactive data visualizations that encourage discovery and learning. Tableau Public is sold at a great price—free! It allows you as a data storyteller to create and publish data visualizations without learning how to code or having special knowledge of web publishing. In fact, you can publish data sets of up to 10 million rows or 10 GB to Tableau Public in a single workbook. Tableau Public is a data discovery tool. It should not be confused with enterprise-grade business intelligence tools, such as Tableau Desktop and Tableau Server, QlikView, and Cognos Insight. Those tools integrate with corporate networks and security protocol as well as server-based data warehouses. Data visualization software is not a new thing. Businesses have used software to generate dashboards and reports for decades. 
The twist comes with data democracy tools such as Tableau Public. Journalists and bloggers who would like to augment their reporting of static text and graphics can use data discovery tools such as Tableau Public to create riveting, rich data visualizations, which may comprise one or more charts, graphs, tables, and other objects that can be controlled by the readers to allow for discovery. The people who are active members of the Tableau Public community have a few primary traits in common: they are curious, they are generous with their knowledge and time, and they enjoy conversations that relate data to the world around us. Tableau Public maintains a list of blogs of data visualization experts using Tableau software. In the following screenshot, Tableau Zen Masters Anya A'hearn of Databrick and Allan Walker used data on San Francisco bike sharing to show the financial benefits of the Bay Area Bike Share, a city-sponsored 30-minute bike sharing program, as well as a map of both the proposed expansion of the program and how far a person can actually ride a bike in half an hour. This dashboard is featured in the Tableau Public gallery because it relates data to users clearly and concisely. It presents a great public interest story (commuting more efficiently in a notoriously congested city) and then grabs the viewer's attention with maps of current and future offerings. The second dashboard within the analysis is significant as well. The authors described the Geographic Information Systems (GIS) tools that they used to create their innovative maps, as well as the methodology that went into the final product, so that users who are new to the tool can learn how to create similar functionality for their own purposes:
Image republished under the terms of fair use, creators: Anya A'hearn and Allan Walker. Source: https://public.tableausoftware.com/views/30Minutes___BayAreaBikeShare/30Minutes___?:embed=y&:loadOrderID=0&:display_count=yes
As humans, we relate our experiences to each other in stories, and data points are an important component of stories. They quantify phenomena and, when combined with human actions and emotions, can make them more memorable. When authors create public interest story elements with Tableau Public, readers can interact with the analyses, which creates a highly personal experience and translates into increased participation and decreased abandonment. It's not difficult to embed Tableau Public visualizations into websites and blogs; it is as easy as copying and pasting the JavaScript snippet that Tableau Public renders for you automatically (a simple sketch of an embed is shown after this paragraph). Using Tableau Public increases accessibility to stories, too. You can view data stories on mobile devices with a web browser and then share them with friends on social media sites such as Twitter and Facebook using Tableau Public's sharing functionality. Stories can be told with the help of text as well as popular and tried-and-true visualization types such as maps, bar charts, lists, heat maps, line charts, and scatterplots. Maps are particularly easy to build in Tableau Public, easier than in most other data visualization offerings, because Tableau has integrated geocoding (down to the city and postal code) directly into the application. Tableau Public also has a built-in date hierarchy that makes it easy for users to drill through time dimensions just by clicking on a button.
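To give a feel for what that pasted snippet does, here is a minimal, illustrative way to embed the Bay Area Bike Share dashboard referenced above in a web page. This is only a sketch: it assumes that an iframe pointing at the published view URL (with the :embed=y parameter, as in the source link above) is acceptable for your site, and the width and height values are arbitrary. In practice, you should copy the exact embed code that Tableau Public generates from the Share button on the published view rather than hand-writing it.

<!-- Illustrative embed only; copy the real snippet from the Share button on the published view. -->
<iframe
  src="https://public.tableausoftware.com/views/30Minutes___BayAreaBikeShare/30Minutes___?:embed=y&:display_count=yes"
  width="1000" height="800" frameborder="0">
</iframe>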
One of Tableau Software's taglines, Data to the People, is a reflection not only of the ability to distribute analyses to thousands of people in one go, but also of the enhanced ability of nontechnical users to explore their own data easily and derive relevant insights for their own community without having to learn a slew of technical skills.
Telling your story with Tableau Public
Tableau was originally developed in the Stanford University Computer Science department, where a research project sponsored by the U.S. Department of Defense was launched to study how people can analyze data rapidly. This project merged two branches of computer science: understanding data relationships and computer graphics. This mash-up turned out to be an effective way for people to understand and digest complex data relationships rapidly and, in effect, to help readers consume data. The project eventually moved from the Stanford campus to the corporate world, and Tableau Software was born. Tableau usage and adoption have since skyrocketed. At the time of writing this book, Tableau is the fastest-growing software company in the world, and it competes directly with the older software manufacturers for data visualization and discovery: Microsoft, IBM, SAS, Qlik, and Tibco, to name a few. Most of these are compared to each other by Gartner in its annual Magic Quadrant. For more information, visit http://www.gartner.com/technology/home.jsp.
Tableau Software's flagship program, Tableau Desktop, is commercial software used by many organizations and corporations throughout the world. Tableau Public is the free version of Tableau's offering. It is typically used with nonconfidential data, either from the public domain or data that we have collected ourselves. This free public offering is truly unique in the business intelligence and data discovery industry. There is no other software like it: powerful, free, and open to data story authors.
There are a few terms in this article that might be new to you. You, as an author, will load data into a workbook, which you will save in the Tableau Public cloud. A visualization is a single graph, and it typically lives on a worksheet. One or more visualizations are placed on a dashboard, which is where your users will interact with your data. One of the wonderful features of Tableau Public is that you can load data and visualize it on your own. Traditionally, this has been an activity undertaken with the help of programmers at work. With Tableau Public and newer blogging platforms, nonprogrammers can develop a data visualization, publish it to the Tableau Public website, and then embed the data visualization on their own website.
The basic steps that are required to create a Tableau Public visualization are as follows:
Gather your data sources, usually in a spreadsheet or a .csv file.
Prepare and format your data to make it usable in Tableau Public.
Connect to the data and start building the data visualizations (charts, graphs, and many other objects).
Save and publish your data visualization to the Tableau Public website.
Embed your data visualization in your web page by using the code that Tableau Public provides.
Tableau Public is used by some of the leading news organizations across the world, including The New York Times, The Guardian (UK), National Geographic (US), the Washington Post (US), the Boston Globe (US), La Informacion (Spain), and Época (Brazil). Now, we will discuss installing Tableau Public.
Then, we will take a look at how we can find some of these visualizations out there in the wild so that we can learn from others and create our own original visualizations.
Installing Tableau Public
Let's look at the steps required for the installation of Tableau Public:
To download Tableau Public, visit the Tableau Software website at http://public.tableau.com/s/. Enter your e-mail address and click on the Download the App button located at the center of the screen, as shown in the following screenshot:
The downloaded version of Tableau Public is free, and it is not a limited release or demo version; it is a fully functional version of Tableau Public.
Once the download begins, a Thank You screen gives you the option of retrying the download if it does not begin automatically, or of downloading a different version. The version of Tableau Public that gets downloaded automatically is the 64-bit version for Windows. Users of Macs should download the appropriate version for their computers, and users with 32-bit Windows machines should download the 32-bit version.
Check your Windows computer's system type (32- or 64-bit) by navigating to Start and then Computer, right-clicking on the Computer option, selecting Properties, and viewing the System properties. 64-bit systems will be noted as such. 32-bit systems will either state that they are 32-bit or carry no indication of being a 32- or 64-bit system.
While the Tableau Public executable file downloads, you can scroll to the lower section of the Thank You page to learn more about the new features of Tableau Public 9.0. The speed with which Tableau Public downloads depends on the download speed of your network; the 109 MB file usually takes a few minutes to download.
The TableauPublicDesktop-xbit.msi file (where x = 32 or 64, depending on which version you selected) is downloaded. Navigate to the .msi file in Windows Explorer or in the browser window and click on Open. Then, click on Run in the Open File - Security Warning dialog box shown in the following screenshot. The Windows installer starts the Tableau installation process:
Once you have opted to run the application, the next screen prompts you to view the License Agreement and accept its terms. If you wish to read the terms of the license agreement, click on the View License Agreement… button.
You can customize the installation if you'd like. Options include the directory in which the files are installed as well as the creation of a desktop icon and a Start Menu shortcut (for Windows machines). If you do not customize the installation, Tableau Public will be installed in the default directory on your computer, and the desktop icon and Start Menu shortcut will be created.
Select the checkbox that indicates I have read and accept the terms of this License Agreement, and click on Install. If a User Account Control dialog box appears with the Do you want to allow the following program to install software on this computer? prompt, click on Yes.
Tableau Public will be installed on your computer, with the status bar indicating the progress. When Tableau Public has been installed successfully, the home screen opens.
Exploring Tableau Public
The Tableau Public home screen has several features that allow you to do the following operations:
Connect to data
Open your workbooks
Discover the features of Tableau Public
Tableau encourages new users to watch the video on this first welcome page. To do so, click on the button named Watch the Getting Started Video.
You can start building your first Tableau Public workbook at any time.
Connecting to data
You can connect to the following four different data source types in Tableau Public by clicking on the appropriate format name:
Microsoft Excel files
Text files with a variety of delimiters
Microsoft Access files
OData files
Summary
In this article, we learned how Tableau Public is commonly used. We also learned how to download and install Tableau Public, explored Tableau Public's features, learned about the Tableau Desktop tool, and discovered other authors' data visualizations using the Tableau galleries and the Recommended Authors and Profile Finder functions on the Tableau website.

Introduction to Couchbase

Packt
04 Nov 2015
20 min read
In this article by Henry Potsangbam, the author of the book Learning Couchbase, we will learn that Couchbase is a NoSQL (nonrelational) database management system, which differs from traditional relational database management systems in many significant ways. It is designed for distributed data stores with very large-scale data storage requirements (terabytes and petabytes of data). These types of data stores might not require fixed schemas, avoid join operations, and typically scale horizontally. The main feature of Couchbase is that it is schemaless: there is no fixed schema for storing data, and there are no joins between one or more data records or documents. It allows distributed storage and utilizes computing resources, such as CPU and RAM, spanning the nodes that are part of the Couchbase cluster.
Couchbase databases provide the following benefits:
A flexible data model. You don't need to worry about the schema; you can design your schema depending on the needs of your application domain, not on storage demands.
Easy scalability. Since it's a distributed system, it can scale out horizontally without too many changes in the application. You can scale out with a few mouse clicks and rebalance it very easily.
High availability, since there are multiple servers and data is replicated across nodes.
(For more resources related to this topic, see here.)
The architecture of Couchbase
Couchbase clusters consist of multiple nodes. A cluster is a collection of one or more instances of the Couchbase server that are configured as a logical cluster. The following is a Couchbase server architecture diagram:
Couchbase Server Architecture
As mentioned earlier, while most cluster technologies work on master-slave relationships, Couchbase works on a peer-to-peer node mechanism. This means there is no difference between the nodes in the cluster; the functionality provided by each node is the same. Thus, there is no single point of failure. When one node fails, another node takes up its responsibility, thus providing high availability.
The data manager
Any operation performed on the Couchbase database system gets stored in memory, which acts as a caching layer. By default, every document gets stored in memory for each read, insert, update, and so on, until the memory is full. It's a drop-in replacement for Memcached. However, in order to provide persistence of the records, there is a concept called the disk queue, which flushes records to the disk asynchronously, without impacting the client request. This functionality is provided automatically by the data manager, without any human intervention.
Cluster management
The cluster manager is responsible for node administration and node monitoring within a cluster. Every node within a Couchbase cluster includes the cluster manager component, data storage, and the data manager. It manages data storage and retrieval, and it contains the memory cache layer, the disk persistence mechanism, and the query engine. Couchbase clients use the cluster map provided by the cluster manager to find out which node holds the required data, and then communicate with the data manager on that node to perform database operations.
Buckets
In an RDBMS, we usually encapsulate all of the relevant data for a particular application in a database. Say, for example, we are developing an e-commerce application.
We usually create a database named e-commerce that will be used as the logical namespace to store records in tables, such as customer or shopping cart details. This is called a bucket in Couchbase terminology. So, whenever you want to store any document in a Couchbase cluster, you will create a bucket as a logical namespace as the first step. Precisely, a bucket is an independent virtual container that groups documents logically in a Couchbase cluster, which is equivalent to a database namespace in an RDBMS. It can be accessed by various clients in an application. You can also configure features such as security, replication, and so on per bucket. In RDBMS development, we usually create one database and consolidate all related tables in that namespace. Likewise, in Couchbase too, you will usually create one bucket per application and encapsulate all the documents in it.
Now, let me explain this concept in detail, since it's the component that administrators and developers will be working with most of the time. In fact, I used to wonder why it is named a "bucket". Perhaps it's because we can store anything in it, as we do in the physical world; hence the name "bucket". In any database system, the main purpose is to store data, and the logical namespace for storing data is called a database. Likewise, in Couchbase, the namespace for storing data is called a bucket. So in brief, it's a data container that stores data related to applications, either in RAM or on disk. In fact, it helps you partition application data depending on an application's requirements. If you are hosting different types of applications in a cluster, say an e-commerce application and a data warehouse, you can partition them using buckets: you can create two buckets, one for the e-commerce application and another for the data warehouse. As a rule of thumb, you create one bucket for each application.
In an RDBMS, we store data in the form of rows in a table, which in turn is encapsulated by a database. In Couchbase, a bucket is the equivalent of a database, but there is no concept of tables in Couchbase. In Couchbase, all data or records, which are referred to as documents, are stored directly in a bucket. Basically, the lowest namespace for storing documents or data in Couchbase is a bucket. Internally, Couchbase arranges to store documents in different storages for different buckets. Information such as runtime statistics is collected and reported by the Couchbase cluster, grouped by the bucket type. Buckets also enable you to flush out individual buckets: you can create a separate temporary bucket, rather than a regular transaction bucket, when you need temporary storage for ad hoc requirements, such as reporting or a temporary workspace for application programming, so that you can flush out the temporary bucket after use. The features or capabilities of a bucket depend on its type, which will be discussed next.
Types of buckets
Couchbase provides two types of buckets, which are differentiated by their storage mechanism and capabilities. The two types are Memcached and Couchbase.
Memcached
As the name suggests, buckets of the Memcached type store documents only in RAM. This means that documents stored in a Memcached bucket are volatile in nature; hence, such buckets won't survive a system reboot. Documents that are stored in such buckets are accessible by direct address using the key-value pair mechanism. The bucket is distributed, which means that it is spread across the Couchbase cluster nodes.
Since it's volatile in nature, you need to be sure of the use cases before using such types of buckets. You can use this kind of bucket to store data that is required temporarily and needs better performance, since all of the data is stored in memory, but that doesn't require durability. Suppose you need to display a list of countries in your application; instead of always fetching it from disk storage, the best way is to fetch the data from the disk once, populate a Memcached bucket with it, and use that in your application. In the Memcached bucket, the maximum document size allowed is 1 MB. All of the data is stored in RAM, and if the bucket is running out of memory, the oldest data is discarded. We can't replicate a Memcached bucket. It's completely compatible with the open source Memcached distributed memory object caching system. If you want to know more about the Memcached technology, you can refer to http://memcached.org/.
Couchbase
The Couchbase bucket type gives persistence to documents. It is distributed across a cluster of nodes and supports replication, which is not available for the Memcached bucket type. It's highly available, since documents are replicated across nodes in a cluster. You can verify the bucket using the web Admin UI as follows:
Understanding documents
By now, you must have understood the concept of buckets, how they work, how they are configured, and so on. Let's now understand the items that get stored in them. So, what is a document? A document is a piece of information or data that gets stored in a bucket. It's the smallest item that can be stored in a bucket. As a developer, you will always be working on a bucket in terms of documents. Documents are similar to rows in an RDBMS table schema, but in NoSQL terminology they are referred to as documents. It's a way of thinking about and designing data objects: all information and data should be stored as a document, much as it would be represented in a physical document. All NoSQL databases, including Couchbase, don't require a fixed schema to store documents or data in a particular bucket. These documents are represented in the form of JSON. For the time being, let's try to understand the document at a basic level. Let me show you how a document is represented in Couchbase for better clarity. You need to install the beer-sample bucket for this, which comes along with the Couchbase software installation. If you did not install it earlier, you can do it from the web console using the Settings button.
The document overview
The preceding screenshot shows a document; it represents a brewery, and its document ID is 21st_amendment_brewery_cafe. Each document can have multiple properties/items along with their values. For example, name is a property and 21st Amendment Brewery Café is the value of the name property. So, what is this document ID? The document ID is a unique identifier that is assigned to each document in a bucket. You need to assign a unique ID whenever a document gets stored in a bucket. It's just like the primary key of a table in an RDBMS.
Keys and metadata
As described earlier, a document key is a unique identifier for a document. The value of a document key can be any string. In addition to the key, documents usually have three more types of metadata, which are provided by the Couchbase server unless modified by an application developer. They are as follows:
rev: This is an internal revision ID meant for internal use by Couchbase. It should not be used in the application.
expiration: If you want your document to expire after a certain amount of time, you can set that value here. By default, it is 0, that is, the document never expires.
flags: These are numerical values specific to the client library that are updated when the document is stored.
Document modeling
In order to bring agility to applications that change business processes frequently, as demanded by the business environment, being schemaless is a good feature. With this methodology, you don't need to be concerned about the structure of the data initially while designing the application. This means that, as a developer, you don't need to worry about the structures of a database schema, such as tables, or worry about splitting information into various tables; instead, you should focus on the application requirements and on satisfying business needs.
I still recollect various moments related to designing domain objects/tables that I went through when I was a developer, especially when I had just graduated from engineering college and was developing applications for a corporate company. Whenever I was part of the discussions for any application requirement, at the back of my mind I had some of these questions:
How does a domain object get stored in the database?
What will be the table structures?
How will I retrieve the domain objects?
Will it be difficult to use an ORM such as Hibernate, EJB, and so on?
My point here is that, instead of being mentally present in the discussion on requirement gathering and understanding the business requirements in detail, I spent more time mapping business entities to a table format. The reason was that, if I did not raise the technical constraints at that time, it would be difficult to come back later with the technical challenges we might face in the data structure design. Earlier, whenever we talked about application design, we always thought about database design structures, such as converting objects into multiple tables using normalization forms (2NF/3NF), and spent a lot of time mapping database objects to application objects using various ORM tools, such as Hibernate, EJB, and so on.
In document modeling, we always think in terms of application requirements, that is, the data or information flow, while designing documents, not in terms of storage. We can simply start our application development using the business representation of an entity without much concern about the storage structures. Having covered the various advantages provided by a document-based system, we will discuss in this section how to design such documents for storage in any document-based database system, such as Couchbase. Then, we can effectively design domain objects for coherence with the application requirements.
Whenever we model a document's structure, we need to consider two main options: one is to store all the information in one document, and the other is to break it down into multiple documents. You need to weigh these options and choose one keeping the application requirements in mind. So, an important factor is to evaluate whether the information contains unrelated data components that are independent and can be broken up into different documents, or whether all the components represent a complete domain object that will be accessed together most of the time. If the data components of a piece of information are related and will usually be required together in the business logic, consider grouping them in a single logical container so that the application developer doesn't perceive them as separate objects or documents.
All of these factors depend on the nature of the application being developed and its use cases. Besides these, you need to think in terms of how the information is accessed: atomicity, a single unit of access, and so on. You can ask yourself a question such as, "Are we going to create or modify the information as a single unit or not?". We also need to consider concurrency: what will happen when the document is accessed by multiple clients at the same time, and so on. After looking at all these considerations that you need to keep in mind while designing a document, you have two options: one is to keep all of the information in a single document, and the other is to have a separate document for every different object type.
Couchbase SDK overview
We have also discussed some of the guidelines used for designing a document-based database system. What if we need to connect to and perform operations on the Couchbase cluster in an application? This can be achieved using the Couchbase client libraries, which are collectively known as the Couchbase Software Development Kit (SDK). The Couchbase SDK APIs are language dependent. However, the concepts remain the same and are applicable to all languages that are supported by the SDK. Let's now try to understand the Couchbase APIs as a concept without referring to any specific language, and then we will map these concepts to Java APIs in the Java SDK section. Couchbase SDK clients are also known as smart clients, since they understand the overall status of the cluster, that is, the cluster map, and keep the information about the vBuckets and their server nodes updated. There are two types of Couchbase clients, as follows:
Smart clients: Such clients can understand the health of the cluster and receive constant updates about the information of the cluster. Each smart client maintains a cluster map that can derive the cluster node where a document is stored using the document ID; examples include Java, .NET, and so on.
Memcached-compatible: Such clients are used for applications that interact with the traditional memcached bucket, which is not aware of vBuckets. You need to install Moxi (a memcached proxy) on all clients that require access to the Couchbase memcached bucket; it acts as a proxy to convert the API calls into memcached-compatible calls.
Understanding the write operation in the Couchbase cluster
Let's understand how the write operation works in the Couchbase cluster. When a write command is issued using the set operation on the Couchbase cluster, the server responds as soon as the document is written to the memory of that particular node. How do clients know which node in the cluster will be responsible for storing the document? You might recall that every operation requires a document ID; using this document ID, the hash algorithm determines the vBucket to which it belongs. This vBucket is then used to determine the node that will store the document. All the mapping information, from vBuckets to nodes, is stored in each of the Couchbase client SDKs, and this forms the cluster map.
Views
Whenever we want to extract fields from JSON documents without the document ID, we use views. If you want to find a document, or fetch information about a document, using attributes or fields other than the document ID, a view is the way to go. Views are written in the form of MapReduce, which we have discussed earlier; that is, they consist of a map and a reduce phase. Couchbase implements MapReduce using the JavaScript language.
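The book's own view code appears in a screenshot that is not reproduced in this extract, so here is a minimal sketch of what a map function looks like. It assumes the beer-sample bucket described earlier, where brewery documents carry a type attribute of "brewery"; adapt the attribute names to your own documents.

function (doc, meta) {
  // The view engine calls this function for every document in the bucket;
  // index only breweries that actually have a name attribute.
  if (doc.type == "brewery" && doc.name) {
    // The emitted key becomes part of the index; the value is left as null
    // because the document itself can be fetched by its key.
    emit(doc.name, null);
  }
}

Querying this view then returns one index entry per brewery, keyed by the brewery name.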
The following diagram shows you how various documents are passed through the view engine to produce an index. The view engine ensures that all the documents in the bucket are passed through the map method for processing, and subsequently to the reduce function, to create indexes.
When we write views, the view engine defines materialized views for the JSON documents and then queries across the dataset in the bucket. Couchbase provides a view processor that processes all the documents with the map and reduce methods defined by the developer to create views. The views are maintained locally by each node for the documents stored on that particular node. Views are created only for documents that are stored on the disk.
A view's life cycle
A view has its own life cycle. You need to define, build, and query it, as shown in this diagram:
View life cycle
Initially, you will define the logic of MapReduce and build it on each node for each document that is stored locally. In the build phase, we usually emit those attributes that need to be part of the indexes. Views usually work on JSON documents only. If documents are not in the JSON format, or the attributes that we emit in the map function are not part of the document, then the document is ignored by the view engine during the generation of views. Finally, views are queried by clients to retrieve and find documents. After the completion of this cycle, you can still change the definition of MapReduce; for that, you need to bring the view into development mode and modify it. Thus, you have the view cycle shown in the preceding diagram while developing a view.
The map function sketched earlier shows what a view looks like. A view has a predefined syntax: you can't change the method signature, and it follows the functional programming style. The map method accepts two parameters:
doc: This represents the entire document
meta: This represents the metadata of the document
Each call to map can return objects in the form of a key and a value; this is represented by the emit() method. The emit() method returns a key and a value. However, the value will usually be null. Since we can retrieve a document using its key, it's better to use the key instead of the value field of the emit() method.
Custom reduce functions
Why do we need custom reduce functions? Sometimes the built-in reduce functions don't meet our requirements, although they will suffice most of the time. Custom reduce functions allow you to create your own reduce logic. In such a reduce function, the output of the map function goes to the corresponding reduce group as per the key of the map output and the group-level parameter. Couchbase ensures that the output from the map phase is grouped by key and supplied to reduce; it's then the developer's role to define in reduce what to do with the data, such as aggregation, addition, and so on. To handle incremental MapReduce (that is, updating an existing view), each reduce function must also be able to handle and consume its own output. In an incremental situation, the function must handle both new records and previously computed reductions. So the input to the reduce function can be not only raw data from the map phase, but also the output of a previous reduce phase. This is called re-reduce and can be identified by the third argument of reduce(). When the re-reduce argument is false, both the key and value arguments are arrays, and each element of the value array corresponds to the matching element of the key array.
For example, key[1] is the key for value[1]. The map-to-reduce execution flow is shown as follows:
Map reduce execution in a view
N1QL overview
So far, you have learned how to fetch documents in two ways: using the document ID and using views. The third way of retrieving documents is by using N1QL, pronounced "nickel". Personally, I feel that it was a great move by Couchbase to provide an SQL-like syntax, since most engineers and IT professionals are quite familiar with SQL, which is usually part of their formal education. It gives them confidence and also makes it easier to use Couchbase in their applications. Moreover, it covers most of the database operations related to development. N1QL can be used to:
Store documents, that is, the INSERT command
Fetch documents, that is, the SELECT command
Prior to the advent of N1QL, developers needed to perform key-based operations, which was quite complex when it came to retrieving information using views and custom reduce functions. With the previously available options, developers needed to know the key before performing any operation on a document, which is not always the case. Before the N1QL features were incorporated in Couchbase, you could not perform ad hoc queries on the documents in a bucket until you created views on it. Moreover, sometimes we need to perform joins or searches in the bucket, which is not possible using the document ID and views. All of these drawbacks are addressed by N1QL. I would go so far as to say that N1QL is an evolution in Couchbase history.
Understanding the N1QL syntax
Most N1QL queries will be in the following format:
SELECT [DISTINCT] <expression>
FROM <data source>
WHERE <expression>
GROUP BY <expression>
ORDER BY <expression>
LIMIT <number>
OFFSET <number>
The preceding statement is very generic; it shows the comprehensive options provided by N1QL in one syntax. Here is a simple example:
SELECT * FROM LearningCouchbase
This selects all the documents stored in the bucket LearningCouchbase. The output of the query is in the JSON document format only; all the documents returned by an N1QL query are returned as an array under the resultset attribute.
Summary
In this article, you learned how to design a document-based data schema and how to connect to Couchbase from a Java-based application using connection pooling. You also learned how to retrieve data using MapReduce-based views, how to use N1QL's SQL-like syntax to extract documents from a Couchbase bucket, and how to achieve high availability with XDCR. You can also perform a full-text search by integrating the Elasticsearch plugin.

Big Data Analytics

Packt
03 Nov 2015
10 min read
In this article, Dmitry Anoshin, the author of Learning Hunk, will talk about putting Hunk on top of Hadoop: how to extract Hunk onto a VM, set up a connection with Hadoop, and create dashboards. We are living in a century of information technology. There are a lot of electronic devices around us that generate a lot of data. For example, you can surf the Internet, visit a couple of news portals, order new Airmax from a web store, write a couple of messages to your friend, and chat on Facebook. Every action produces data; if we multiply these actions by the number of people who have access to the Internet, or who just use a mobile phone, we get really big data. Of course, you may ask: how big is it? I suppose it now starts at terabytes, or even petabytes. The volume is not the only issue; we also struggle with the variety of data. As a result, it is not enough to analyze only structured data. We should dive deep into unstructured data, such as machine data generated by various machines.
World-famous enterprises try to collect this extremely big data in order to monetize it and find business insights. Big data offers us new opportunities; for example, we can enrich customer data via social networks using the APIs of Facebook or Twitter. We can build customer profiles and try to predict customer wishes in order to sell our products or improve the customer experience. It is easy to say, but difficult to do. However, organizations try to overcome these challenges and use big data stores, such as Hadoop.
(For more resources related to this topic, see here.)
The big problem
Hadoop is a distributed file system and a framework for computation. It is relatively easy to get data into Hadoop, and there are plenty of tools for getting data in different formats into it. However, it is extremely difficult to get value out of the data that you put into Hadoop. Let's look at the path from data to value. First, we have to start with the collection of data. Then, we spend a lot of time preparing it, making sure this data is available for analysis, and being able to ask questions of it. It looks as follows:
Unfortunately, sometimes the questions that you ask are not good, or the answers that you get are not clear, and you have to repeat this cycle all over again. Maybe you have to transform and format your data again. In other words, it is a long and challenging process. What you actually want is something that collects data and lets you spend some time preparing it; then you would be able to ask questions and get answers from the data repeatedly. Now you can spend a lot of time asking multiple questions. In addition, you are able to iterate on those questions with the data to refine the answers that you are looking for.
The elegant solution
What if we could take Splunk and put it on top of all the data stored in Hadoop? And that is what the Splunk company actually did. The following figure shows how we got Hunk as the name of the new product:
Let's discuss some of the solution goals Hunk's inventors were thinking about when they were planning Hunk. Splunk can take data from Hadoop via the Splunk Hadoop Connection app. However, it is a bad idea to copy massive data from Hadoop to Splunk; it is much better to process data in place, because Hadoop provides both storage and computation, so why not take advantage of both? Splunk has the extremely powerful Splunk Processing Language (SPL), which is one of Splunk's key advantages because it offers a wide range of analytic functions; this is why it is a good idea to keep SPL in the new product. Splunk has true schema on the fly.
The data that we store in Hadoop changes constantly, so Hunk should be able to build the schema on the fly, independent of the format of the data. It's also a very good idea to have the ability to preview results. As you know, while a search is running, you are able to get incremental results. This can dramatically reduce waiting time: for example, we don't need to wait until the MapReduce job is finished; we can look at the incremental results and, in the case of a wrong result, restart the search query. Finally, the deployment of Hadoop is not easy; Splunk tries to make the installation and configuration of Hunk easy for us.
Getting up Hunk
In order to start exploring the Hadoop data, we have to install Hunk on top of our Hadoop cluster. Hunk is easy to install and configure. Let's learn how to deploy Hunk version 6.2.1 on top of an existing CDH cluster. It's assumed that your VM is up and running.
Extracting Hunk to the VM
To extract Hunk to the VM, perform the following steps:
Open the console application.
Run ls -la to see the list of files in your home directory:
[cloudera@quickstart ~]$ cd ~
[cloudera@quickstart ~]$ ls -la | grep hunk
-rw-r--r--   1 root     root     113913609 Mar 23 04:09 hunk-6.2.1-249325-Linux-x86_64.tgz
Unpack the archive:
cd /opt
sudo tar xvzf /home/cloudera/hunk-6.2.1-249325-Linux-x86_64.tgz -C /opt
Setting up Hunk variables and configuration files
Perform the following steps to set up the Hunk variables and configuration files:
It's time to set the SPLUNK_HOME environment variable. This variable is already added to the profile; this is just to bring to your attention that it must be set:
export SPLUNK_HOME=/opt/hunk
Use the default splunk-launch.conf. This is the basic properties file used by the Hunk service. We don't have to change anything special there, so let's use the default settings:
sudo cp /opt/hunk/etc/splunk-launch.conf.default /opt/hunk/etc/splunk-launch.conf
Running Hunk for the first time
Perform the following steps to run Hunk:
Run Hunk:
sudo /opt/hunk/bin/splunk start --accept-license
Here is the sample output from the first run:
This appears to be your first time running this version of Splunk.
Copying '/opt/hunk/etc/openldap/ldap.conf.default' to '/opt/hunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus
Some output lines were deleted to reduce the amount of log text
Waiting for web server at http://127.0.0.1:8000 to be available.... Done
If you get stuck, we're here to help. Look for answers here: http://docs.splunk.com
The Splunk web interface is at http://vm-cluster-node1.localdomain:8000
Setting up a data provider and virtual index for the CDR data
We need to accomplish two tasks: provide a technical connector to the underlying data storage, and create a virtual index for the data in that storage. Log in to http://quickstart.cloudera:8000. The system will ask you to change the default admin user password; I set it to admin.
Setting up a connection to Hadoop
Right now, we are ready to set up the integration between Hadoop and Hunk. First, we need to specify the way Hunk connects to the current Hadoop installation. We are using the most recent way: YARN with MR2. Then, we have to point virtual indexes to the data stored in Hadoop. To do this, perform the following steps:
Click on Explore Data.
Click on Create a provider.
Let's fill in the form to create the data provider:
Name: hadoop-hunk-provider
Java home: /usr/java/jdk1.7.0_67-cloudera
Hadoop home: /usr/lib/hadoop
Hadoop version: Hadoop 2.x (YARN)
filesystem: hdfs://quickstart.cloudera:8020
Resource Manager Address: quickstart.cloudera:8032
Resource Scheduler Address: quickstart.cloudera:8030
HDFS Working Directory: /user/hunk
Job Queue: default
You don't have to modify any other properties. The HDFS working directory has been created for you in advance. You can create it using the following command:
sudo -u hdfs hadoop fs -mkdir -p /user/hunk
If you did everything correctly, you should see a screen similar to the following screenshot.
Let's briefly discuss what we have done:
We told Hunk where Hadoop home and Java are. Hunk uses Hadoop streaming internally, so it needs to know how to call Java and Hadoop streaming. You can inspect the jobs submitted by Hunk (discussed later) and see the following lines, showing the MapReduce JAR submitted by Hunk:
/opt/hunk/bin/jars/sudobash /usr/bin/hadoop jar "/opt/hunk/bin/jars/SplunkMR-s6.0-hy2.0.jar" "com.splunk.mr.SplunkMR"
We also needed to tell Hunk where the YARN Resource Manager and Scheduler are located. These services allow us to ask for cluster resources and run jobs.
A job queue can be useful in a production environment; in real life, you could have several queues for cluster resource distribution. We set the queue name to default, since we are not discussing cluster utilization and load balancing here.
Setting up a virtual index for the data stored in Hadoop
Now it's time to create a virtual index. We are going to add the dataset with the Avro files to the virtual index as example data:
Click on Explore Data and then click on Create a virtual index.
You'll get a message telling you that there are no indexes; just click on New Virtual Index.
A virtual index is metadata: it tells Hunk where the data is located and which provider should be used to read the data.
Name: milano_cdr_aggregated_10_min_activity
Path to data in HDFS: /masterdata/stream/milano_cdr
Here is an example of the screen you should see after you create your first virtual index.
Accessing data through the virtual index
To access data through the virtual index, perform the following steps:
Click on Explore Data and select a provider and virtual index.
Select part-m-00000.avro by clicking on it. The Next button will be activated after you pick a file.
Preview the data in the Preview Data tab. You should see how Hunk automatically looks for the timestamp in our CDR data. Pay attention to the Time column and the field named time_interval in the Event column. The time_interval column keeps the time of the record; Hunk should automatically use that field as the time field.
Save the source type by clicking on Save As and then Next.
On the Entering Context Settings page, select search in the App context drop-down box. Then, navigate to Sharing context | All apps and click on Next.
The last step allows you to review what we've done. Click on Finish to complete the wizard.
Creating a dashboard
Now it's time to see how dashboards work. Let's find the regions where visitors face problems (status = 500) while using our online store:
index="digital_analytics" status=500 | iplocation clientip | geostats latfield=lat longfield=lon count by Country
You should see a map with the proportion of errors for each country. Now let's save it as a dashboard: click on Save as and select Dashboard panel from the drop-down menu.
Name it Web Operations. You should get a new dashboard with a single panel and our report on it. We have several previously created reports; let's add them to the newly created dashboard as separate panels:
Click on Edit and then Edit panels.
Select Add new panel and then New from report, and add one of our previous reports.
Summary
In this article, you learned how to extract Hunk onto a VM. We also saw how to set up the Hunk variables and configuration files. You learned how to run Hunk and how to set up the data provider and a virtual index for the CDR data. Setting up a connection to Hadoop and a virtual index for the data stored in Hadoop were also covered in detail. Apart from these, you also learned how to create a dashboard.

Protecting Your Bitcoins

Packt
29 Oct 2015
32 min read
In this article by Richard Caetano, the author of the book Learning Bitcoin, we will explore ways to safely hold your own bitcoin. We will cover the following topics:
Storing your bitcoins
Working with brainwallets
Understanding deterministic wallets
Storing bitcoins in cold storage
Good housekeeping with Bitcoin
(For more resources related to this topic, see here.)
Storing your bitcoins
The banking system has a legacy of offering various financial services to its customers. Banks offer convenient ways to spend money, such as cheques and credit cards, but the storage of money is their base service. For many centuries, banks have been a safe place to keep money. Customers rely on the interest paid on their deposits, as well as on government insurance against theft and insolvency. Savings accounts have helped make preserving wealth easy and accessible to a large population in the western world. Yet some people still save a portion of their wealth as cash or precious metals, usually in a personal safe at home or in a safety deposit box. They may be those who have, over the years, experienced or witnessed the downsides of banking: government confiscation, out-of-control inflation, or runs on the bank.
Furthermore, a large part of the world's population does not have access to the western banking system. For those who live in remote areas, or for those without credit, opening a bank account is virtually impossible. They must handle their own money properly to prevent loss or theft. In some places of the world, there can be great risk involved. These groups of people, who have little or no access to banking, are called the "underbanked".
For the underbanked population, Bitcoin offers immediate access to a global financial system. Anyone with access to the Internet, or who carries a mobile phone with the ability to send and receive SMS messages, can hold his or her own bitcoin and make global payments. They can essentially become their own bank. However, you must understand that Bitcoin is still in its infancy as a technology. Similar to the Internet circa 1995, it has demonstrated enormous potential, yet it lacks usability for a mainstream audience. As a parallel, e-mail in its early days was a challenge for most users to set up and use, yet today it's as simple as entering your e-mail address and password on your smartphone. Bitcoin has yet to develop through these stages. Yet, with some simple guidance, we can already start realizing its potential. Let's discuss some general guidelines for understanding how to become your own bank.
Bitcoin savings
In most normal cases, we only keep a small amount of cash in our hand wallets to protect ourselves from theft or accidental loss. Much of our money is kept in checking or savings accounts with easy access to pay our bills. Checking accounts are used to cover our rent, utility bills, and other payments, while our savings accounts hold money for longer-term goals, such as a down payment on a house. It's highly advisable to develop a similar system for managing your Bitcoin money. Both local and online wallets provide a convenient way to access your bitcoins for day-to-day transactions. Yet there is the unlikely risk that one could lose one's Bitcoin wallet due to an accidental computer crash or a faulty backup. With online wallets, we run the risk of the website or the company becoming insolvent, or falling victim to cybercrime.
By developing a reliable system, we can adopt our own personal 'Bitcoin savings' account to hold our funds in long-term storage. Usually, these savings are kept offline to protect them from any kind of computer hacking. With protected access to our offline storage, we can periodically transfer money to and from our savings. Thus, we can arrange our Bitcoin funds much as we manage our money with our hand wallets and checking/savings accounts.
Paper wallets
As explained, a private key is a large random number that acts as the key to spend your bitcoins. A cryptographic algorithm is used to generate a private key and, from it, a public address. We can share the public address to receive bitcoins and, with the private key, spend the funds sent to the address. Generally, we rely on our Bitcoin wallet software to handle the creation and management of our private keys and public addresses. As these keys are stored on our computers and networks, they are vulnerable to hacking, hardware failures, and accidental loss. Private keys and public addresses are, in fact, just strings of letters and numbers. This format makes it easy to move the keys offline for physical storage. Keys printed on paper are called a "paper wallet" and are highly portable and convenient to store in a physical safe or a bank safety deposit box. With the private key generated and stored offline, we can safely send bitcoin to its public address.
A paper wallet must include at least one private key and its computed public address. Additionally, the paper wallet can include a QR code to make it convenient to retrieve the key and address. Figure 3.1 is an example of a paper wallet generated by Coinbase:
Figure 3.1 - Paper wallet generated from Coinbase
The paper wallet includes both the public address (labeled Public key) and the private key, both with QR codes to easily transfer them back to your online wallet. Also included on the paper wallet is a place for notes. This type of wallet is easy to print for safe storage. It is recommended that copies are stored securely in multiple locations in case the paper is destroyed.
As the private key is shown in plain text, anyone who has access to this wallet has access to the funds. Do not store your paper wallet on your computer. Loss of the paper wallet due to hardware failure, hacking, spyware, or accidental loss can result in the complete loss of your bitcoin. Make sure you have multiple copies of your wallet printed and securely stored before transferring your money.
One-time use paper wallets
Transactions from bitcoin addresses must include the full amount. When sending a partial amount to a recipient, the remaining balance must be sent to a change address. Paper wallets that include only one private key are considered to be "one-time use" paper wallets. While you can always send multiple transfers of bitcoin to the wallet, it is highly recommended that you spend the coins only once. Therefore, you shouldn't move a large number of bitcoins to the wallet expecting to spend only a partial amount. With this in mind, when using one-time use paper wallets, it's recommended that you only save a usable amount to each wallet. This amount could be a block of coins that you'd like to fully redeem to your online wallet.
Creating a paper wallet
To create a paper wallet in Coinbase, simply log in with your username and password. Click on the Tools link in the left-hand side menu. Next, click on the Paper Wallets link in the top menu.
Coinbase will prompt you with Generate a paper wallet and Import a paper wallet. Follow the links to generate a paper wallet. You can expect to see the paper wallet rendered as shown in the following figure 3.2:
Figure 3.2 - Creating a paper wallet with Coinbase
Coinbase generates your paper wallet entirely in your browser, without sending the private key back to its server. This is important to protect your private key from exposure to the network.
You are generating the only copy of your private key. Make sure that you print and securely store multiple copies of your paper wallet before transferring any money to it. Loss of your wallet and private key will result in the loss of your bitcoin.
By clicking the Regenerate button, you can generate multiple paper wallets and store various amounts of bitcoin on each wallet. Each wallet is easily redeemable in full at Coinbase or with other bitcoin wallet services.
Verifying your wallet's balance
After generating and printing multiple copies of your paper wallet, you're ready to transfer your funds. Coinbase will prompt you with an easy option to transfer the funds from your Coinbase wallet to your paper wallet:
Figure 3.3 - Transferring funds to your paper wallet
Figure 3.3 shows Coinbase's prompt to transfer your funds. It provides options to enter your amount in BTC or USD. Simply specify your amount and click Send. Note that Coinbase only keeps a copy of your public address. You can continue to send additional amounts to your paper wallet using the same public address. The first time you work with paper wallets, it's advisable to send only small amounts of bitcoin, to learn and experiment with the process. Once you feel comfortable with creating and redeeming paper wallets, you can feel secure transferring larger amounts.
To verify that the funds have been moved to your paper wallet, we can use a blockchain explorer to check that the funds have been confirmed by the network. Blockchain explorers make all the transaction data from the Bitcoin network available for public review. We'll use a service called Blockchain.info to verify our paper wallet. Simply open www.blockchain.info in your browser and enter the public address from your paper wallet in the search box. If found, Blockchain.info will display a list of the transaction activity for that address:
Figure 3.4 - Blockchain.info showing transaction activity
Shown in figure 3.4 is the transaction activity for the address starting with 16p9Lt. You can quickly see the total bitcoin received and the current balance. Under the Transactions section, you can find the details of the transactions recorded by the network. Also listed are the public addresses that were combined by the wallet software, as well as the change address used to complete the transfer. Note that at least six confirmations are required before the transaction is considered confirmed.
Importing versus sweeping
When importing your private key, the wallet software will simply add the key to its list of private keys. Your bitcoin wallet will manage your list of private keys. When sending money, it will combine the balances from multiple addresses to make the transfer. Any remaining amount will be sent back to a change address, and the wallet software will automatically manage your change addresses. Some Bitcoin wallets offer the ability to sweep your private key. This involves a second step.
After importing your private key, the wallet software will make a transaction to move the full balance of your funds to a new address. This process empties your paper wallet completely. The step to transfer the funds may require additional time to allow the network to confirm your transaction; this could take up to one hour. In addition to the confirmation time, a small miner's fee may be applied. This fee could be in the amount of 0.0001 BTC.

If you are certain that you are the only one with access to the private key, it is safe to use the import feature. However, if you believe someone else may have access to the private key, sweeping is highly recommended. Listed in the following table are some common bitcoin wallets that support importing a private key:

Bitcoin Wallet | Comments | Sweeping
Coinbase (https://www.coinbase.com/) | Provides direct integration between your online wallet and your paper wallet. | No
Electrum (https://electrum.org) | Provides the ability to import and sweep your private key for easy access to your wallet's funds. | Yes
Armory (https://bitcoinarmory.com/) | Provides the ability to import your private key or "sweep" the entire balance. | Yes
Multibit (https://multibit.org/) | Directly imports your private key. It may use a built-in address generator for change addresses. | No

Table 1 - Wallets that support importing private keys

Importing your paper wallet

To import your wallet, simply log into your Coinbase account. Click on Tools from the left-hand side menu, followed by Paper Wallet from the top menu. Then, click on the Import a paper wallet button. You will be prompted to enter the private key of your paper wallet, as shown in figure 3.5:

Figure 3.5 - Coinbase importing from a paper wallet

Simply enter the private key from your paper wallet. Coinbase will validate the key and ask you to confirm your import. If accepted, Coinbase will import your key and sweep your balance. The full amount will be transferred to your bitcoin wallet and become available after six confirmations.

Paper wallet guidelines

Paper wallets display your public and private keys in plain text, so make sure that you keep these documents secure. While you can send funds to your wallet multiple times, it is highly recommended that you spend your balance only once and in full. Before sending large amounts of bitcoin to a paper wallet, test your ability to generate and import a paper wallet with small amounts. When you're comfortable with the process, you can rely on paper wallets for larger amounts.

As paper is easily destroyed or ruined, make sure that you keep multiple copies of your paper wallet in different locations, and make sure each location is secure from unwanted access.

Be careful with online wallet generators: a malicious site operator can obtain the private key from your web browser. Only use trusted paper wallet generators. You can test an online paper wallet generator by opening the page in your browser while online and then disconnecting your computer from the network. You should be able to generate your paper wallet while completely disconnected from the network, ensuring that your private keys are never sent back to the network. Coinbase is an exception in that it only sends the public address back to the server for reference. This public address is saved to make it easy to transfer funds to your paper wallet. The private key is never saved by Coinbase when generating a paper wallet.
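To get a sense of what a trusted paper wallet generator produces under the hood, the following Python sketch creates a key pair and derives a mainnet address completely offline. It is a minimal illustration only, not production code: it assumes the third-party ecdsa package and an OpenSSL-backed hashlib that provides ripemd160, it prints the raw private key as hex rather than the WIF format most paper wallets use, and a real generator would also range-check the key against the curve order and rely on hardened randomness.

import os
import hashlib
import ecdsa

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check(payload):
    # Append a 4-byte double-SHA256 checksum, then encode the result in Base58.
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    data = payload + checksum
    num = int.from_bytes(data, "big")
    encoded = ""
    while num > 0:
        num, rem = divmod(num, 58)
        encoded = B58_ALPHABET[rem] + encoded
    # Leading zero bytes are represented as '1' characters.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + encoded

# 1. A random 32-byte private key (real generators also check it is below the curve order).
private_key = os.urandom(32)

# 2. The uncompressed public key on the secp256k1 curve (0x04 prefix plus X and Y).
signing_key = ecdsa.SigningKey.from_string(private_key, curve=ecdsa.SECP256k1)
public_key = b"\x04" + signing_key.get_verifying_key().to_string()

# 3. SHA-256 then RIPEMD-160 of the public key, Base58Check-encoded with the 0x00 version byte.
pubkey_hash = hashlib.new("ripemd160", hashlib.sha256(public_key).digest()).digest()
print("Private key (hex):", private_key.hex())
print("Public address   :", base58check(b"\x00" + pubkey_hash))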
Paper wallet services

In addition to the services mentioned, there are other services that make paper wallets easy to generate and print. Listed next in Table 2 are just a few:

Service | Notes
BitAddress (bitaddress.org) | Offers the ability to generate single wallets, bulk wallets, brainwallets, and more.
Bitcoin Paper Wallet (bitcoinpaperwallet.com) | Offers a nice, stylish design and easy-to-use features. Users can purchase holographic stickers for securing the paper wallets.
Wallet Generator (walletgenerator.net) | Offers printable paper wallets that fold nicely to conceal the private keys.

Table 2 - Services for generating paper wallets and brainwallets

Brainwallets

Storing our private keys offline by using a paper wallet is one way we can protect our coins from attacks on the network. Yet, having a physical copy of our keys is similar to holding a gold bar: it's still vulnerable to theft if the attacker can physically obtain the wallet. One way to protect bitcoins from online or offline theft is to keep the keys recallable from memory. As holding long random private keys in memory is quite difficult, even for the best of minds, we'll have to use another method to generate our private keys.

Creating a brainwallet

A brainwallet is a way to create one or more private keys from a long phrase of random words. From the phrase, called a passphrase, we're able to generate a private key, along with its public address, to store bitcoin. We can create any passphrase we'd like. The longer the phrase and the more random the characters, the more secure it will be. Brainwallet phrases should contain at least 12 words. It is very important that the phrase never comes from anything published, such as a book or a song. Hackers actively search for possible brainwallets by performing brute-force attacks on commonly published phrases. Here is an example of a brainwallet passphrase:

gently continue prepare history bowl shy dog accident forgive strain dirt consume

Note that the phrase is composed of 12 seemingly random words. One could use an easy-to-remember sentence rather than 12 words. Regardless of whether you record your passphrase on paper or memorize it, the idea is to use a passphrase that's easy to recall and type, yet difficult to crack. Don't let this happen to you:

"Just lost 4 BTC out of a hacked brain wallet. The pass phrase was a line from an obscure poem in Afrikaans. Somebody out there has a really comprehensive dictionary attack program running." Reddit Thread (http://redd.it/1ptuf3)

Unfortunately, this user lost their bitcoin because they chose a published line from a poem. Make sure that you choose a passphrase that is composed of multiple components of non-published text. Sadly, despite these warnings, some users still resort to simple phrases that are easy to crack. Simple passwords such as 123456, password1, and iloveyou are still commonly used for e-mail and login accounts, which are routinely cracked. Do not use simple passwords for your brainwallet passphrase. Make sure that you use at least 12 words with additional characters and numbers.

Using the preceding passphrase, we can generate our private key and public address using the many tools available online. We'll use an online service called BitAddress to generate the actual brainwallet from the passphrase. Simply open www.bitaddress.org in your browser. At first, BitAddress will ask you to move your mouse cursor around to collect enough random points to seed its random number generator.
This process could take a minute or two. Once opened, select the option Brain Wallet from the top menu. In the form presented, enter the passphrase and then enter it again to confirm. Click on View to see your private key and public address. For the example shown in figure 3.6, we'll use the preceding passphrase:

Figure 3.6 - BitAddress's brainwallet feature

From the page, you can easily copy and paste the public address and use it for receiving Bitcoin. Later, when you're ready to spend the coins, enter the exact same passphrase to generate the same private key and public address. Referring to our Coinbase example from earlier in the article, we can then import the private key into our wallet.

Increasing brainwallet security

As an early attempt to give people a way to "memorize" their Bitcoin wallet, brainwallets have become a target for hackers. Some users have chosen phrases or sentences from common books as their brainwallet. Unfortunately, hackers with access to large amounts of computing power were able to search for these phrases and crack some brainwallets. To address this, methods have been developed that make brainwallets more secure. One service, called brainwallet.io, executes a time-intensive cryptographic function over the brainwallet phrase to create a seed that is very difficult to crack. It's important to know that the passphrases used with BitAddress are not compatible with brainwallet.io. To use brainwallet.io to generate a more secure brainwallet, open http://brainwallet.io:

Figure 3.7 - brainwallet.io, a more secure brainwallet generator

Brainwallet.io needs a sufficient amount of entropy to generate a private key that is difficult to reproduce. Entropy, in computer science, describes data in terms of its predictability; data with high entropy is difficult to reproduce from known sources. When generating private keys, it's very important to use data with high entropy. For brainwallet keys, we need data with high entropy that is nevertheless easy for us to duplicate. To meet this requirement, brainwallet.io accepts your random passphrase, or it can generate one from a list of random words. Additionally, it can use data from a file of your choice. Either way, the more entropy given, the stronger your passphrase will be. If you specify a passphrase, choosing at least 12 words is recommended.

Next, brainwallet.io prompts you for a salt, available in several forms: login info, personal info, or generic. Salts are used to add additional entropy to the generation of your private key. Their purpose is to prevent standard dictionary attacks against your passphrase. While using brainwallet.io, this information is never sent to the server.

When ready, click the generate button, and the page will begin computing a scrypt function over your passphrase. Scrypt is a cryptographic function that is deliberately expensive to compute; due to the time required for each pass, it makes brute-force attacks very difficult. brainwallet.io makes many thousands of passes to ensure that a strong seed is generated for the private key. Once finished, your new private key and public address, along with their QR codes, will be displayed for easy printing. As an alternative, WarpWallet is also available at https://keybase.io/warp. WarpWallet also computes a private key based on many thousands of scrypt passes over a passphrase and salt combination.
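To get a feel for why key stretching helps, the following Python sketch contrasts a naive brainwallet derivation with a scrypt-stretched one. This is purely illustrative and is not the exact derivation used by BitAddress, brainwallet.io, or WarpWallet; the salt and the scrypt parameters are assumptions chosen for the example, and hashlib.scrypt requires Python 3.6+ built against OpenSSL 1.1 or later.

import hashlib

passphrase = b"gently continue prepare history bowl shy dog accident forgive strain dirt consume"
salt = b"name@example.com"   # hypothetical salt, for example an e-mail address

# Classic brainwallet: a single SHA-256 of the passphrase becomes the private key.
# Cheap to compute, and therefore cheap for an attacker to brute force.
naive_key = hashlib.sha256(passphrase).digest()

# Hardened variant: scrypt deliberately burns CPU time and memory on every guess.
stretched_key = hashlib.scrypt(passphrase, salt=salt, n=2**14, r=8, p=1, dklen=32)

print("Naive key    :", naive_key.hex())
print("Stretched key:", stretched_key.hex())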
Remember that brainwallet.io passphrases are not compatible with WarpWallet passphrases.

Deterministic wallets

We have introduced brainwallets that yield one private key and public address. They are designed for one-time use and are practical for holding a fixed amount of bitcoin for a period of time. Yet, if we're making lots of transactions, it would be convenient to be able to generate an unlimited number of public addresses so that we can use them to receive bitcoin from different transactions or to generate change addresses. A Type 1 deterministic wallet is a simple wallet scheme based on a passphrase with an index appended. By incrementing the index, an unlimited number of addresses can be created. Each new address is indexed so that its private key can be quickly retrieved.

Creating a deterministic wallet

To create a deterministic wallet, simply choose a strong passphrase, as previously described, and then append a number to represent an individual private key and public address. It's practical to do this with a spreadsheet so that you can keep a list of public addresses on file. Then, when you want to spend the bitcoin, you simply regenerate the private key using the index. Let's walk through an example. First, we choose the passphrase:

"dress retreat save scratch decide simple army piece scent ocean hand become"

Then, we append an index, a sequential number, to the passphrase:

"dress retreat save scratch decide simple army piece scent ocean hand become0"
"dress retreat save scratch decide simple army piece scent ocean hand become1"
"dress retreat save scratch decide simple army piece scent ocean hand become2"
"dress retreat save scratch decide simple army piece scent ocean hand become3"
"dress retreat save scratch decide simple army piece scent ocean hand become4"

Then, we take each passphrase, with the corresponding index, and run it through brainwallet.io, or any other brainwallet service, to generate the public address. Using a table or a spreadsheet, we can pre-generate a list of public addresses to receive bitcoin. Additionally, we can add a balance column to help track our money:

Index | Public Address | Balance
0 | 1Bc2WZ2tiodYwYZCXRRrvzivKmrGKg2Ub9 | 0.00
1 | 1PXRtWnNYTXKQqgcxPDpXEvEDpkPKvKB82 | 0.00
2 | 1KdRGNADn7ipGdKb8VNcsk4exrHZZ7FuF2 | 0.00
3 | 1DNfd491t3ABLzFkYNRv8BWh8suJC9k6n2 | 0.00
4 | 17pZHju3KL4vVd2KRDDcoRdCs2RjyahXwt | 0.00

Table 3 - Using a spreadsheet to track deterministic wallet addresses

Spending from a deterministic wallet

When we have money available in our wallet to spend, we can simply regenerate the private key for the index matching the public address. For example, let's say we have received 2 BTC on the address starting with 1KdRGN in the preceding table. Since we know it belongs to index #2, we can reopen the brainwallet from the passphrase:

"dress retreat save scratch decide simple army piece scent ocean hand become2"

Using brainwallet.io as our brainwallet service, we quickly regenerate the original private key and public address:

Figure 3.8 - Private key re-generated from a deterministic wallet

Finally, we import the private key into our Bitcoin wallet, as described earlier in the article. If we don't want to keep the change in our online wallet, we can simply send the change back to the next available public address in our deterministic wallet.
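To make the index-append scheme concrete, here is a minimal Python sketch that derives one private key per index by hashing the passphrase with the index appended. The single SHA-256 used here is only a stand-in: brainwallet.io and similar services apply their own, stronger derivations, so these keys will not match the addresses listed in Table 3.

import hashlib

passphrase = "dress retreat save scratch decide simple army piece scent ocean hand become"

def derive_private_key(passphrase, index):
    # Append the index to the passphrase and hash the result into a 32-byte key.
    return hashlib.sha256((passphrase + str(index)).encode("utf-8")).digest()

# Pre-generate keys 0..4; in practice you would derive the matching public
# addresses and track them in a spreadsheet, as in Table 3.
for index in range(5):
    print(index, derive_private_key(passphrase, index).hex())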
Pre-generating public addresses with deterministic wallets can be useful in many situations. Perhaps you want to do business with a partner and want to receive 12 payments over the course of one year. You can simply regenerate the 12 addresses and keep track of each payment using a spreadsheet. Another example could apply to an e-commerce site. If you'd like to receive payment for the goods or services being sold, you can pre-generate a long list of addresses. Storing only the public addresses on your website protects you from a malicious attack on your web server. While Type 1 deterministic wallets are very useful, we'll introduce a more advanced version called the Type 2 hierarchical deterministic wallet next.

Type 2 Hierarchical Deterministic wallets

Type 2 Hierarchical Deterministic (HD) wallets function similarly to Type 1 deterministic wallets, as they are able to generate an unlimited number of private keys from a single passphrase, but they offer more advanced features. HD wallets are used by desktop, mobile, and hardware wallets as a way of securing an unlimited number of keys with a single passphrase.

HD wallets are secured by a root seed. The root seed, generated from entropy, can be a number up to 64 bytes long. To make the root seed easier to save and recover, a phrase consisting of a list of mnemonic code words is rendered. The following is an example of a root seed:

01bd4085622ab35e0cd934adbdcce6ca

To render the mnemonic code words, the root seed number is combined with its checksum and then divided into groups of 11 bits. Each group of bits represents an index between 0 and 2047. The index is then mapped to a list of 2,048 words. One word is listed for each group of bits, which for this seed generates the following phrase:

essence forehead possess embarrass giggle spirit further understand fade appreciate angel suffocate

BIP-0039 details the specifications for creating mnemonic code words to generate a deterministic key, and is available at https://en.bitcoin.it/wiki/BIP_0039.

In the HD wallet, the root seed is used to generate a master private key and a master chain code. The master private key is used to generate a master public key, as with normal Bitcoin private keys and public keys. These keys are then used to generate additional child keys in a tree-like structure. Figure 3.9 illustrates the process of creating the master keys and chain code from a root seed:

Figure 3.9 - Generating an HD Wallet's root seed, code words, and master keys

Using a child key derivation function, child keys can be generated from the master or parent keys. An index is then combined with the keys and the chain code to generate and organize parent/child relationships. From each parent, around two billion child keys can be created, and from each child's private key, the public key and public address can be created. In addition to generating a private key and a public address, each child can be used as a parent to generate its own list of child keys. This allows the derived keys to be organized in a tree-like structure; hierarchically, an unlimited number of keys can be created in this way.

Figure 3.10 - The relationship between master seed, parent/child chains, and public addresses

HD wallets are very practical, as thousands of keys and public addresses can be managed from one seed, and the entire tree of keys can be backed up and restored simply by the passphrase. HD wallets can also be organized and shared in various useful ways. For example, in a company or organization, a parent key and chain code could be issued to generate a list of keys for each department. Each department would then have the ability to render its own set of private/public keys.
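Before looking at more ways of sharing the keys in the tree, the following rough Python sketch shows the mnemonic encoding step described above: entropy plus checksum bits are split into 11-bit groups, and each group indexes a word list. The word list here is a tiny placeholder; BIP-0039 defines a fixed 2,048-word list, so this fragment illustrates the mechanics but will not reproduce the real phrase shown earlier.

import hashlib

entropy = bytes.fromhex("01bd4085622ab35e0cd934adbdcce6ca")    # 16 bytes = 128 bits

# Checksum: the first (entropy bits / 32) bits of SHA-256 of the entropy (4 bits here).
checksum_bits = len(entropy) * 8 // 32
digest = hashlib.sha256(entropy).digest()

bits = bin(int.from_bytes(entropy, "big"))[2:].zfill(len(entropy) * 8)
bits += bin(digest[0] >> (8 - checksum_bits))[2:].zfill(checksum_bits)

# Split into 11-bit groups; each group is an index between 0 and 2047.
indexes = [int(bits[i:i + 11], 2) for i in range(0, len(bits), 11)]

placeholder_words = ["word%04d" % i for i in range(2048)]       # stand-in word list
print([placeholder_words[i] for i in indexes])                  # 12 "words"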
As another way of sharing keys, a public parent key can be given out to generate child public keys, but not the private keys. This can be useful in the example of an audit: the organization may want the auditor to perform a balance sheet on a set of public keys, but without access to the private keys for spending. Another use case for generating public keys from a parent public key is e-commerce. As in the example mentioned previously, you may have a website and would like to generate an unlimited number of public addresses. By generating a public parent key for the website, the shopping cart can create new public addresses in real time.

HD wallets are very useful for Bitcoin wallet applications. Next, we'll look at a software package called Electrum for setting up an HD wallet to protect your bitcoins.

Installing an HD wallet

HD wallets are very convenient and practical. To show how we can manage an unlimited number of addresses with a single passphrase, we'll install an HD wallet software package called Electrum. Electrum is an easy-to-use desktop wallet that runs on Windows, OS X, and Linux. It implements a secure HD wallet that is protected by a 12-word passphrase. It is able to synchronize with the blockchain, using servers that index all the Bitcoin transactions, to provide quick updates to your balances.

Electrum has some nice features to help protect your bitcoins. It supports multi-signature transactions, that is, transactions that require more than one key to spend coins. Multi-signature transactions are useful when you want to share the responsibility for a Bitcoin address between two or more parties, or to add an extra layer of protection to your bitcoins. Additionally, Electrum has the ability to create a watching-only version of your wallet. This allows you to give another party access to your public keys without releasing the private keys, which can be very useful for auditing or accounting purposes.

To install Electrum, simply open the URL https://electrum.org/#download and follow the instructions for your operating system. On first installation, Electrum will create a new wallet for you, identified by a passphrase. Make sure that you protect this passphrase offline!

Figure 3.11 - Recording the passphrase from an Electrum wallet

Electrum will proceed by asking you to re-enter the passphrase to confirm that you have it recorded. Finally, it will ask you for a password. This password is used to encrypt your wallet's seed and any private keys imported into your wallet on disk. You will need this password any time you send bitcoins from your account.

Bitcoins in cold storage

If you are responsible for a large amount of bitcoin that could be exposed to online hacking or hardware failure, it is important to minimize your risk. A common scheme for minimizing the risk is to split your funds between a hot wallet and cold storage. A hot wallet refers to your online wallet used for everyday deposits and withdrawals. Based on your customers' needs, you keep only the minimum needed to cover daily business in the hot wallet. For example, Coinbase claims to hold approximately five percent of the total bitcoins on deposit in their hot wallet. The remaining amount is stored in cold storage.

Cold storage is an offline wallet for bitcoin. Addresses are generated, typically from a deterministic wallet, with their passphrase and private keys stored offline. Periodically, depending on day-to-day needs, bitcoins are transferred to and from the cold storage. Additionally, bitcoins may be moved to deep cold storage.
These bitcoins are generally more difficult to retrieve. While a cold storage transfer may easily be done to cover the needs of the hot wallet, a deep cold storage scheme may involve physically retrieving the passphrase or private keys from a safe, a safety deposit box, or a bank vault. The reasoning is to slow down access as much as possible.

Cold storage with Electrum

We can use Electrum to create a hot wallet and a cold storage wallet. To exemplify, let's imagine a business owner who wants to accept bitcoin from his PC cash register. For security reasons, he may want to allow the generation of new addresses to receive bitcoin, but not access to spending it. Spending bitcoins from this wallet will be secured by a protected computer.

To start, create a normal Electrum wallet on the protected computer. Secure the passphrase and assign a strong password to the wallet. Then, from the menu, select Wallet | Master Public Keys. The key will be displayed as shown in figure 3.12. Copy this number and save it to a USB key.

Figure 3.12 - Your Electrum wallet's public master key

Your master public key can be used to generate new public keys, but without access to the private keys. As mentioned in the previous examples, this has many practical uses, as in our example with the cash register. Next, from your cash register, install Electrum. On setup, or from File | New/Restore, choose Restore a wallet or import keys and the Standard wallet type:

Figure 3.13 - Setting up a cash register wallet with Electrum

On the next screen, Electrum will prompt you to enter your public master key. Once accepted, Electrum will generate your wallet from the public master key. When finished, your new wallet will be ready to accept bitcoin without access to the private keys.

WARNING: If you import private keys into your Electrum wallet, they cannot be restored from your passphrase or public master key. They have not been generated by the root seed and exist independently in the wallet. If you import private keys, make sure to back up the wallet file after every import.

Verifying access to a private key

When working with public addresses, it may be important to prove that you have access to a private key. By using Bitcoin's cryptographic ability to sign a message, you can verify that you have access to the key without revealing it. This can be offered as proof from a trustee that they control the keys. Using Electrum's built-in message signing feature, we can use the private key in our wallet to sign a message. The message, combined with the digital signature and public address, can later be used to verify that it was signed with the original private key.

To begin, choose an address from your wallet. In Electrum, your addresses can be found under the Addresses tab. Next, right-click on an address and choose Sign/verify Message. A dialog box allowing you to sign a message will appear:

Figure 3.14 - Electrum's Sign/Verify Message features

As shown in figure 3.14, you can enter any message you like and sign it with the private key of the address shown. This process will produce a digital signature that can be shared with others to prove that you have access to the private key. To verify the signature on another computer, simply open Electrum and choose Tools | Sign/Verify Message from the menu. You will be prompted with the same dialog as shown in figure 3.14. Copy and paste the message, the address, and the digital signature, and click Verify. The results will be displayed.
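For readers who want to see the underlying mechanism outside of Electrum, the following Python sketch signs and verifies a message with an ECDSA key on the secp256k1 curve, using the third-party ecdsa package. Note that Electrum and other wallets use Bitcoin's specific signed-message format (a message prefix and a compact, recoverable signature), so this raw example only illustrates the idea and its signatures are not interchangeable with theirs.

import hashlib
import ecdsa

# The signer holds the private key...
signing_key = ecdsa.SigningKey.generate(curve=ecdsa.SECP256k1)
message = b"I control the private key behind this address."
signature = signing_key.sign(message, hashfunc=hashlib.sha256)

# ...and publishes only the message, the signature, and the public key.
verifying_key = signing_key.get_verifying_key()
print(verifying_key.verify(signature, message, hashfunc=hashlib.sha256))   # True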
By requesting a signed message from someone, you can verify that they do, in fact, have control of the private key. This is useful for making sure that the trustee of a cold storage wallet has access to the private keys without releasing or sharing them. Another good use of message signing is to prove that someone has control of some quantity of bitcoin: by signing a message that includes the public address holding the funds, one can see that the party is the owner of the funds. Finally, signing and verifying a message can be useful for testing your backups. You can test your private key and public address completely offline without actually sending bitcoin to the address.

Good housekeeping with Bitcoin

To ensure the safekeeping of your bitcoin, it's important to protect your private keys by following a short list of best practices:

Never store your private keys unencrypted on your hard drive or in the cloud: Unencrypted wallets can easily be stolen by hackers, viruses, or malware. Make sure your keys are always encrypted before being saved to disk.

Never send money to a Bitcoin address without a backup of the private keys: It's really important that you have a backup of your private key before sending money to its public address. There are stories of early adopters who have lost significant amounts of bitcoin because of hardware failures or inadvertent mistakes.

Always test your backup process by repeating the recovery steps: When setting up a backup plan, make sure to test your plan by backing up your keys, sending a small amount to the address, and recovering the amount from the backup. Message signing and verification is also a useful way to test your private key backups offline.

Ensure that you have a secure location for your paper wallets: Unauthorized access to your paper wallets can result in the loss of your bitcoin. Make sure that you keep your wallets in a secure safe, in a bank safety deposit box, or in a vault. It's advisable to keep copies of your wallets in multiple locations.

Keep multiple copies of your paper wallets: Paper can easily be damaged by water or direct sunlight. Make sure that you keep multiple copies of your paper wallets in plastic bags, protected from direct light with a cover.

Consider writing a testament or will for your Bitcoin wallets: The testament should name who has access to the bitcoin and how they will be distributed. Make sure that you include instructions on how to recover the coins.

Never forget your wallet's password or passphrase: This sounds obvious, but it must be emphasized. There is no way to recover a lost password or passphrase.

Always use a strong passphrase: A strong passphrase should meet the following requirements:
It should be long and difficult to guess
It should not be from a famous publication: literature, holy books, and so on
It should not contain personal information
It should be easy to remember and type accurately
It should not be reused between sites and applications

Summary

So far, we've covered the basics of how to get started with Bitcoin. We've provided a tutorial for setting up an online wallet and shown how to buy Bitcoin in 15 minutes. We've covered online exchanges and marketplaces, and how to safely store and protect your bitcoin.

Resources for Article: Further resources on this subject: Bitcoins – Pools and Mining [article] Going Viral [article] Introduction to the Nmap Scripting Engine [article]


Putting Your Database at the Heart of Azure Solutions

Packt
28 Oct 2015
19 min read
In this article by Riccardo Becker, author of the book Learning Azure DocumentDB, we will see how to build a real solution around an Internet of Things scenario. We will build a basic Internet of Things platform that can help you accelerate building your own. In this article, we will cover the following:

Have a look at a fictitious scenario
Learn how to combine Azure components with DocumentDB
Demonstrate how to migrate data to DocumentDB

(For more resources related to this topic, see here.)

Introducing an Internet of Things scenario

Before we start exploring different capabilities to support a real-life scenario, we will briefly explain the scenario we will use throughout this article.

IoT, Inc.

IoT, Inc. is a fictitious start-up company that is planning to build solutions in the Internet of Things domain. The first solution they will build is a registration hub, where IoT devices can be registered. These devices can be diverse, ranging from home automation devices up to devices that control traffic lights and street lights. The main use case for this solution is offering the capability for devices to register themselves against a hub. The hub will be built with DocumentDB as its core component and a Web API to expose this functionality. Before devices can register themselves, they need to be whitelisted in order to prevent malicious devices from registering. In the following screenshot, we see the high-level design of the registration requirement:

The first version of the solution contains the following components:

A Web API containing methods to whitelist, register, unregister, and suspend devices
DocumentDB, containing all the device information, including information regarding other Microsoft Azure resources
Event Hub, a Microsoft Azure asset that enables a scalable publish-subscribe mechanism to ingress and egress millions of events per second
Power BI, Microsoft's online offering to expose reporting capabilities and the ability to share reports

Obviously, we will focus on the core of the solution, which is DocumentDB, but it is nice to touch on some of the Azure components as well, to see how well they cooperate and how easy it is to set up a demonstration for IoT scenarios. The devices on the left-hand side are chosen randomly and will be mimicked by an emulator written in C#. The Web API will expose the functionality required to let devices register themselves at the solution and start sending data afterwards (which will be ingested to the Event Hub and reported using Power BI).

Technical requirements

To be able to service potentially millions of devices, registration requests from devices must be stored in separate collections based on the country where the device is located or manufactured. Every device is modeled in the same way, whereas additional metadata can be provided upon registration or afterwards when updating. To achieve country-based partitioning, we will create a custom PartitionResolver. To extend the basic security model, we reduce the amount of sensitive information in our configuration files. We also enhance searching capabilities, because we want to service multiple types of devices, each with their own metadata and device-specific information; querying on all the information is desired to support full-text search and enable users to quickly search and find their devices.

Designing the model

Every device is modeled in a similar way so that we can service multiple types of devices.
The device model contains at least the deviceid and a location. Furthermore, the device model contains a dictionary where additional device properties can be stored. The next code snippet shows the device model: [JsonProperty("id")]         public string DeviceId { get; set; }         [JsonProperty("location")]         public Point Location { get; set; }         //practically store any metadata information for this device         [JsonProperty("metadata")]         public IDictionary<string, object> MetaData { get; set; } The Location property is of type Microsoft.Azure.Documents.Spatial.Point because we want to run spatial queries later on in this section, for example, getting all the devices within 10 kilometers of a building. Building a custom partition resolver To meet the first technical requirement (partition data based on the country), we need to build a custom partition resolver. To be able to build one, we need to implement the IPartitionResolver interface and add some logic. The resolver will take the Location property of the device model and retrieves the country that corresponds with the latitude and longitude provided upon registration. In the following code snippet, you see the full implementation of the GeographyPartitionResolver class: public class GeographyPartitionResolver : IPartitionResolver     {         private readonly DocumentClient _client;         private readonly BingMapsHelper _helper;         private readonly Database _database;           public GeographyPartitionResolver(DocumentClient client, Database database)         {             _client = client;             _database = database;             _helper = new BingMapsHelper();         }         public object GetPartitionKey(object document)         {             //get the country for this document             //document should be of type DeviceModel             if (document.GetType() == typeof(DeviceModel))             {                 //get the Location and translate to country                 var country = _helper.GetCountryByLatitudeLongitude(                     (document as DeviceModel).Location.Position.Latitude,                     (document as DeviceModel).Location.Position.Longitude);                 return country;             }             return String.Empty;         }           public string ResolveForCreate(object partitionKey)         {             //get the country for this partitionkey             //check if there is a collection for the country found             var countryCollection = _client.CreateDocumentCollectionQuery(database.SelfLink).            ToList().Where(cl => cl.Id.Equals(partitionKey.ToString())).FirstOrDefault();             if (null == countryCollection)             {                 countryCollection = new DocumentCollection { Id = partitionKey.ToString() };                 countryCollection =                     _client.CreateDocumentCollectionAsync(_database.SelfLink, countryCollection).Result;             }             return countryCollection.SelfLink;         }           /// <summary>         /// Returns a list of collectionlinks for the designated partitionkey (one per country)         /// </summary>         /// <param name="partitionKey"></param>         /// <returns></returns>         public IEnumerable<string> ResolveForRead(object partitionKey)         {             var countryCollection = _client.CreateDocumentCollectionQuery(_database.SelfLink).             
ToList().Where(cl => cl.Id.Equals(partitionKey.ToString())).FirstOrDefault();               return new List<string>             {                 countryCollection.SelfLink             };         }     } In order to have the DocumentDB client use this custom PartitionResolver, we need to assign it. The code is as follows: GeographyPartitionResolver resolver = new GeographyPartitionResolver(docDbClient, _database);   docDbClient.PartitionResolvers[_database.SelfLink] = resolver; //Adding a typical device and have the resolver sort out what //country is involved and whether or not the collection already //exists (and create a collection for the country if needed), use //the next code snippet. var deviceInAmsterdam = new DeviceModel             {                 DeviceId = Guid.NewGuid().ToString(),                 Location = new Point(4.8951679, 52.3702157)             };   Document modelAmsDocument = docDbClient.CreateDocumentAsync(_database.SelfLink,                 deviceInAmsterdam).Result;             //get all the devices in Amsterdam            var doc = docDbClient.CreateDocumentQuery<DeviceModel>(                 _database.SelfLink, null, resolver.GetPartitionKey(deviceInAmsterdam)); Now that we have created a country-based PartitionResolver, we can start working on the Web API that exposes the registration method. Building the Web API A Web API is an online service that can be used by any clients running any framework that supports the HTTP programming stack. Currently, REST is a way of interacting with APIs so that we will build a REST API. Building a good API should aim for platform independence. A well-designed API should also be able to extend and evolve without affecting existing clients. First, we need to whitelist the devices that should be able to register themselves against our device registry. The whitelist should at least contain a device ID, a unique identifier for a device that is used to match during the whitelisting process. A good candidate for a device ID is the mac address of the device or some random GUID. Registering a device The registration Web API contains a POST method that does the actual registration. First, it creates access to an Event Hub (not explained here) and stores the credentials needed inside the DocumentDB document. The document is then created inside the designated collection (based on the location). To learn more about Event Hubs, please visit https://azure.microsoft.com/en-us/services/event-hubs/.  
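Before looking at the server-side implementation of this POST method below, it may help to see what a device emulator's call to the endpoint could look like. The sketch below uses Python's requests package purely to illustrate the HTTP exchange; the base URL is a placeholder, the payload shape is an assumption based on the DeviceModel JSON properties shown earlier, and the actual emulator in this scenario is written in C#.

import uuid
import requests

# Hypothetical device-side registration call (illustrative only).
device = {
    "id": str(uuid.uuid4()),
    "location": {"type": "Point", "coordinates": [4.8951679, 52.3702157]},
    "metadata": {"deviceType": "street-light"},
}

response = requests.post("https://iotinc-api.example.com/api/registration", json=device)
response.raise_for_status()

# On success, the API returns the stored document, now including the Event Hub
# namespace, name, and SAS token the device needs to start sending telemetry.
print(response.json())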
[Route("api/registration")]         [HttpPost]         public async Task<IHttpActionResult> Post([FromBody]DeviceModel value)         {             //add the device to the designated documentDB collection (based on country)             try             { var serviceUri = ServiceBusEnvironment.CreateServiceUri("sb", serviceBusNamespace,                     String.Format("{0}/publishers/{1}", "telemetry", value.DeviceId))                     .ToString()                     .Trim('/');                 var sasToken = SharedAccessSignatureTokenProvider.GetSharedAccessSignature(EventHubKeyName,                     EventHubKey, serviceUri, TimeSpan.FromDays(365 * 100)); // hundred years will do                 //this token can be used by the device to send telemetry                 //this token and the eventhub name will be saved with the metadata of the document to be saved to DocumentDB                 value.MetaData.Add("Namespace", serviceBusNamespace);                 value.MetaData.Add("EventHubName", "telemetry");                 value.MetaData.Add("EventHubToken", sasToken);                 var document = await docDbClient.CreateDocumentAsync(_database.SelfLink, value);                 return Created(document.ContentLocation, value);            }             catch (Exception ex)             {                 return InternalServerError(ex);             }         } After this registration call, the right credentials on the Event Hub have been created for this specific device. The device is now able to ingress data to the Event Hub and have consumers like Power BI consume the data and present it. Event Hubs is a highly scalable publish-subscribe event ingestor. It can collect millions of events per second so that you can process and analyze the massive amounts of data produced by your connected devices and applications. Once collected into Event Hubs, you can transform and store the data by using any real-time analytics provider or with batching/storage adapters. At the time of writing, Microsoft announced the release of Azure IoT Suite and IoT Hubs. These solutions offer internet of things capabilities as a service and are well-suited to build our scenario as well. Increasing searching We have seen how to query our documents and retrieve the information we need. For this approach, we need to understand the DocumentDB SQL language. Microsoft has an online offering that enables full-text search called Azure Search service. This feature enables us to perform full-text searches and it also includes search behaviours similar to search engines. We could also benefit from so called type-ahead query suggestions based on the input of a user. Imagine a search box on our IoT Inc. portal that offers free text searching while the user types and search for devices that include any of the search terms on the fly. Azure Search runs on Azure; therefore, it is scalable and can easily be upgraded to offer more search and storage capacity. Azure Search stores all your data inside an index, offering full-text search capabilities on your data. Setting up Azure Search Setting up Azure Search is pretty straightforward and can be done by using the REST API it offers or on the Azure portal. We will set up the Azure Search service through the portal and later on, we will utilize the REST API to start configuring our search service. We set up the Azure Search service through the Azure portal (http://portal.azure.com). Find the Search service and fill out some information. 
In the following screenshot, we can see how we have created the free tier for Azure Search: You can see that we use the Free tier for this scenario and that there are no datasources configured yet. We will do that know by using the REST API. We will use the REST API, since it offers more insight on how the whole concept works. We use Fiddler to create a new datasource inside our search environment. The following screenshot shows how to use Fiddler to create a datasource and add a DocumentDB collection: In the Composer window of Fiddler, you can see we need to POST a payload to the Search service we created earlier. The Api-Key is mandatory and also set the content type to be JSON. Inside the body of the request, the connection information to our DocumentDB environment is need and the collection we want to add (in this case, Netherlands). Now that we have added the collection, it is time to create an Azure Search index. Again, we use Fiddler for this purpose. Since we use the free tier of Azure Search, we can only add five indexes at most. For this scenario, we add an index on ID (device ID), location, and metadata. At the time of writing, Azure Search does not support complex types. Note that the metadata node is represented as a collection of strings. We could check in the portal to see if the creation of the index was successful. Go to the Search blade and select the Search service we have just created. You can check the indexes part to see whether the index was actually created. The next step is creating an indexer. An indexer connects the index with the provided data source. Creating this indexer takes some time. You can check in the portal if the indexing process was successful. We actually find that documents are part of the index now. If your indexer needs to process thousands of documents, it might take some time for the indexing process to finish. You can check the progress of the indexer using the REST API again. https://iotinc.search.windows.net/indexers/deviceindexer/status?api-version=2015-02-28 Using this REST call returns the result of the indexing process and indicates if it is still running and also shows if there are any errors. Errors could be caused by documents that do not have the id property available. The final step involves testing to check whether the indexing works. We will search for a device ID, as shown in the next screenshot: In the Inspector tab, we can check for the results. It actually returns the correct document also containing the location field. The metadata is missing because complex JSON is not supported (yet) at the time of writing. Indexing complex JSON types is not supported yet. It is possible to add SQL queries to the data source. We could explicitly add a SELECT statement to surface the properties of the complex JSON we have like metadata or the Point property. Try adding additional queries to your data source to enable querying complex JSON types. Now that we have created an Azure Search service that indexes our DocumentDB collection(s), we can build a nice query-as-you-type field on our portal. Try this yourself. Enhancing security Microsoft Azure offers a capability to move your secrets away from your application towards Azure Key Vault. Azure Key Vault helps to protect cryptographic keys, secrets, and other information you want to store in a safe place outside your application boundaries (connectionstring are also good candidates). Key Vault can help us to protect the DocumentDB URI and its key. 
DocumentDB has no (in-place) encryption feature at the time of writing, although a lot of people have already asked for it to be added to the roadmap.

Creating and configuring Key Vault

Before we can use Key Vault, we need to create and configure it first. The easiest way to achieve this is by using PowerShell cmdlets. Please visit https://msdn.microsoft.com/en-us/mt173057.aspx to read more about PowerShell. The following PowerShell cmdlets demonstrate how to set up and configure a Key Vault:

Command | Description
Get-AzureSubscription | This command will prompt you to log in using your Microsoft Account. It returns a list of all Azure subscriptions that are available to you.
Select-AzureSubscription -SubscriptionName "Windows Azure MSDN Premium" | This tells PowerShell to use this subscription for our next steps.
Switch-AzureMode AzureResourceManager; New-AzureResourceGroup –Name 'IoTIncResourceGroup' –Location 'West Europe' | This switches PowerShell to the Azure Resource Manager mode and creates a new Azure resource group with a name and a location.
New-AzureKeyVault -VaultName 'IoTIncKeyVault' -ResourceGroupName 'IoTIncResourceGroup' -Location 'West Europe' | This creates a new Key Vault inside the resource group and provides a name and location.
$secretvalue = ConvertTo-SecureString '<DOCUMENTDB KEY>' -AsPlainText –Force | This creates a secure string for my DocumentDB key.
$secret = Set-AzureKeyVaultSecret -VaultName 'IoTIncKeyVault' -Name 'DocumentDBKey' -SecretValue $secretvalue | This creates a secret named DocumentDBKey in the vault and assigns it the secret value we have just created.
Set-AzureKeyVaultAccessPolicy -VaultName 'IoTIncKeyVault' -ServicePrincipalName <SPN> -PermissionsToKeys decrypt,sign | This configures the application with the Service Principal Name <SPN> to get the appropriate rights to decrypt and sign.
Set-AzureKeyVaultAccessPolicy -VaultName 'IoTIncKeyVault' -ServicePrincipalName <SPN> -PermissionsToSecrets Get | This configures the application with the SPN to also be able to get secrets.

Key Vault must be used together with Azure Active Directory. The SPN we need in the PowerShell steps above is actually the client ID of an application I have set up in my Azure Active Directory. Please visit https://azure.microsoft.com/nl-nl/documentation/articles/active-directory-integrating-applications/ to see how you can create an application. Make sure to copy the client ID (which is retrievable afterwards) and the key (which is not retrievable afterwards). We use these two pieces of information to take the next step.

Using Key Vault from ASP.NET

In order to use the Key Vault we have created in the previous section, we need to install some NuGet packages into our solution and/or projects:

Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory -Version 2.16.204221202
Install-Package Microsoft.Azure.KeyVault

These two packages enable us to use AD and Key Vault from our ASP.NET application. The next step is to add some configuration information to our web.config file:

<add key="ClientId" value="<CLIENT ID OF THE APP CREATED IN AD>" />
<add key="ClientSecret" value="<THE SECRET FROM AZURE AD PORTAL>" />
<!-- SecretUri is the URI for the secret in Azure Key Vault -->
<add key="SecretUri" value="https://iotinckeyvault.vault.azure.net:443/secrets/DocumentDBKey" />

If you deploy the ASP.NET application to Azure, you could even configure these settings from the Azure portal itself, completely removing them from the web.config file. This technique adds an additional ring of security around your application.
The following code snippet shows how to use AD and Key Vault inside the registration functionality of our scenario: //no more keys in code or .config files. Just a appid, secret and the unique URL to our key (SecretUri). When deploying to Azure we could             //even skip this by setting appid and clientsecret in the Azure Portal.             var kv = new KeyVaultClient(new KeyVaultClient.AuthenticationCallback(Utils.GetToken));             var sec = kv.GetSecretAsync(WebConfigurationManager.AppSettings["SecretUri"]).Result.Value; The Utils.GetToken method is shown next. This method retrieves an access token from AD by supplying the ClientId and the secret. Since we configured Key Vault to allow this application to get the keys, the call to GetSecretAsync() will succeed. The code is as follows: public async static Task<string> GetToken(string authority, string resource, string scope)         {             var authContext = new AuthenticationContext(authority);             ClientCredential clientCred = new ClientCredential(WebConfigurationManager.AppSettings["ClientId"],                         WebConfigurationManager.AppSettings["ClientSecret"]);             AuthenticationResult result = await authContext.AcquireTokenAsync(resource, clientCred);               if (result == null)                 throw new InvalidOperationException("Failed to obtain the JWT token");             return result.AccessToken;         } Instead of storing the key to DocumentDB somewhere in code or in the web.config file, it is now moved away to Key Vault. We could do the same with the URI to our DocumentDB and with other sensitive information as well (for example, storage account keys or connection strings). Encrypting sensitive data The documents we created in the previous section contains sensitive data like namespaces, Event Hub names, and tokens. We could also use Key Vault to encrypt those specific values to enhance our security. In case someone gets hold of a document containing the device information, he is still unable to mimic this device since the keys are encrypted. Try to use Key Vault to encrypt the sensitive information that is stored in DocumentDB before it is saved in there. Migrating data This section discusses how to use a tool to migrate data from an existing data source to DocumentDB. For this scenario, we assume that we already have a large datastore containing existing devices and their registration information (Event Hub connection information). In this section, we will see how to migrate an existing data store to our new DocumentDB environment. We use the DocumentDB Data Migration Tool for this. You can download this tool from the Microsoft Download Center (http://www.microsoft.com/en-us/download/details.aspx?id=46436) or from GitHub if you want to check the code. The tool is intuitive and enables us to migrate from several datasources: JSON files MongoDB SQL Server CSV files Azure Table storage Amazon DynamoDB HBase DocumentDB collections To demonstrate the use, we migrate our existing Netherlands collection to our United Kingdom collection. Start the tool and enter the right connection string to our DocumentDB database. We do this for both our source and target information in the tool. The connection strings you need to provide should look like this: AccountEndpoint=https://<YOURDOCDBURL>;AccountKey=<ACCOUNTKEY>;Database=<NAMEOFDATABASE>. You can click on the Verify button to make sure these are correct. 
In the Source Information field, we provide the Netherlands collection as the source to pull data from. In the Target Information field, we specify the United Kingdom collection as the target. In the following screenshot, you can see how these settings are provided in the migration tool for the source information:

The following screenshot shows the settings for the target information:

It is also possible to migrate data to a collection that has not been created yet; the migration tool will create it if you enter a collection name that does not exist inside your database. You also need to select the pricing tier. Optionally, setting the partition key could help to distribute your documents based on this key across all the collections you add in this screen. This information is sufficient to run our example. Go to the Summary tab and verify the information you entered. Press Import to start the migration process. We can verify a successful import on the Import results pane.

This example is a simple migration scenario, but the tool is also capable of using complex queries to migrate only those documents that need to be moved. Try migrating data from an Azure Table storage table to DocumentDB by using this tool.

Summary

In this article, we saw how to integrate DocumentDB with other Microsoft Azure features. We discussed how to set up the Azure Search service and how to create an index for our collection. We also covered how to use the Azure Search feature to enable full-text search on our documents, which lets users query while typing. Next, we saw how to add additional security to our scenario by using Key Vault. We discussed how to create and configure Key Vault by using PowerShell cmdlets, and we saw how to enable our ASP.NET scenario application to make use of the Key Vault .NET SDK. Then, we discussed how to retrieve the sensitive information from Key Vault instead of from configuration files. Finally, we saw how to migrate an existing data source to our collection by using the DocumentDB Data Migration Tool.

Resources for Article: Further resources on this subject: Microsoft Azure – Developing Web API For Mobile Apps [article] Introduction To Microsoft Azure Cloud Services [article] Security In Microsoft Azure [article]


Rotation Forest - A Classifier Ensemble Based on Feature Extraction

Packt
28 Oct 2015
16 min read
In this article by Gopi Subramanian, author of the book Python Data Science Cookbook, you will learn about bagging methods based on decision tree algorithms, which are very popular in the data science community.

Rotation Forest

The claim to fame of most of these methods is that they need zero data preparation as compared to other methods, can obtain very good results, and can be provided as a black box of tools in the hands of software engineers. By design, bagging lends itself nicely to parallelization. Hence, these methods can be easily applied on a very large dataset in a cluster environment.

The decision tree algorithms split the input data into various regions at each level of the tree. Thus, they perform implicit feature selection. Feature selection is one of the most important tasks in building a good model. By providing implicit feature selection, trees are at an advantageous position compared to other techniques. Hence, bagging with decision trees comes with this advantage. Almost no data preparation is needed for decision trees. For example, consider the scaling of attributes. Attribute scaling has no impact on the structure of the trees. Missing values also do not affect decision trees, and the effect of outliers on a decision tree is very minimal. We don't have to do explicit feature transformations to accommodate feature interactions.

One of the major complaints against tree-based methods is the difficulty with pruning the trees to avoid overfitting. Big trees tend to also fit the noise present in the underlying data and hence lead to low bias and high variance. However, when we grow a lot of trees and the final prediction is an average of the output of all the trees in the ensemble, we avoid these problems.

In this article, we will see a powerful tree-based ensemble method called rotation forest. A typical random forest requires a large number of trees to be a part of its ensemble in order to achieve good performance. Rotation forest can achieve similar or better performance with a smaller number of trees. Additionally, the authors of this algorithm claim that the underlying estimator can be anything other than a tree. This way, it is projected as a new framework to build an ensemble similar to gradient boosting.

(For more resources related to this topic, see here.)

The algorithm

Random forest and bagging give impressive results with very large ensembles; having a large number of estimators improves the accuracy of these methods. On the contrary, rotation forest is designed to work with a smaller ensemble. Let's write down the steps involved in building a rotation forest. The number of trees required in the forest is typically specified by the user. Let T be the number of trees required to be built. We start with iterating from one through T, that is, we build T trees.

For each tree T, perform the following steps:

Split the attributes in the training set into K nonoverlapping subsets of equal size, so that we have K subsets, each holding an equal share of the attributes.
For each of the K subsets, we proceed to do the following:
Bootstrap 75% of the data from the subset and use the bootstrapped sample for further steps.
Run a principal component analysis on the ith subset. Retain all the principal components. For every feature j in the ith subset, we have a principal component, a. Let's denote it as aij, where it's the principal component for the jth attribute in the ith subset.
Store the principal components for the subset.
Create a rotation matrix of size n x n, where n is the total number of attributes. Arrange the principal components in this matrix so that each component sits at the position of its feature in the original training dataset.
Project the training dataset onto the rotation matrix using matrix multiplication.
Build a decision tree with the projected dataset.
Store the tree and the rotation matrix.

A quick note about PCA: PCA is an unsupervised method. In multivariate problems, PCA is used to reduce the dimension of the data with minimal information loss or, in other words, to retain the maximum variation in the data. In PCA, we find the directions in the data with the most variation, that is, the eigenvectors corresponding to the largest eigenvalues of the covariance matrix, and project the data onto these directions. With a dataset of size n x m, with n instances and m dimensions, PCA projects it onto a smaller subspace of size n x d, where d << m. A point to note is that PCA is computationally very expensive.
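To make this concrete, here is a minimal, self-contained illustration of what scikit-learn's PCA gives us. This sketch is ours, not part of the recipe; the toy data and its sizes are made up purely for demonstration:

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 instances, 5 attributes (sizes chosen only for illustration)
rng = np.random.RandomState(7)
x = rng.rand(100, 5)

pca = PCA()            # keep all components, as the rotation forest algorithm requires
pca.fit(x)

print(pca.components_.shape)           # (5, 5): one row per component, one column per feature
print(pca.explained_variance_ratio_)   # share of the variation captured by each direction
x_projected = pca.transform(x)         # the data expressed along the principal directions

The components_ matrix (components x features) is exactly what the rotation forest algorithm stores for each subset of attributes.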
Programming rotation forest in Python

Now let's write Python code to implement rotation forest. We will proceed to test it on a classification problem, generating a synthetic classification dataset to demonstrate the method. To our knowledge, there is no Python implementation available for rotation forest, so we will leverage scikit-learn's implementation of the decision tree classifier and use the train_test_split method for the bootstrapping. Refer to the following link to learn more about scikit-learn: http://scikit-learn.org/stable/

First, we will write the necessary code to implement rotation forest and apply it to a classification problem. We will start by loading all the necessary libraries. We will leverage the make_classification method from the sklearn.datasets module to generate the training data, and follow it with a method to select a random subset of attributes, called get_random_subset:

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
import numpy as np


def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable y, dependent variable x
    """
    no_features = 30
    redundant_features = int(0.1 * no_features)
    informative_features = int(0.6 * no_features)
    repeated_features = int(0.1 * no_features)
    x, y = make_classification(n_samples=500, n_features=no_features, flip_y=0.03,
                               n_informative=informative_features, n_redundant=redundant_features,
                               n_repeated=repeated_features, random_state=7)
    return x, y


def get_random_subset(iterable, k):
    subsets = []
    iteration = 0
    np.random.shuffle(iterable)
    subset = 0
    limit = len(iterable) / k
    while iteration < limit:
        if k <= len(iterable):
            subset = k
        else:
            subset = len(iterable)
        subsets.append(iterable[-subset:])
        del iterable[-subset:]
        iteration += 1
    return subsets

We will write the build_rotationtree_model function, where we will build fully grown trees, and proceed to evaluate the forest's performance using the model_worth function:

def build_rotationtree_model(x_train, y_train, d, k):
    models = []
    r_matrices = []
    feature_subsets = []
    for i in range(d):
        x, _, _, _ = train_test_split(x_train, y_train, test_size=0.3, random_state=7)
        # Features ids
        feature_index = range(x.shape[1])
        # Get subsets of features
        random_k_subset = get_random_subset(feature_index, k)
        feature_subsets.append(random_k_subset)
        # Rotation matrix
        R_matrix = np.zeros((x.shape[1], x.shape[1]), dtype=float)
        for each_subset in random_k_subset:
            pca = PCA()
            x_subset = x[:, each_subset]
            pca.fit(x_subset)
            for ii in range(0, len(pca.components_)):
                for jj in range(0, len(pca.components_)):
                    R_matrix[each_subset[ii], each_subset[jj]] = pca.components_[ii, jj]
        x_transformed = x_train.dot(R_matrix)
        model = DecisionTreeClassifier()
        model.fit(x_transformed, y_train)
        models.append(model)
        r_matrices.append(R_matrix)
    return models, r_matrices, feature_subsets


def model_worth(models, r_matrices, x, y):
    predicted_ys = []
    for i, model in enumerate(models):
        x_mod = x.dot(r_matrices[i])
        predicted_y = model.predict(x_mod)
        predicted_ys.append(predicted_y)
    predicted_matrix = np.asmatrix(predicted_ys)
    final_prediction = []
    for i in range(len(y)):
        pred_from_all_models = np.ravel(predicted_matrix[:, i])
        non_zero_pred = np.nonzero(pred_from_all_models)[0]
        is_one = len(non_zero_pred) > len(models) / 2
        final_prediction.append(is_one)
    print classification_report(y, final_prediction)

Finally, we will write a main function used to invoke the functions that we have defined:

if __name__ == "__main__":
    x, y = get_data()
    # plot_data(x,y)

    # Divide the data into Train, dev and test
    x_train, x_test_all, y_train, y_test_all = train_test_split(x, y, test_size=0.3, random_state=9)
    x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all, test_size=0.3, random_state=9)

    # Build a bag of models
    models, r_matrices, features = build_rotationtree_model(x_train, y_train, 25, 5)
    model_worth(models, r_matrices, x_train, y_train)
    model_worth(models, r_matrices, x_dev, y_dev)
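The listing above reserves a test split (x_test, y_test) that is never used while building the models. The lines below are not part of the original recipe; they are a sketch of one natural way to finish, assuming the functions and variables defined above are in scope: score the untouched test split with the same helper, and probe the two knobs the model exposes (ensemble size and features per subset) against the dev split.

# Hypothetical final check on the untouched test split
model_worth(models, r_matrices, x_test, y_test)

# The ensemble size (25) and the features per subset (5) are only starting points;
# one simple probe is to rebuild with different ensemble sizes and re-score on the dev split
for d in (10, 25, 50):
    models_d, r_matrices_d, _ = build_rotationtree_model(x_train, y_train, d, 5)
    model_worth(models_d, r_matrices_d, x_dev, y_dev)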
Walking through our code

Let's start with our main function. We invoke get_data to get our predictor attributes, x, and the response attribute, y. In get_data, we leverage the make_classification method to generate the training data for our recipe:

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable y, dependent variable x
    """
    no_features = 30
    redundant_features = int(0.1 * no_features)
    informative_features = int(0.6 * no_features)
    repeated_features = int(0.1 * no_features)
    x, y = make_classification(n_samples=500, n_features=no_features, flip_y=0.03,
                               n_informative=informative_features, n_redundant=redundant_features,
                               n_repeated=repeated_features, random_state=7)
    return x, y

Let's look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we say we need 500 instances. The second parameter is the number of attributes per instance; we say that we need 30. The third parameter, flip_y, randomly flips the labels of 3% of the instances; this is done to introduce some noise into our data. The next parameter specifies how many of these 30 features should be informative enough to be used in our classification; we have specified that 60% of our features, that is, 18 out of 30, should be informative. The next parameter is about redundant features; these are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, the repeated features are duplicate features, which are drawn randomly from both the informative and the redundant features.

Let's split the data into training and testing sets using train_test_split. We will reserve 30% of our data for testing:

# Divide the data into Train, dev and test
x_train, x_test_all, y_train, y_test_all = train_test_split(x, y, test_size=0.3, random_state=9)

Once again, we will leverage train_test_split to split our test data into dev and test sets:

x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all, test_size=0.3, random_state=9)

With the data divided to build, evaluate, and test the model, we proceed to build our models:

models, r_matrices, features = build_rotationtree_model(x_train, y_train, 25, 5)

We invoke the build_rotationtree_model function to build our rotation forest. We pass our training data (the predictors, x_train, and the response variable, y_train), the total number of trees to be built (25 in this case), and finally the number of features per subset (5 in this case). Let's jump into this function:

models = []
r_matrices = []
feature_subsets = []

We begin by declaring three lists to store each decision tree, the rotation matrix for that tree, and the subset of features used in that iteration. We then proceed to build each tree in our ensemble. As a first order of business, we take a random sample of the data for this tree; with test_size=0.3, train_test_split retains 70% of the training instances (close to the 75% bootstrap mentioned in the algorithm description):

x, _, _, _ = train_test_split(x_train, y_train, test_size=0.3, random_state=7)

We leverage the train_test_split function from scikit-learn for this sampling. We then decide the feature subsets:

# Features ids
feature_index = range(x.shape[1])
# Get subsets of features
random_k_subset = get_random_subset(feature_index, k)
feature_subsets.append(random_k_subset)
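Before stepping through get_random_subset line by line, it may help to see roughly how it behaves on a toy input. The snippet below is purely illustrative, not part of the recipe, and assumes the function from the listing above is in scope; the exact grouping you get depends on the shuffle:

feature_index = range(10)                 # ten feature ids, 0 through 9 (a list in Python 2)
subsets = get_random_subset(feature_index, 5)
print(subsets)
# Something like [[3, 7, 0, 9, 2], [8, 1, 6, 4, 5]]: two disjoint groups of five ids each.
# Note that the input list is consumed along the way; feature_index is empty afterwards.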
The get_random_subset function takes the feature index and the number of features required per subset, k, as parameters and returns a list of nonoverlapping subsets. In this function, we first shuffle the feature index. The feature index is an array of numbers starting from zero and ending with the number of features in our training set:

np.random.shuffle(iterable)

Let's say that we have ten features and our k value is five, indicating that we need subsets of five nonoverlapping feature indices, so we need to do two iterations. We store the number of iterations needed in the limit variable:

limit = len(iterable) / k
while iteration < limit:
    if k <= len(iterable):
        subset = k
    else:
        subset = len(iterable)
    iteration += 1

If the required subset size does not exceed the number of attributes left, we proceed to take the last k entries from our iterable. As we shuffled our iterable, we will be returning different feature indices at different times:

subsets.append(iterable[-subset:])

On selecting a subset, we remove it from the iterable, as we need nonoverlapping sets:

del iterable[-subset:]

With all the subsets ready, we declare our rotation matrix:

# Rotation matrix
R_matrix = np.zeros((x.shape[1], x.shape[1]), dtype=float)

As you can see, our rotation matrix is of size n x n, where n is the number of attributes in our dataset. We have used the shape attribute to declare this matrix filled with zeros:

for each_subset in random_k_subset:
    pca = PCA()
    x_subset = x[:, each_subset]
    pca.fit(x_subset)

For each of the subsets of data, having only its k features, we proceed to do a principal component analysis. We fill our rotation matrix with the component values:

for ii in range(0, len(pca.components_)):
    for jj in range(0, len(pca.components_)):
        R_matrix[each_subset[ii], each_subset[jj]] = pca.components_[ii, jj]

For example, let's say that we have three attributes in our subset out of a total of six attributes. For illustration, let's say that our subsets are as follows:

2,4,6 and 1,3,5

Our rotation matrix, R, is of size 6 x 6. Assume that we want to fill the rotation matrix for the first subset of features. We will have three principal components, one each for 2, 4, and 6, of size 1 x 3. The output of PCA from scikit-learn is a matrix of size components x features. We go through each component value in the for loop. At the first run, our feature of interest is two, and the cell (0,0) in the component matrix output from PCA gives the value of the contribution of feature two to component one. We have to find the right place in the rotation matrix for this value. We use the indices from the component matrix, ii and jj, together with the subset list to get to the right place in the rotation matrix:

R_matrix[each_subset[ii], each_subset[jj]] = pca.components_[ii, jj]

each_subset[0] and each_subset[0] put us in cell (2,2) in the rotation matrix. As we go through the loop, the next component value, in cell (0,1) of the component matrix, will be placed in cell (2,4) in the rotation matrix, and the last one in cell (2,6). This is done for all the attributes in the first subset. Let's go to the second subset; here the first attribute is one. The cell (0,0) of the component matrix corresponds to the cell (1,1) in the rotation matrix. Proceeding this way, you will notice that the attribute component values are arranged in the same order as the attributes themselves.
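To make the placement logic concrete, here is a small standalone sketch of the same idea. It is illustrative only: the data is random and the two subsets, written here with zero-based column indices, are made up. The non-zero entries of the resulting matrix appear exactly at the row and column positions named by each subset:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(7)
x = rng.rand(100, 6)                      # toy data: 100 rows, 6 attributes
subsets = [[1, 3, 5], [0, 2, 4]]          # two made-up, nonoverlapping attribute groups

R = np.zeros((6, 6))
for each_subset in subsets:
    pca = PCA()
    pca.fit(x[:, each_subset])            # PCA on the three columns of this subset
    for ii in range(len(pca.components_)):
        for jj in range(len(pca.components_)):
            # component ii's loading for the jj-th attribute of this subset lands at the
            # row/column positions taken from the subset itself
            R[each_subset[ii], each_subset[jj]] = pca.components_[ii, jj]

print(np.round(R, 2))    # loadings appear only at positions drawn from each subset; the rest stay zero
x_rotated = x.dot(R)     # the whole dataset projected onto the rotation matrix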
With our rotation matrix ready, let's project our input onto it:

x_transformed = x_train.dot(R_matrix)

It's now time to fit our decision tree:

model = DecisionTreeClassifier()
model.fit(x_transformed, y_train)

Finally, we store our models and the corresponding rotation matrices:

models.append(model)
r_matrices.append(R_matrix)

With our model built, let's proceed to see how good it is on both the train and dev datasets using the model_worth function:

model_worth(models, r_matrices, x_train, y_train)
model_worth(models, r_matrices, x_dev, y_dev)

Let's see our model_worth function:

for i, model in enumerate(models):
    x_mod = x.dot(r_matrices[i])
    predicted_y = model.predict(x_mod)
    predicted_ys.append(predicted_y)

In this function, we perform a prediction using each of the trees that we built. However, before doing the prediction, we project our input using that tree's rotation matrix. We store all the prediction output in a list called predicted_ys. Let's say that we have 100 instances to predict and ten models in our ensemble; for each instance, we then have ten predictions. We store these as a matrix for convenience:

predicted_matrix = np.asmatrix(predicted_ys)

Now, we proceed to give a final classification for each of our input records:

final_prediction = []
for i in range(len(y)):
    pred_from_all_models = np.ravel(predicted_matrix[:, i])
    non_zero_pred = np.nonzero(pred_from_all_models)[0]
    is_one = len(non_zero_pred) > len(models) / 2
    final_prediction.append(is_one)

We store our final prediction in a list called final_prediction and go through each of the predictions for an instance. Let's say that we are at the first instance (i=0 in our for loop); pred_from_all_models stores the output from all the trees in our model. It's an array of zeros and ones indicating the class that each model assigned to this instance. We make another array out of it, non_zero_pred, which has only those entries from the parent array that are non-zero. Finally, if the length of this non-zero array is greater than half the number of models that we have, we say that our final prediction is one for the instance of interest. What we have accomplished here is the classic voting scheme. Let's look at how good our models are by printing a classification report:

print classification_report(y, final_prediction)

This report shows how well the model has performed on the training set, and the second call to model_worth shows its performance on the dev dataset.

References

More information about rotation forest can be obtained from the following paper:

Rotation Forest: A New Classifier Ensemble Method, Juan J. Rodríguez, Member, IEEE Computer Society, Ludmila I. Kuncheva, Member, IEEE, and Carlos J. Alonso

The paper also claims that when rotation forest was compared to bagging, AdaBoost, and random forest on 33 datasets, rotation forest outperformed the other three algorithms. Similar to gradient boosting, the authors claim that rotation forest is an overall framework and the underlying estimator need not be a decision tree; work is in progress on testing other estimators such as Naïve Bayes, neural networks, and others.

Summary

In this article, we saw why bagging methods based on decision trees are popular in the data science community, and walked through the rotation forest algorithm and a Python implementation of it.
Resources for Article: Further resources on this subject: Mobile Phone Forensics – A First Step into Android Forensics [article] Develop a Digital Clock [article] Monitoring Physical Network Bandwidth Using OpenStack Ceilometer [article]

An Introduction to Kibana

Packt
28 Oct 2015
28 min read
In this article by Yuvraj Gupta, author of the book Kibana Essentials, we will see that Kibana is a tool that is part of the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It is built and developed by Elastic. Kibana is a visualization platform that is built on top of Elasticsearch and leverages its functionalities.

(For more resources related to this topic, see here.)

To understand Kibana better, let's check out the following diagram: This diagram shows that Logstash is used to push data directly into Elasticsearch. This data is not limited to log data, but can include any type of data. Elasticsearch stores the data that comes as input from Logstash, and Kibana uses the data stored in Elasticsearch to provide visualizations. So, Logstash provides an input stream of data to Elasticsearch, from which Kibana accesses the data and uses it to create visualizations. Kibana acts as an over-the-top layer of Elasticsearch, providing beautiful visualizations for data (structured or nonstructured) stored in it.

Kibana is an open source analytics product used to search, view, and analyze data. It provides various types of visualizations to visualize data in the form of tables, charts, maps, histograms, and so on. It also provides a web-based interface that can easily handle a large amount of data. It helps create dashboards that are easy to build and helps query data in real time. Dashboards are nothing but an interface for underlying JSON documents. They are used for saving, templating, and exporting. They are simple to set up and use, which helps us play with data stored in Elasticsearch in minutes without requiring any coding.

Kibana is an Apache-licensed product that aims to provide a flexible interface combined with the powerful searching capabilities of Elasticsearch. To work, it requires a web server (included in the Kibana 4 package) and any modern web browser, that is, a browser that supports industry standards and renders the web page in the same way across all browsers. It connects to Elasticsearch using the REST API. It helps to visualize data in real time with the use of dashboards to provide real-time insights. As Kibana uses the functionalities of Elasticsearch, it is easier to learn Kibana by understanding the core functionalities of Elasticsearch.

In this article, we are going to take a look at the following topics:

The basic concepts of Elasticsearch
Installation of Java
Installation of Elasticsearch
Installation of Kibana
Importing a JSON file into Elasticsearch

Understanding Elasticsearch

Elasticsearch is a search server built on top of Lucene (licensed under Apache), which is completely written in Java. It supports distributed searches in a multitenant environment. It is a scalable search engine allowing high flexibility of adding machines easily. It provides a full-text search engine combined with a RESTful web interface and JSON documents. Elasticsearch harnesses the functionalities of the Lucene Java libraries, adding proper APIs, scalability, and flexibility on top of the Lucene full-text search library. All querying done using Elasticsearch, that is, searching text, matching text, creating indexes, and so on, is implemented by Apache Lucene. Without a setup of Elastic Shield or any other proxy mechanism, any user with access to the Elasticsearch API can view all the data stored in the cluster.
The basic concepts of Elasticsearch Let's explore some of the basic concepts of Elasticsearch: Field: This is the smallest single unit of data stored in Elasticsearch. It is similar to a column in a traditional relational database. Every document contains key-value pairs, which are referred to as fields. Values in a field can contain a single value, such as integer [27], string ["Kibana"], or multiple values, such as array [1, 2, 3, 4, 5]. The field type is responsible for specifying which type of data can be stored in a particular field, for example, integer, string, date, and so on. Document: This is the simplest unit of information stored in Elasticsearch. It is a collection of fields. It is considered similar to a row of a table in a traditional relational database. A document can contain any type of entry, such as a document for a single restaurant, another document for a single cuisine, and yet another for a single order. Documents are in JavaScript Object Notation (JSON), which is a language-independent data interchange format. JSON contains key-value pairs. Every document that is stored in Elasticsearch is indexed. Every document contains a type and an ID. An example of a document that has JSON values is as follows: { "name": "Yuvraj", "age": 22, "birthdate": "2015-07-27", "bank_balance": 10500.50, "interests": ["playing games","movies","travelling"], "movie": {"name":"Titanic","genre":"Romance","year" : 1997} } In the preceding example, we can see that the document supports JSON, having key-value pairs, which are explained as follows: The name field is of the string type The age field is of the numeric type The birthdate field is of the date type The bank_balance field is of the float type The interests field contains an array The movie field contains an object (dictionary) Type: This is similar to a table in a traditional relational database. It contains a list of fields, which is defined for every document. A type is a logical segregation of indexes, whose interpretation/semantics entirely depends on you. For example, you have data about the world and you put all your data into an index. In this index, you can define a type for continent-wise data, another type for country-wise data, and a third type for region-wise data. Types are used with a mapping API; it specifies the type of its field. An example of type mapping is as follows: { "user": { "properties": { "name": { "type": "string" }, "age": { "type": "integer" }, "birthdate": { "type": "date" }, "bank_balance": { "type": "float" }, "interests": { "type": "string" }, "movie": { "properties": { "name": { "type": "string" }, "genre": { "type": "string" }, "year": { "type": "integer" } } } } } } Now, let's take a look at the core data types specified in Elasticsearch, as follows: Type Definition string This contains text, for example, "Kibana" integer This contains a 32-bit integer, for example, 7 long This contains a 64-bit integer float IEEE float, for example, 2.7 double This is a double-precision float boolean This can be true or false date This is the UTC date/time, for example, "2015-06-30T13:10:10" geo_point This is the latitude or longitude Index: This is a collection of documents (one or more than one). It is similar to a database in the analogy with traditional relational databases. For example, you can have an index for user information, transaction information, and product type. An index has a mapping; this mapping is used to define multiple types. In other words, an index can contain single or multiple types. 
An index is defined by a name, which is always used whenever referring to an index to perform search, update, and delete operations for documents. You can define any number of indexes you require. Indexes also act as logical namespaces that map documents to primary shards, which contain zero or more replica shards for replicating data. With respect to traditional databases, the basic analogy is similar to the following: MySQL => Databases => Tables => Columns/Rows Elasticsearch => Indexes => Types => Documents with Fields You can store a single document or multiple documents within a type or index. As a document is within an index, it must also be assigned to a type within an index. Moreover, the maximum number of documents that you can store in a single index is 2,147,483,519 (2 billion 147 million), which is equivalent to Integer.Max_Value. ID: This is an identifier for a document. It is used to identify each document. If it is not defined, it is autogenerated for every document.The combination of index, type, and ID must be unique for each document. Mapping: Mappings are similar to schemas in a traditional relational database. Every document in an index has a type. A mapping defines the fields, the data type for each field, and how the field should be handled by Elasticsearch. By default, a mapping is automatically generated whenever a document is indexed. If the default settings are overridden, then the mapping's definition has to be provided explicitly. Node: This is a running instance of Elasticsearch. Each node is part of a cluster. On a standalone machine, each Elasticsearch server instance corresponds to a node. Multiple nodes can be started on a single standalone machine or a single cluster. The node is responsible for storing data and helps in the indexing/searching capabilities of a cluster. By default, whenever a node is started, it is identified and assigned a random Marvel Comics character name. You can change the configuration file to name nodes as per your requirement. A node also needs to be configured in order to join a cluster, which is identifiable by the cluster name. By default, all nodes join the Elasticsearch cluster; that is, if any number of nodes are started up on a network/machine, they will automatically join the Elasticsearch cluster. Cluster: This is a collection of nodes and has one or multiple nodes; they share a single cluster name. Each cluster automatically chooses a master node, which is replaced if it fails; that is, if the master node fails, another random node will be chosen as the new master node, thus providing high availability. The cluster is responsible for holding all of the data stored and provides a unified view for search capabilities across all nodes. By default, the cluster name is Elasticsearch, and it is the identifiable parameter for all nodes in a cluster. All nodes, by default, join the Elasticsearch cluster. While using a cluster in the production phase, it is advisable to change the cluster name for ease of identification, but the default name can be used for any other purpose, such as development or testing.The Elasticsearch cluster contains single or multiple indexes, which contain single or multiple types. All types contain single or multiple documents, and every document contains single or multiple fields. Sharding: This is an important concept of Elasticsearch while understanding how Elasticsearch allows scaling of nodes, when having a large amount of data termed as big data. 
An index can store any amount of data, but if it exceeds its disk limit, then searching would become slow and be affected. For example, the disk limit is 1 TB, and an index contains a large number of documents, which may not fit completely within 1 TB in a single node. To counter such problems, Elasticsearch provides shards. These break the index into multiple pieces. Each shard acts as an independent index that is hosted on a node within a cluster. Elasticsearch is responsible for distributing shards among nodes. There are two purposes of sharding: allowing horizontal scaling of the content volume, and improving performance by providing parallel operations across various shards that are distributed on nodes (single or multiple, depending on the number of nodes running).Elasticsearch helps move shards among multiple nodes in the event of an addition of new nodes or a node failure. There are two types of shards, as follows: Primary shard: Every document is stored within a primary index. By default, every index has five primary shards. This parameter is configurable and can be changed to define more or fewer shards as per the requirement. A primary shard has to be defined before the creation of an index. If no parameters are defined, then five primary shards will automatically be created.Whenever a document is indexed, it is usually done on a primary shard initially, followed by replicas. The number of primary shards defined in an index cannot be altered once the index is created. Replica shard: Replica shards are an important feature of Elasticsearch. They help provide high availability across nodes in the cluster. By default, every primary shard has one replica shard. However, every primary shard can have zero or more replica shards as required. In an environment where failure directly affects the enterprise, it is highly recommended to use a system that provides a failover mechanism to achieve high availability. To counter this problem, Elasticsearch provides a mechanism in which it creates single or multiple copies of indexes, and these are termed as replica shards or replicas. A replica shard is a full copy of the primary shard. Replica shards can be dynamically altered. Now, let's see the purposes of creating a replica. It provides high availability in the event of failure of a node or a primary shard. If there is a failure of a primary shard, replica shards are automatically promoted to primary shards. Increase performance by providing parallel operations on replica shards to handle search requests.A replica shard is never kept on the same node as that of the primary shard from which it was copied. Inverted index: This is also a very important concept in Elasticsearch. It is used to provide fast full-text search. Instead of searching text, it searches for an index. It creates an index that lists unique words occurring in a document, along with the document list in which each word occurs. For example, suppose we have three documents. 
They have a text field, and it contains the following:

I am learning Kibana
Kibana is an amazing product
Kibana is easy to use

To create an inverted index, the text field is broken into words (also known as terms), a list of unique words is created, and a listing is made of the documents in which each term occurs, as shown in this table:

Term       Doc 1   Doc 2   Doc 3
I          X
Am         X
Learning   X
Kibana     X       X       X
Is                 X       X
An                 X
Amazing            X
Product            X
Easy                       X
To                         X
Use                        X

Now, if we search for is Kibana, Elasticsearch will use the inverted index to display the results:

Term       Doc 1   Doc 2   Doc 3
Is                 X       X
Kibana     X       X       X

With inverted indexes, Elasticsearch uses the functionality of Lucene to provide fast full-text search results. An inverted index uses an index based on keywords (terms) instead of a document-based index.

REST API: This stands for Representational State Transfer. It is a stateless client-server protocol that uses HTTP requests to store, view, and delete data. It supports CRUD operations (short for Create, Read, Update, and Delete) using HTTP. It is used to communicate with Elasticsearch and is implemented by all languages. Communication with Elasticsearch happens over port 9200 (by default), which is accessible from any web browser. Also, Elasticsearch can be communicated with directly via the command line using the curl command. cURL is a command-line tool used to send, view, or delete data using URL syntax, following the HTTP structure. A cURL request is similar to an HTTP request, which is as follows:

curl -X <VERB> '<PROTOCOL>://<HOSTNAME>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

The terms marked within the <> tags are variables, which are defined as follows:

VERB: This is used to provide an appropriate HTTP method, such as GET (to get data), POST, PUT (to store data), or DELETE (to delete data).
PROTOCOL: This is used to define whether the HTTP or HTTPS protocol is used to send requests.
HOSTNAME: This is used to define the hostname of a node present in the Elasticsearch cluster. By default, the hostname of Elasticsearch is localhost.
PORT: This is used to define the port on which Elasticsearch is running. By default, Elasticsearch runs on port 9200.
PATH: This is used to define the index, type, and ID where the documents will be stored, searched, or deleted. It is specified as index/type/ID.
QUERY_STRING: This is used to define any additional query parameter for searching data.
BODY: This is used to define a JSON-encoded request within the body.

In order to put data into Elasticsearch, the following curl command is used:

curl -XPUT 'http://localhost:9200/testing/test/1' -d '{"name": "Kibana" }'

Here, testing is the name of the index, test is the name of the type within the index, and 1 indicates the ID number. To search for the preceding stored data, the following curl command is used:

curl -XGET 'http://localhost:9200/testing/_search?'

The preceding commands are provided just to give you an overview of the format of the curl command.

Prerequisites for installing Kibana 4.1.1

The following pieces of software need to be installed before installing Kibana 4.1.1:

Java 1.8u20+
Elasticsearch v1.4.4+
A modern web browser—IE 10+, Firefox, Chrome, Safari, and so on

The installation process will be covered separately for Windows and Ubuntu so that both types of users are able to understand the process of installation easily.

Installation of Java

In this section, JDK needs to be installed so as to access Elasticsearch.
Oracle Java 8 (update 20 onwards) will be installed as it is the recommended version for Elasticsearch from version 1.4.4 onwards. Installation of Java on Ubuntu 14.04 Install Java 8 using the terminal and the apt package in the following manner: Add the Oracle Java Personal Package Archive (PPA) to the apt repository list: sudo add-apt-repository -y ppa:webupd8team/java In this case, we use a third-party repository; however, the WebUpd8 team is trusted to install Java. It does not include any Java binaries. Instead, the PPA directly downloads from Oracle and installs it. As shown in the preceding screenshot, you will initially be prompted for the password for running the sudo command (only when you have not logged in as root), and on successful addition to the repository, you will receive an OK message, which means that the repository has been imported. Update the apt package database to include all the latest files under the packages: sudo apt-get update Install the latest version of Oracle Java 8: sudo apt-get -y install oracle-java8-installer Also, during the installation, you will be prompted to accept the license agreement, which pops up as follows: To check whether Java has been successfully installed, type the following command in the terminal:java –version This signifies that Java has been installed successfully. Installation of Java on Windows We can install Java on windows by going through the following steps: Download the latest version of the Java JDK from the Sun Microsystems site at http://www.oracle.com/technetwork/java/javase/downloads/index.html:                                                                                     As shown in the preceding screenshot, click on the DOWNLOAD button of JDK to download. You will be redirected to the download page. There, you have to first click on the Accept License Agreement radio button, followed by the Windows version to download the .exe file, as shown here: Double-click on the file to be installed and it will open as an installer. Click on Next, accept the license by reading it, and keep clicking on Next until it shows that JDK has been installed successfully. Now, to run Java on Windows, you need to set the path of JAVA in the environment variable settings of Windows. Firstly, open the properties of My Computer. Select the Advanced system settings and then click on the Advanced tab, wherein you have to click on the environment variables option, as shown in this screenshot: After opening Environment Variables, click on New (under the System variables) and give the variable name as JAVA_HOME and variable value as C:Program FilesJavajdk1.8.0_45 (do check in your system where jdk has been installed and provide the path corresponding to the version installed as mentioned in system directory), as shown in the following screenshot: Then, double-click on the Path variable (under the System variables) and move towards the end of textbox. Insert a semicolon if it is not already inserted, and add the location of the bin folder of JDK, like this: %JAVA_HOME%bin. Next, click on OK in all the windows opened. Do not delete anything within the path variable textbox. To check whether Java is installed or not, type the following command in Command Prompt: java –version This signifies that Java has been installed successfully. Installation of Elasticsearch In this section, Elasticsearch, which is required to access Kibana, will be installed. 
Elasticsearch v1.5.2 will be installed, and this section covers the installation on Ubuntu and Windows separately. Installation of Elasticsearch on Ubuntu 14.04 To install Elasticsearch on Ubuntu, perform the following steps: Download Elasticsearch v 1.5.2 as a .tar file using the following command on the terminal:  curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.tar.gz Curl is a package that may not be installed on Ubuntu by the user. To use curl, you need to install the curl package, which can be done using the following command: sudo apt-get -y install curl Extract the downloaded .tar file using this command: tar -xvzf elasticsearch-1.5.2.tar.gzThis will extract the files and folder into the current working directory. Navigate to the bin directory within the elasticsearch-1.5.2 directory: cd elasticsearch-1.5.2/bin Now run Elasticsearch to start the node and cluster, using the following command:./elasticsearch The preceding screenshot shows that the Elasticsearch node has been started, and it has been given a random Marvel Comics character name. If this terminal is closed, Elasticsearch will stop running as this node will shut down. However, if you have multiple Elasticsearch nodes running, then shutting down a node will not result in shutting down Elasticsearch. To verify the Elasticsearch installation, open http://localhost:9200 in your browser. Installation of Elasticsearch on Windows The installation on Windows can be done by following similar steps as in the case of Ubuntu. To use curl commands on Windows, we will be installing GIT. GIT will also be used to import a sample JSON file into Elasticsearch using elasticdump, as described in the Importing a JSON file into Elasticsearch section. Installation of GIT To run curl commands on Windows, first download and install GIT, then perform the following steps: Download the GIT ZIP package from https://git-scm.com/download/win. Double-click on the downloaded file, which will walk you through the installation process. Keep clicking on Next by not changing the default options until the Finish button is clicked on. To validate the GIT installation, right-click on any folder in which you should be able to see the options of GIT, such as GIT Bash, as shown in the following screenshot: The following are the steps required to install Elasticsearch on Windows: Open GIT Bash and enter the following command in the terminal:  curl –L –O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.zip Extract the downloaded ZIP package by either unzipping it using WinRar, 7Zip, and so on (if you don't have any of these, download one of them) or using the following command in GIT Bash: unzip elasticsearch-1.5.2.zip This will extract the files and folder into the directory. Then click on the extracted folder and navigate through it to reach the bin folder. Click on the elasticsearch.bat file to run Elasticsearch. The preceding screenshot shows that the Elasticsearch node has been started, and it is given a random Marvel Comics character's name. Again, if this window is closed, Elasticsearch will stop running as this node will shut down. However, if you have multiple Elasticsearch nodes running, then shutting down a node will not result in shutting down Elasticsearch. To verify the Elasticsearch installation, open http://localhost:9200 in your browser. Installation of Kibana In this section, Kibana will be installed. 
We will install Kibana v4.1.1, and this section covers installations on Ubuntu and Windows separately. Installation of Kibana on Ubuntu 14.04 To install Kibana on Ubuntu, follow these steps: Download Kibana version 4.1.1 as a .tar file using the following command in the terminal:  curl -L -O https://download.elasticsearch.org/kibana/kibana/kibana-4.1.1-linux-x64.tar.gz Extract the downloaded .tar file using this command: tar -xvzf kibana-4.1.1-linux-x64.tar.gz The preceding command will extract the files and folder into the current working directory. Navigate to the bin directory within the kibana-4.1.1-linux-x64 directory: cd kibana-4.1.1-linux-x64/bin Now run Kibana to start the node and cluster using the following command: Make sure that Elasticsearch is running. If it is not running and you try to start Kibana, the following error will be displayed after you run the preceding command: To verify the Kibana installation, open http://localhost:5601 in your browser. Installation of Kibana on Windows To install Kibana on Windows, perform the following steps: Open GIT Bash and enter the following command in the terminal:  curl -L -O https://download.elasticsearch.org/kibana/kibana/kibana-4.1.1-windows.zip Extract the downloaded ZIP package by either unzipping it using WinRar or 7Zip (download it if you don't have it), or using the following command in GIT Bash: unzip kibana-4.1.1-windows.zip This will extract the files and folder into the directory. Then click on the extracted folder and navigate through it to get to the bin folder. Click on the kibana.bat file to run Kibana. Make sure that Elasticsearch is running. If it is not running and you try to start Kibana, the following error will be displayed after you click on the kibana.bat file: Again, to verify the Kibana installation, open http://localhost:5601 in your browser. Additional information You can change the Elasticsearch configuration for your production environment, wherein you have to change parameters such as the cluster name, node name, network address, and so on. This can be done using the information mentioned in the upcoming sections.. Changing the Elasticsearch configuration To change the Elasticsearch configuration, perform the following steps: Run the following command in the terminal to open the configuration file: sudo vi ~/elasticsearch-1.5.2/config/elasticsearch.yml Windows users can open the elasticsearch.yml file from the config folder. This will open the configuration file as follows: The cluster name can be changed, as follows: #cluster.name: elasticsearch to cluster.name: "your_cluster_name". In the preceding figure, the cluster name has been changed to test. Then, we save the file. To verify that the cluster name has been changed, run Elasticsearch as mentioned in the earlier section. Then open http://localhost:9200 in the browser to verify, as shown here: In the preceding screenshot, you can notice that cluster_name has been changed to test, as specified earlier. 
Changing the Kibana configuration To change the Kibana configuration, follow these steps: Run the following command in the terminal to open the configuration file: sudo vi ~/kibana-4.1.1-linux-x64/config/kibana.yml Windows users can open the kibana.yml file from the config folder In this file, you can change various parameters such as the port on which Kibana works, the host address on which Kibana works, the URL of Elasticsearch that you wish to connect to, and so on For example, the port on which Kibana works can be changed by changing the port address. As shown in the following screenshot, port: 5601 can be changed to any other port, such as port: 5604. Then we save the file. To check whether Kibana is running on port 5604, run Kibana as mentioned earlier. Then open http://localhost:5604 in the browser to verify, as follows: Importing a JSON file into Elasticsearch To import a JSON file into Elasticsearch, we will use the elasticdump package. It is a set of import and export tools used for Elasticsearch. It makes it easier to copy, move, and save indexes. To install elasticdump, we will require npm and Node.js as prerequisites. Installation of npm In this section, npm along with Node.js will be installed. This section covers the installation of npm and Node.js on Ubuntu and Windows separately. Installation of npm on Ubuntu 14.04 To install npm on Ubuntu, perform the following steps: Add the official Node.js PPA: sudo curl --silent --location https://deb.nodesource.com/setup_0.12 | sudo bash - As shown in the preceding screenshot, the command will add the official Node.js repository to the system and update the apt package database to include all the latest files under the packages. At the end of the execution of this command, we will be prompted to install Node.js and npm, as shown in the following screenshot: Install Node.js by entering this command in the terminal: sudo apt-get install --yes nodejs This will automatically install Node.js and npm as npm is bundled within Node.js. To check whether Node.js has been installed successfully, type the following command in the terminal: node –v Upon successful installation, it will display the version of Node.js. Now, to check whether npm has been installed successfully, type the following command in the terminal: npm –v Upon successful installation, it will show the version of npm. Installation of npm on Windows To install npm on Windows, follow these steps: Download the Windows Installer (.msi) file by going to https://nodejs.org/en/download/. Double-click on the downloaded file and keep clicking on Next to install the software. To validate the successful installation of Node.js, right-click and select GIT Bash. In GIT Bash, enter this: node –v Upon successful installation, you will be shown the version of Node.js. To validate the successful installation of npm, right-click and select GIT Bash. In GIT Bash, enter the following line: npm –v Upon successful installation, it will show the version of npm. Installing elasticdump In this section, elasticdump will be installed. It will be used to import a JSON file into Elasticsearch. It requires npm and Node.js installed. This section covers the installation on Ubuntu and Windows separately. 
Installing elasticdump on Ubuntu 14.04 Perform these steps to install elasticdump on Ubuntu: Install elasticdump by typing the following command in the terminal: sudo npm install elasticdump -g Then run elasticdump by typing this command in the terminal: elasticdump Import a sample data (JSON) file into Elasticsearch, which can be downloaded from https://github.com/guptayuvraj/Kibana_Essentials and is named tweet.json. It will be imported into Elasticsearch using the following command in the terminal: elasticdump --bulk=true --input="/home/yuvraj/Desktop/tweet.json" --output=http://localhost:9200/ Here, input provides the location of the file, as shown in the following screenshot: As you can see, data is being imported to Elasticsearch from the tweet.json file, and the dump complete message is displayed when all the records are imported to Elasticsearch successfully. Elasticsearch should be running while importing the sample file. Installing elasticdump on Windows To install elasticdump on Windows, perform the following steps: Install elasticdump by typing the following command in GIT Bash: npm install elasticdump -g                                                                                                           Then run elasticdump by typing this command in GIT Bash: elasticdump Import the sample data (JSON) file into Elasticsearch, which can be downloaded from https://github.com/guptayuvraj/Kibana_Essentials and is named tweet.json. It will be imported to Elasticsearch using the following command in GIT Bash: elasticdump --bulk=true --input="C:UsersyguptaDesktoptweet.json" --output=http://localhost:9200/ Here, input provides the location of the file. The preceding screenshot shows data being imported to Elasticsearch from the tweet.json file, and the dump complete message is displayed when all the records are imported to Elasticsearch successfully. Elasticsearch should be running while importing the sample file. To verify that the data has been imported to Elasticsearch, open http://localhost:5601 in your browser, and this is what you should see: When Kibana is opened, you have to configure an index pattern. So, if data has been imported, you can enter the index name, which is mentioned in the tweet.json file as index: tweet. After the page loads, you can see to the left under Index Patterns the name of the index that has been imported (tweet). Now mention the index name as tweet. It will then automatically detect the timestamped field and will provide you with an option to select the field. If there are multiple fields, then you can select them by clicking on Time-field name, which will provide a drop-down list of all fields available, as shown here: Finally, click on Create to create the index in Kibana. After you have clicked on Create, it will display the various fields present in this index. If you do not get the options of Time-field name and Create after entering the index name as tweet, it means that the data has not been imported into Elasticsearch. Summary In this article, you learned about Kibana, along with the basic concepts of Elasticsearch. These help in the easy understanding of Kibana. We also looked at the prerequisites for installing Kibana, followed by a detailed explanation of how to install each component individually in Ubuntu and Windows. Resources for Article: Further resources on this subject: Understanding Ranges [article] Working On Your Bot [article] Welcome to the Land of BludBorne [article]

Making 3D Visualizations

Packt
26 Oct 2015
5 min read
Python has become the preferred language of data scientists for data analysis, visualization, and machine learning. It features numerical and mathematical toolkits such as NumPy, SciPy, scikit-learn, matplotlib, and pandas, as well as an R-like environment with IPython, all used for data analysis, visualization, and machine learning. In this article by Dimitry Foures and Giuseppe Vettigli, authors of the book Python Data Visualization Cookbook, Second Edition, we will see how visualization in 3D is sometimes effective and sometimes inevitable. In this article, you will learn how 3D bars are created.

(For more resources related to this topic, see here.)

Creating 3D bars

Although matplotlib is mainly focused on plotting in 2D, there are different extensions that enable us to plot over geographical maps, integrate more with Excel, and plot in 3D. These extensions are called toolkits in the matplotlib world. A toolkit is a collection of specific functions that focuses on one topic, such as plotting in 3D. Popular toolkits are Basemap, GTK Tools, Excel Tools, Natgrid, AxesGrid, and mplot3d. We will explore mplot3d in this recipe. The mpl_toolkits.mplot3d toolkit provides some basic 3D plotting. The plots supported are scatter, surf, line, and mesh. Although this is not the best 3D plotting library, it comes with matplotlib, and we are already familiar with its interface.

Getting ready

Basically, we still need to create a figure and add the desired axes to it. The difference is that we specify a 3D projection for the figure, and the axes we add are Axes3D. Now, we can use almost the same functions for plotting. Of course, the difference is in the arguments passed, for we now have three axes for which we need to provide data. For example, the mpl_toolkits.mplot3d.Axes3D.plot function takes the xs, ys, zs, and zdir arguments; all others are transferred directly to matplotlib.axes.Axes.plot. We will explain these specific arguments:

xs, ys: These are the coordinates for the X and Y axes
zs: These are the value(s) for the Z axis; there can be one for all points, or one for each point
zdir: This chooses which dimension will be the z axis (usually this is zs, but it can be xs or ys)

The module mpl_toolkits.mplot3d.art3d contains 3D artist code and functions to convert 2D artists into 3D versions, which can be added to an Axes3D; its rotate_axes function reorders coordinates so that the axes are rotated with zdir along. The default value of zdir is z, and prepending the axis with a '-' does the inverse transform, so zdir can be x, -x, y, -y, z, or -z.
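As a quick illustration of these arguments (this small sketch is ours, not part of the recipe that follows), the plot call below draws the same 2D curve twice on a 3D axis, pinning each copy at a different depth along the y axis via zs and zdir:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

xs = np.linspace(0, 10, 50)
ys = np.sin(xs)

# The same curve, pinned at two different depths along the y axis
ax.plot(xs, ys, zs=0, zdir='y', label='slice at y=0')
ax.plot(xs, ys, zs=5, zdir='y', label='slice at y=5')

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.legend()
plt.show()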
How to do it...

This is the code to demonstrate the plotting concept explained in the preceding section:

import random
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from mpl_toolkits.mplot3d import Axes3D

mpl.rcParams['font.size'] = 10

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

for z in [2011, 2012, 2013, 2014]:
    xs = xrange(1, 13)
    ys = 1000 * np.random.rand(12)
    color = plt.cm.Set2(random.choice(xrange(plt.cm.Set2.N)))
    ax.bar(xs, ys, zs=z, zdir='y', color=color, alpha=0.8)

ax.xaxis.set_major_locator(mpl.ticker.FixedLocator(xs))
ax.yaxis.set_major_locator(mpl.ticker.FixedLocator(ys))
ax.set_xlabel('Month')
ax.set_ylabel('Year')
ax.set_zlabel('Sales Net [usd]')

plt.show()

This code produces the following figure:

How it works...

We had to do the same prep work as in the 2D world; the difference here is that we needed to specify a 3D projection for the figure and add an Axes3D instance. Then, we generate random data for a supposed 4 years of sales (2011-2014). We needed the Z value to be the same for all the bars of one year. We picked a color at random from the Set2 color map and associated it with the collection of xs, ys pairs rendered as the bar series for that z value.

There's more...

Other plot types from 2D matplotlib are also available here. For example, scatter() has a similar interface to plot(), but with an added size for the point marker. We are also familiar with contour, contourf, and bar. New types that are available only in 3D are wireframe, surface, and tri-surface plots. For example, the following code plots a tri-surface plot of the popular Pringle function or, more mathematically, a hyperbolic paraboloid:

from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np

n_angles = 36
n_radii = 8

# An array of radii
# Does not include radius r=0, this is to eliminate duplicate points
radii = np.linspace(0.125, 1.0, n_radii)

# An array of angles
angles = np.linspace(0, 2*np.pi, n_angles, endpoint=False)

# Repeat all angles for each radius
angles = np.repeat(angles[..., np.newaxis], n_radii, axis=1)

# Convert polar (radii, angles) coords to cartesian (x, y) coords
# (0, 0) is added here. There are no duplicate points in the (x, y) plane
x = np.append(0, (radii*np.cos(angles)).flatten())
y = np.append(0, (radii*np.sin(angles)).flatten())

# Pringle surface
z = np.sin(-x*y)

fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot_trisurf(x, y, z, cmap=cm.jet, linewidth=0.2)
plt.show()

The code will give the following output:

Summary

Python Data Visualization Cookbook, Second Edition, is for developers who already know about Python programming in general. If you have heard about data visualization but don't know where to start, the book will guide you from the start and help you understand data, data formats, and data visualization, and how to use Python to visualize data. Many more visualization techniques are illustrated in a step-by-step, recipe-based approach in the book. The topics are explained sequentially as cookbook recipes consisting of a code snippet and the resulting visualization.

Resources for Article: Further resources on this subject: Basics of Jupyter Notebook and Python [article] Asynchronous Programming with Python [article] Introduction to Data Analysis and Libraries [article]