
How-To Tutorials - Data

Getting to know and manipulate Tensors in TensorFlow

Sunith Shetty
29 Dec 2017
5 min read
This article is an excerpt from the book Building Machine Learning Projects with TensorFlow, written by Rodolfo Bonnin. In this book, you will learn to build powerful machine learning projects to tackle complex data and gain valuable insights.

Today, you will learn everything about tensors: their properties and how they are used to represent data.

What are tensors?

TensorFlow bases its data management on tensors. Tensors are a concept from the field of mathematics, developed as a generalization of the linear algebra terms of vectors and matrices. Speaking specifically about TensorFlow, a tensor is just a typed, multidimensional array, with additional operations, modeled in the tensor object.

Tensor properties - ranks, shapes, and types

TensorFlow uses the tensor data structure to represent all data. Any tensor has a static type and dynamic dimensions, so you can change a tensor's internal organization in real time. Another property of tensors is that only objects of the tensor type can be passed between nodes in the computation graph. Let's now see what the properties of tensors are (from now on, every time we use the word tensor, we'll be referring to TensorFlow's tensor objects).

Tensor rank

Tensor rank represents the dimensional aspect of a tensor, but it is not the same as matrix rank. It represents the quantity of dimensions in which the tensor lives, and is not a precise measure of the extension of the tensor in rows/columns or spatial equivalents. A rank one tensor is the equivalent of a vector, and a rank two tensor is a matrix. For a rank two tensor you can access any element with the syntax t[i, j]. For a rank three tensor you would need to address an element with t[i, j, k], and so on. In the following example, we will create a tensor and access one of its components:

    import tensorflow as tf

    sess = tf.Session()
    tens1 = tf.constant([[[1, 2], [2, 3]], [[3, 4], [5, 6]]])
    print(sess.run(tens1)[1, 1, 0])

Output:

    5

This is a tensor of rank three, because in each element of the containing matrix there is a vector element:

    Rank   Math entity   Code definition example
    0      Scalar        scalar = 1000
    1      Vector        vector = [2, 8, 3]
    2      Matrix        matrix = [[4, 2, 1], [5, 3, 2], [5, 5, 6]]
    3      3-tensor      tensor = [[[4], [3], [2]], [[6], [100], [4]], [[5], [1], [4]]]
    n      n-tensor      ...

Tensor shape

The TensorFlow documentation uses three notational conventions to describe tensor dimensionality: rank, shape, and dimension number. The following table shows how these relate to one another:

    Rank   Shape                Dimension number   Example
    0      []                   0                  4
    1      [D0]                 1                  [2]
    2      [D0, D1]             2                  [6, 2]
    3      [D0, D1, D2]         3                  [7, 3, 2]
    n      [D0, D1, ... Dn-1]   n-D                A tensor with shape [D0, D1, ... Dn-1]

In the following example, we create a sample rank three tensor and print its shape (see the sketch that follows the data type table below).

Tensor data types

In addition to dimensionality, tensors have a fixed data type. You can assign any one of the following data types to a tensor:

    Data type   Python type   Description
    DT_FLOAT    tf.float32    32-bit floating point.
    DT_DOUBLE   tf.float64    64-bit floating point.
    DT_INT8     tf.int8       8-bit signed integer.
    DT_INT16    tf.int16      16-bit signed integer.
    DT_INT32    tf.int32      32-bit signed integer.
    DT_INT64    tf.int64      64-bit signed integer.
    DT_UINT8    tf.uint8      8-bit unsigned integer.
    DT_STRING   tf.string     Variable-length byte array. Each element of a tensor is a byte array.
    DT_BOOL     tf.bool       Boolean.

Creating new tensors

We can either create our own tensors, or derive them from the well-known numpy library.
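As referenced in the Tensor shape section above, the code for the rank three shape example is not reproduced in the excerpt. The following is a minimal sketch of what it could look like; it is an assumption on my part, but it only uses the TensorFlow 1.x calls already shown (tf.constant) plus the tensor's get_shape() method:

    import tensorflow as tf

    # A rank three tensor: two matrices, each with two rows and two columns.
    sample = tf.constant([[[1, 2], [2, 3]], [[3, 4], [5, 6]]])

    # get_shape() reports the static shape without running the graph.
    print(sample.get_shape())   # prints (2, 2, 2)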
In the following example, we create some numpy arrays and do some basic math with them:

    import tensorflow as tf
    import numpy as np

    x = tf.constant(np.random.rand(32).astype(np.float32))
    y = tf.constant([1, 2, 3])
    x
    y

Output:

    <tf.Tensor 'Const_2:0' shape=(3,) dtype=int32>

From numpy to tensors and vice versa

TensorFlow is interoperable with numpy, and normally the eval() function calls will return a numpy object, ready to be worked on with the standard numerical tools. We must note that the tensor object is a symbolic handle for the result of an operation, so it doesn't hold the resulting values of the structures it contains. For this reason, we must run the eval() method to get the actual values, which is equivalent to Session.run(tensor_to_eval). In this example, we build two numpy arrays and convert them to tensors:

    import tensorflow as tf   # we import tensorflow
    import numpy as np        # we import numpy

    sess = tf.Session()       # start a new Session object
    x_data = np.array([[1., 2., 3.], [3., 2., 6.]])   # 2x3 matrix
    x = tf.convert_to_tensor(x_data, dtype=tf.float32)
    print(x)

Output:

    Tensor("Const_3:0", shape=(2, 3), dtype=float32)

Useful method: tf.convert_to_tensor converts Python objects of various types to tensor objects. It accepts tensor objects, numpy arrays, Python lists, and Python scalars.

Getting things done - interacting with TensorFlow

As with the majority of Python's modules, TensorFlow allows the use of Python's interactive console. We can call the Python interpreter (by simply calling python), create a tensor of constant type, and then invoke it again so that the interpreter shows the shape and type of the tensor. We can also use the IPython interpreter, which allows us to employ a format more compatible with notebook-style tools, such as Jupyter.

When running TensorFlow sessions in an interactive manner, it's better to employ the InteractiveSession object. Unlike the normal tf.Session class, the tf.InteractiveSession class installs itself as the default session on construction. So when you try to eval a tensor, or run an operation, it will not be necessary to pass a Session object to indicate which session it refers to.

To summarize, we have learned about tensors, the key data structure in TensorFlow, and the simple operations we can apply to the data. To learn more about the different machine learning techniques and algorithms that can be used to build efficient and powerful projects, you can refer to the book Building Machine Learning Projects with TensorFlow.
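To illustrate the InteractiveSession behaviour described above, here is a minimal sketch that is not part of the excerpt; it assumes the same TensorFlow 1.x API used in the other snippets, and the values are purely illustrative:

    import tensorflow as tf

    sess = tf.InteractiveSession()   # installs itself as the default session
    a = tf.constant([1.0, 2.0, 3.0])
    b = a * 2

    # Because an InteractiveSession is the default, eval() needs no explicit session.
    print(b.eval())                  # [2. 4. 6.]
    sess.close()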

6 index types in PostgreSQL 10 you should know

Sugandha Lahoti
28 Feb 2018
13 min read
This article is an excerpt from the book Mastering PostgreSQL 10, written by Hans-Jürgen Schönig. This book will help you master the capabilities of PostgreSQL 10 to efficiently manage and maintain your database.

In today's post, we will learn about the different index types available for sorting in PostgreSQL and also understand how they function.

What are index types and why you need them

Not every data type can be sorted in a useful way. Just imagine a polygon. How would you sort these objects in a useful way? Sure, you can sort by the area covered, its length, or so on, but doing this won't allow you to actually find them using a geometric search. The solution to the problem is to provide more than just one index type. Each index will serve a special purpose and do exactly what is needed. The following six index types are available (as of PostgreSQL 10.0):

    test=# SELECT * FROM pg_am;
     amname |  amhandler  | amtype
    --------+-------------+--------
     btree  | bthandler   | i
     hash   | hashhandler | i
     gist   | gisthandler | i
     gin    | ginhandler  | i
     spgist | spghandler  | i
     brin   | brinhandler | i
    (6 rows)

A closer look at the 6 index types in PostgreSQL 10

The following sections will outline the purpose of each index type available in PostgreSQL. Note that there are some extensions that can be used on top of what you can see here. Additional index types available on the web are rum, vodka, and, in the future, cognac.

Hash indexes

Hash indexes have been around for many years. The idea is to hash the input value and store it for later lookups. Having hash indexes actually makes sense. However, before PostgreSQL 10.0, it was not advised to use hash indexes because PostgreSQL had no WAL support for them. In PostgreSQL 10.0, this has changed. Hash indexes are now fully logged and are therefore ready for replication and are considered to be 100% crash safe.

Hash indexes are generally a bit larger than b-tree indexes. Suppose you want to index 4 million integer values. A b-tree will need around 90 MB of storage to do this. A hash index will need around 125 MB on disk. The assumption made by many people that a hash is super small on disk is therefore, in many cases, just wrong.

GiST indexes

Generalized Search Tree (GiST) indexes are highly important index types because they are used for a variety of different things. GiST indexes can be used to implement R-tree behavior, and it is even possible for them to act as a b-tree. However, abusing GiST for b-tree indexes is not recommended. Typical use cases for GiST are as follows:

- Range types
- Geometric indexes (for example, used by the highly popular PostGIS extension)
- Fuzzy searching

Understanding how GiST works

To many people, GiST is still a black box. We will now discuss how GiST works internally. Consider the following diagram (source: http://leopard.in.ua/assets/images/postgresql/pg_indexes/pg_indexes2.jpg).

Take a look at the tree. You will see that R1 and R2 are on top. R1 and R2 are the bounding boxes containing everything else. R3, R4, and R5 are contained by R1. R8, R9, and R10 are contained by R3, and so on. A GiST index is therefore hierarchically organized. What you can see in the diagram is that some operations which are not available in b-trees are supported. Some of those operations are overlaps, left of, right of, and so on. The layout of a GiST tree is ideal for geometric indexing.

Extending GiST

Of course, it is also possible to come up with your own operator classes.
The following strategies are supported:

    Operation                     Strategy number
    Strictly left of              1
    Does not extend to right of   2
    Overlaps                      3
    Does not extend to left of    4
    Strictly right of             5
    Same                          6
    Contains                      7
    Contained by                  8
    Does not extend above         9
    Strictly below                10
    Strictly above                11
    Does not extend below         12

If you want to write operator classes for GiST, a couple of support functions have to be provided. In the case of a b-tree, there is only the same function; GiST indexes provide a lot more (the numbers below are the support function numbers):

1. consistent: Determines whether a key satisfies the query qualifier. Internally, strategies are looked up and checked.
2. union: Calculates the union of a set of keys. In the case of numeric values, simply the upper and lower values of a range are computed. It is especially important for geometries.
3. compress: Computes a compressed representation of a key or value.
4. decompress: The counterpart of the compress function.
5. penalty: During insertion, the cost of inserting into the tree will be calculated. The cost determines where the new entry will go inside the tree. Therefore, a good penalty function is key to the good overall performance of the index.
6. picksplit: Determines where to move entries in case of a page split. Some entries have to stay on the old page while others will go to the new page being created. Having a good picksplit function is essential to good index performance.
7. equal: Similar to the same function you have already seen in b-trees.
8. distance: Calculates the distance (a number) between a key and the query value. The distance function is optional and is needed in case KNN search is supported.
9. fetch: Determines the original representation of a compressed key. This function is needed to handle index-only scans, as supported by recent versions of PostgreSQL.

Implementing operator classes for GiST indexes is usually done in C. If you are interested in a good example, I advise you to check out the btree_gist module in the contrib directory. It shows how to index standard data types using GiST and is a good source of information as well as inspiration.

GIN indexes

Generalized inverted (GIN) indexes are a good way to index text. Suppose you want to index a million text documents. A certain word may occur millions of times. In a normal b-tree, this would mean that the key is stored millions of times. Not so in a GIN. Each key (or word) is stored once and assigned to a document list. Keys are organized in a standard b-tree. Each entry will have a document list pointing to all entries in the table having the same key. A GIN index is very small and compact. However, it lacks an important feature found in b-trees: sorted data. In a GIN, the list of item pointers associated with a certain key is sorted by the position of the row in the table and not by some arbitrary criteria.

Extending GIN

Just like any other index, GIN can be extended. The following strategies are available:

    Operation         Strategy number
    Overlap           1
    Contains          2
    Is contained by   3
    Equal             4

On top of this, the following support functions are available:

1. compare: Similar to the same function you have seen in b-trees. If two keys are compared, it returns -1 (lower), 0 (equal), or 1 (higher).
2. extractValue: Extracts keys from a value to be indexed. A value can have many keys. For example, a text value might consist of more than one word.
3. extractQuery: Extracts keys from a query condition.
4. consistent: Checks whether a value matches a query condition.
5. comparePartial: Compares a partial key from a query and a key from the index. Returns -1, 0, or 1 (similar to the same function supported by b-trees).
6. triConsistent: Determines whether a value matches a query condition (ternary variant). It is optional if the consistent function is present.

If you are looking for a good example of how to extend GIN, consider looking at the btree_gin module in the PostgreSQL contrib directory. It is a valuable source of information and a good way to start your own implementation.

SP-GiST indexes

Space-partitioned GiST (SP-GiST) has mainly been designed for in-memory use. The reason for this is that an SP-GiST index stored on disk needs a fairly high number of disk hits to function. Disk hits are way more expensive than just following a couple of pointers in RAM. The beauty is that SP-GiST can be used to implement various types of trees such as quad-trees, k-d trees, and radix trees (tries). The following strategies are provided:

    Operation           Strategy number
    Strictly left of    1
    Strictly right of   5
    Same                6
    Contained by        8
    Strictly below      10
    Strictly above      11

To write your own operator classes for SP-GiST, a couple of functions have to be provided:

1. config: Provides information about the operator class in use.
2. choose: Figures out how to insert a new value into an inner tuple.
3. picksplit: Figures out how to partition/split a set of values.
4. inner_consistent: Determines which subpartitions need to be searched for a query.
5. leaf_consistent: Determines whether a key satisfies the query qualifier.

BRIN indexes

Block range indexes (BRIN) are of great practical use. All indexes discussed until now need quite a lot of disk space. Although a lot of work has gone into shrinking GIN indexes and the like, they still need quite a lot because an index pointer is needed for each entry. So, if there are 10 million entries, there will be 10 million index pointers. Space is the main concern addressed by BRIN indexes. A BRIN index does not keep an index entry for each tuple but will store the minimum and the maximum value of 128 (by default) blocks of data (1 MB). The index is therefore very small but lossy. Scanning the index will return more data than we asked for, and PostgreSQL has to filter out these additional rows in a later step.

The following example demonstrates how small a BRIN index really is:

    test=# CREATE INDEX idx_brin ON t_test USING brin(id);
    CREATE INDEX
    test=# \di+ idx_brin
                          List of relations
     Schema |   Name   | Type  | Owner | Table  | Size
    --------+----------+-------+-------+--------+-------
     public | idx_brin | index | hs    | t_test | 48 kB
    (1 row)

In my example, the BRIN index is 2,000 times smaller than a standard b-tree. The question naturally arising now is, why don't we always use BRIN indexes? To answer this kind of question, it is important to reflect on the layout of BRIN: the minimum and maximum value for each 1 MB of data are stored. If the data is sorted (high correlation), BRIN is pretty efficient, because we can fetch 1 MB of data, scan it, and we are done. However, what if the data is shuffled? In this case, BRIN won't be able to exclude chunks of data anymore, because it is very likely that something close to the overall high and the overall low is within any 1 MB of data. Therefore, BRIN is mostly made for highly correlated data. In reality, correlated data is quite likely in data warehousing applications.
Often, data is loaded every day, and therefore dates can be highly correlated.

Extending BRIN indexes

BRIN supports the same strategies as a b-tree and therefore needs the same set of operators. The code can be reused nicely:

    Operation               Strategy number
    Less than               1
    Less than or equal      2
    Equal                   3
    Greater than or equal   4
    Greater than            5

The support functions needed by BRIN are as follows:

1. opcInfo: Provides internal information about the indexed columns.
2. add_value: Adds an entry to an existing summary tuple.
3. consistent: Checks whether a value matches a condition.
4. union: Calculates the union of two summary entries (minimum/maximum values).

Adding additional indexes

Since PostgreSQL 9.6, there has been an easy way to deploy entirely new index types as extensions. This is pretty cool because, if the index types provided by PostgreSQL are not enough, it is possible to add additional ones serving precisely your purpose. The instruction to do this is CREATE ACCESS METHOD:

    test=# \h CREATE ACCESS METHOD
    Command:     CREATE ACCESS METHOD
    Description: define a new access method
    Syntax:
    CREATE ACCESS METHOD name
        TYPE access_method_type
        HANDLER handler_function

Don't worry too much about this command; just in case you ever deploy your own index type, it will come as a ready-to-use extension.

One of these extensions implements bloom filters. Bloom filters are probabilistic data structures. They sometimes return too many rows, but never too few. Therefore, a bloom filter is a good method to pre-filter data. How does it work? A bloom filter is defined on a couple of columns. A bitmask is calculated based on the input values, which is then compared to your query. The upside of a bloom filter is that you can index as many columns as you want. The downside is that the entire bloom filter has to be read. Of course, the bloom filter is smaller than the underlying data, and so it is, in many cases, very beneficial.

To use bloom filters, just activate the extension, which is a part of the PostgreSQL contrib package:

    test=# CREATE EXTENSION bloom;
    CREATE EXTENSION

As stated previously, the idea behind a bloom filter is that it allows you to index as many columns as you want. In many real-world applications, the challenge is to index many columns without knowing which combinations the user will actually need at runtime. In the case of a large table, it is totally impossible to create standard b-tree indexes on, say, 80 fields or more. A bloom filter might be an alternative in this case:

    test=# CREATE TABLE t_bloom (x1 int, x2 int, x3 int, x4 int, x5 int, x6 int, x7 int);
    CREATE TABLE

Creating the index is easy:

    test=# CREATE INDEX idx_bloom ON t_bloom USING bloom(x1, x2, x3, x4, x5, x6, x7);
    CREATE INDEX

If sequential scans are turned off, the index can be seen in action:

    test=# SET enable_seqscan TO off;
    SET
    test=# explain SELECT * FROM t_bloom WHERE x5 = 9 AND x3 = 7;
                                    QUERY PLAN
    -------------------------------------------------------------------------
     Bitmap Heap Scan on t_bloom  (cost=18.50..22.52 rows=1 width=28)
       Recheck Cond: ((x3 = 7) AND (x5 = 9))
       ->  Bitmap Index Scan on idx_bloom  (cost=0.00..18.50 rows=1 width=0)
             Index Cond: ((x3 = 7) AND (x5 = 9))

Note that I have queried a combination of random columns; they are not related to the actual order in the index. The bloom filter will still be beneficial. If you are interested in bloom filters, consider checking out https://en.wikipedia.org/wiki/Bloom_filter.
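To make the bitmask idea concrete, here is a small illustrative Python sketch of a generic bloom filter. This is not PostgreSQL's implementation and is not part of the excerpt; it only demonstrates the general data structure the bloom extension is based on, and the sizes and hash choices are arbitrary:

    import hashlib

    class BloomFilter:
        """A tiny illustrative bloom filter: k hash functions set bits in a fixed-size bitmask."""

        def __init__(self, size_bits=1024, num_hashes=3):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = 0  # the bitmask, stored as one big integer

        def _positions(self, value):
            # Derive num_hashes bit positions from the value.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, value):
            for pos in self._positions(value):
                self.bits |= 1 << pos

        def might_contain(self, value):
            # False positives are possible, false negatives are not.
            return all(self.bits & (1 << pos) for pos in self._positions(value))

    bf = BloomFilter()
    bf.add("row-42")
    print(bf.might_contain("row-42"))   # True
    print(bf.might_contain("row-99"))   # usually False (may be a false positive)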
We learned how to use the indexing features in PostgreSQL and fine-tune the performance of our queries. If you liked our article, check out the book Mastering PostgreSQL 10 to implement advanced administrative tasks such as server maintenance and monitoring, replication, recovery, and high availability in PostgreSQL 10.

How to perform Numeric Metric Aggregations with Elasticsearch

Pravin Dhandre
22 Feb 2018
7 min read
This article is an excerpt from the book Learning Elastic Stack 6.0, written by Pranav Shukla and Sharath Kumar M N. This book provides detailed coverage of the fundamentals of each component of the Elastic Stack, making it easy to search, analyze, and visualize data across different sources in real time.

Today, we are going to demonstrate how to run numeric and statistical queries such as summation, average, count, and similar metric aggregations on the Elastic Stack to build a better analytics engine on your dataset.

Metric aggregations

Metric aggregations work with numeric data, computing one or more aggregate metrics within the given context. The context could be a query, a filter, or no query at all, to include the whole index/type. Metric aggregations can also be nested inside other bucket aggregations. In this case, these metrics will be computed for each bucket in the bucket aggregations.

We will start with simple metric aggregations, without nesting them inside bucket aggregations. When we learn about bucket aggregations later in the chapter, we will also learn how to use metric aggregations inside bucket aggregations. We will learn about the following metric aggregations:

- Sum, average, min, and max aggregations
- Stats and extended stats aggregations
- Cardinality aggregation

Let us learn about them one by one.

Sum, average, min, and max aggregations

Finding the sum of a field, the minimum value for a field, the maximum value for a field, or an average are very common operations. For people who are familiar with SQL, the query to find the sum would look like the following:

    SELECT sum(downloadTotal) FROM usageReport;

The preceding query will calculate the sum of the downloadTotal field across all records in the table. This requires going through all records of the table, or all records in the given context, and adding the values of the given fields. In Elasticsearch, a similar query can be written using the sum aggregation. Let us understand the sum aggregation first.

Sum aggregation

Here is how to write a simple sum aggregation:

    GET bigginsight/_search
    {
      "aggregations": {                  (1)
        "download_sum": {                (2)
          "sum": {                       (3)
            "field": "downloadTotal"     (4)
          }
        }
      },
      "size": 0                          (5)
    }

1. The aggs or aggregations element at the top level should wrap any aggregation.
2. Give a name to the aggregation; here we are doing the sum aggregation on the downloadTotal field and hence the name we chose is download_sum. You can name it anything. This name will be useful while looking up this particular aggregation's result in the response.
3. We are doing a sum aggregation, hence the sum element.
4. We want to do the aggregation on the downloadTotal field.
5. Specify size = 0 to prevent raw search results from being returned. We just want the aggregation results and not the search results in this case. Since we haven't specified any top-level query elements, it matches all documents. We do not want any raw documents (or search hits) in the result.

The response should look like the following:

    {
      "took": 92,
      ...
      "hits": {
        "total": 242836,       (1)
        "max_score": 0,
        "hits": []
      },
      "aggregations": {        (2)
        "download_sum": {      (3)
          "value": 2197438700  (4)
        }
      }
    }

Let us understand the key aspects of the response. The key parts are numbered 1, 2, 3, and so on, and are explained in the following points:

1. The hits.total element shows the number of documents that were considered or were in the context of the query. If there was no additional query or filter specified, it will include all documents in the type or index.
2. Just like the request, this response is wrapped inside an aggregations element to indicate it as such.
3. The aggregation requested by us was named download_sum; hence we get our response from the sum aggregation inside an element with the same name.
4. This is the actual value after applying the sum aggregation.

The average, min, and max aggregations are very similar. Let's look at them briefly.

Average aggregation

The average aggregation finds the average across all documents in the querying context:

    GET bigginsight/_search
    {
      "aggregations": {
        "download_average": {    (1)
          "avg": {               (2)
            "field": "downloadTotal"
          }
        }
      },
      "size": 0
    }

The only notable differences from the sum aggregation are as follows:

1. We chose a different name, download_average, to make it apparent that the aggregation is trying to compute the average.
2. The type of aggregation that we are doing is avg, instead of the sum aggregation that we were doing earlier.

The response structure is identical, but the value field will now represent the average of the requested field. The min and max aggregations are exactly the same.

Min aggregation

Here is how we will find the minimum value of the downloadTotal field in the entire index/type:

    GET bigginsight/_search
    {
      "aggregations": {
        "download_min": {
          "min": {
            "field": "downloadTotal"
          }
        }
      },
      "size": 0
    }

Let's finally look at the max aggregation also.

Max aggregation

Here is how we will find the maximum value of the downloadTotal field in the entire index/type:

    GET bigginsight/_search
    {
      "aggregations": {
        "download_max": {
          "max": {
            "field": "downloadTotal"
          }
        }
      },
      "size": 0
    }

These aggregations were really simple. Now let's look at some more advanced, yet still simple, stats and extended stats aggregations.

Stats and extended stats aggregations

These aggregations compute some common statistics in a single request, without having to issue multiple requests. This saves resources on the Elasticsearch side as well, because the statistics are computed in a single pass rather than being requested multiple times. The client code also becomes simpler if you are interested in more than one of these statistics. Let's look at the stats aggregation first.

Stats aggregation

The stats aggregation computes the sum, average, min, max, and count of documents in a single pass:

    GET bigginsight/_search
    {
      "aggregations": {
        "download_stats": {
          "stats": {
            "field": "downloadTotal"
          }
        }
      },
      "size": 0
    }

The structure of the stats request is the same as the other metric aggregations we have seen so far, so nothing special is going on here. The response should look like the following:

    {
      "took": 4,
      ...,
      "hits": {
        "total": 242836,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "download_stats": {
          "count": 242835,
          "min": 0,
          "max": 241213,
          "avg": 9049.102065188297,
          "sum": 2197438700
        }
      }
    }

As you can see, the response with the download_stats element contains count, min, max, average, and sum; everything is included in the same response. This is very handy as it reduces the overhead of multiple requests and also simplifies the client code. Let us look at the extended stats aggregation.
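Before that, note that the console requests above map directly onto client libraries as well. As a rough sketch that is not part of the excerpt, the stats aggregation could be issued with the official Python client like this; the host, index name, and field are assumptions carried over from the examples, and the exact client API varies between versions:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    body = {
        "aggregations": {
            "download_stats": {"stats": {"field": "downloadTotal"}}
        },
        "size": 0,
    }

    # The aggregation result comes back under the name we chose, download_stats.
    response = es.search(index="bigginsight", body=body)
    print(response["aggregations"]["download_stats"])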
Extended stats aggregation

The extended stats aggregation returns a few more statistics in addition to the ones returned by the stats aggregation:

    GET bigginsight/_search
    {
      "aggregations": {
        "download_estats": {
          "extended_stats": {
            "field": "downloadTotal"
          }
        }
      },
      "size": 0
    }

The response looks like the following:

    {
      "took": 15,
      "timed_out": false,
      ...,
      "hits": {
        "total": 242836,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "download_estats": {
          "count": 242835,
          "min": 0,
          "max": 241213,
          "avg": 9049.102065188297,
          "sum": 2197438700,
          "sum_of_squares": 133545882701698,
          "variance": 468058704.9782911,
          "std_deviation": 21634.664429528162,
          "std_deviation_bounds": {
            "upper": 52318.43092424462,
            "lower": -34220.22679386803
          }
        }
      }
    }

It also returns the sum of squares, variance, standard deviation, and standard deviation bounds.

Cardinality aggregation

Finding the count of unique elements can be done with the cardinality aggregation. It is similar to finding the result of a query such as the following:

    select count(*) from (select distinct username from usageReport) u;

Finding the cardinality, or the number of unique values, for a specific field is a very common requirement. If you have a click-stream from the different visitors on your website, you may want to find out how many unique visitors you got in a given day, week, or month. Let us understand how to find out the count of unique users for which we have network traffic data:

    GET bigginsight/_search
    {
      "aggregations": {
        "unique_visitors": {
          "cardinality": {
            "field": "username"
          }
        }
      },
      "size": 0
    }

The cardinality aggregation response is just like the other metric aggregations:

    {
      "took": 110,
      ...,
      "hits": {
        "total": 242836,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "unique_visitors": {
          "value": 79
        }
      }
    }

To summarize, we learned how to perform numerous metric aggregations on numeric datasets, and how Elasticsearch can easily be deployed to build powerful analytics applications. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of the Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, metrics, and more.

How to classify emails using deep neural networks after generating TF-IDF

Savia Lobo
21 Feb 2018
9 min read
This article is an excerpt taken from the book Natural Language Processing with Python Cookbook, written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti. This book will teach you how to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more. You will also learn how to analyze sentence structures and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and the application of deep learning techniques.

In this article, you will learn how to use deep neural networks to classify emails into one of 20 pre-trained categories based on the words present in each email. This is a simple model to start with for understanding the subject of deep learning and its applications in NLP.

Getting ready

The 20 newsgroups dataset from scikit-learn has been utilized to illustrate the concept. The number of observations/emails considered for analysis is 18,846 (11,314 train observations and 7,532 test observations), and there are 20 corresponding classes/categories, which are shown in the following:

    >>> from sklearn.datasets import fetch_20newsgroups
    >>> newsgroups_train = fetch_20newsgroups(subset='train')
    >>> newsgroups_test = fetch_20newsgroups(subset='test')
    >>> x_train = newsgroups_train.data
    >>> x_test = newsgroups_test.data
    >>> y_train = newsgroups_train.target
    >>> y_test = newsgroups_test.target
    >>> print ("List of all 20 categories:")
    >>> print (newsgroups_train.target_names)
    >>> print ("\n")
    >>> print ("Sample Email:")
    >>> print (x_train[0])
    >>> print ("Sample Target Category:")
    >>> print (y_train[0])
    >>> print (newsgroups_train.target_names[y_train[0]])

The preceding code prints a sample first data observation and its target class category. From the first observation or email, we can infer that the email is talking about a two-door sports car, which we can manually classify into the autos category, the eighth category in the list (its target value is 7 because indexing starts from 0); this validates our understanding against the actual target class of 7.

How to do it...

Using NLP techniques, we pre-process the data to obtain finalized word vectors to map to the final outcome categories. The major steps involved are:

1. Pre-processing:
   - Removal of punctuation
   - Word tokenization
   - Converting words into lowercase
   - Stop word removal
   - Keeping words of a length of at least 3
   - Stemming words
   - POS tagging
   - Lemmatization of words
2. TF-IDF vector conversion
3. Deep learning model training and testing
4. Model evaluation and results discussion

How it works...

The NLTK package has been utilized for all the pre-processing steps, as it consists of all the necessary NLP functionality under one single roof:

    # Used for pre-processing data
    >>> import nltk
    >>> from nltk.corpus import stopwords
    >>> from nltk.stem import WordNetLemmatizer
    >>> import string
    >>> import pandas as pd
    >>> from nltk import pos_tag
    >>> from nltk.stem import PorterStemmer

The preprocessing function written below consists of all the steps for convenience; however, we will explain each step in turn:

    >>> def preprocessing(text):

The following line of code replaces every standard punctuation character in the text with a space and then collapses the whitespace, re-joining the result into words:

    ...     text2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in text]).split())

The following code tokenizes the sentences into words based on whitespace and puts them together as a list for applying further steps:

    ...     tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]

Converting all the cases (upper, lower, and proper) into lowercase reduces duplicates in the corpus:

    ...     tokens = [word.lower() for word in tokens]

As mentioned earlier, stop words are words that do not carry much weight in understanding the sentence; they are used for connecting words, and so on. We remove them with the following lines of code:

    ...     stopwds = stopwords.words('english')
    ...     tokens = [token for token in tokens if token not in stopwds]

The following code keeps only the words with a length of at least three characters, removing short words that hardly carry any meaning:

    ...     tokens = [word for word in tokens if len(word)>=3]

Stemming is applied to the words using the Porter stemmer, which strips the extra suffixes from the words:

    ...     stemmer = PorterStemmer()
    ...     tokens = [stemmer.stem(word) for word in tokens]

POS tagging is a prerequisite for lemmatization; based on whether a word is a noun, a verb, and so on, lemmatization will reduce it to the root word:

    ...     tagged_corpus = pos_tag(tokens)

The pos_tag function returns the part of speech in four formats for nouns and six formats for verbs: NN (noun, common, singular), NNP (noun, proper, singular), NNPS (noun, proper, plural), NNS (noun, common, plural), VB (verb, base form), VBD (verb, past tense), VBG (verb, present participle), VBN (verb, past participle), VBP (verb, present tense, not third person singular), and VBZ (verb, present tense, third person singular).

    ...     Noun_tags = ['NN','NNP','NNPS','NNS']
    ...     Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
    ...     lemmatizer = WordNetLemmatizer()

The following function, prat_lemmatize, has been created only to handle the mismatch between the tags returned by the pos_tag function and the input values expected by the lemmatize function. If the tag for a word falls under the respective noun or verb tag categories, 'n' or 'v' is applied accordingly in the lemmatize function:

    ...     def prat_lemmatize(token,tag):
    ...         if tag in Noun_tags:
    ...             return lemmatizer.lemmatize(token,'n')
    ...         elif tag in Verb_tags:
    ...             return lemmatizer.lemmatize(token,'v')
    ...         else:
    ...             return lemmatizer.lemmatize(token,'n')

After performing tokenization and applying all the various operations, we need to join the tokens back into a string; the following lines perform exactly that:

    ...     pre_proc_text = " ".join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])
    ...     return pre_proc_text

Applying pre-processing to the train and test data:

    >>> x_train_preprocessed = []
    >>> for i in x_train:
    ...     x_train_preprocessed.append(preprocessing(i))
    >>> x_test_preprocessed = []
    >>> for i in x_test:
    ...     x_test_preprocessed.append(preprocessing(i))

    # building TFIDF vectorizer
    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english',
    ...                              max_features=10000, strip_accents='unicode', norm='l2')
    >>> x_train_2 = vectorizer.fit_transform(x_train_preprocessed).todense()
    >>> x_test_2 = vectorizer.transform(x_test_preprocessed).todense()

After the pre-processing step has been completed, the processed TF-IDF vectors have to be sent to the following deep learning code:

    # Deep Learning modules
    >>> import numpy as np
    >>> from keras.models import Sequential
    >>> from keras.layers.core import Dense, Dropout, Activation
    >>> from keras.optimizers import Adadelta, Adam, RMSprop
    >>> from keras.utils import np_utils

Firing up the preceding Keras imports prints some informational output. Keras has been installed on top of Theano, which eventually runs on Python. A GPU with 6 GB of memory has been used, with additional libraries (cuDNN and CNMeM) installed for four to five times faster execution, at the cost of around 20% of the memory; hence only 80% of the 6 GB of memory is available.

The following code defines the central part of the deep learning model. The code is self-explanatory: the number of classes considered is 20, the batch size is 64, and the number of epochs to train is 20:

    # Definition of hyper parameters
    >>> np.random.seed(1337)
    >>> nb_classes = 20
    >>> batch_size = 64
    >>> nb_epochs = 20

The following code converts the 20 categories into one-hot encoded vectors, in which 20 columns are created and the value against the respective class is given as 1. All other classes are given as 0:

    >>> Y_train = np_utils.to_categorical(y_train, nb_classes)

In the following building blocks of Keras code, three hidden layers (1,000, 500, and 50 neurons respectively) are used, with 50% dropout for each layer and Adam as the optimizer:

    # Deep Layer Model building in Keras
    # del model
    >>> model = Sequential()
    >>> model.add(Dense(1000, input_shape=(10000,)))
    >>> model.add(Activation('relu'))
    >>> model.add(Dropout(0.5))
    >>> model.add(Dense(500))
    >>> model.add(Activation('relu'))
    >>> model.add(Dropout(0.5))
    >>> model.add(Dense(50))
    >>> model.add(Activation('relu'))
    >>> model.add(Dropout(0.5))
    >>> model.add(Dense(nb_classes))
    >>> model.add(Activation('softmax'))
    >>> model.compile(loss='categorical_crossentropy', optimizer='adam')
    >>> print (model.summary())

The printed architecture describes the flow of the data, starting from an input of 10,000 features, through layers of 1,000, 500, 50, and finally 20 neurons that classify the given email into one of the 20 categories. The model is trained as per the given settings:

    # Model Training
    >>> model.fit(x_train_2, Y_train, batch_size=batch_size, epochs=nb_epochs, verbose=1)

The model has been fitted with 20 epochs, in which each epoch took about 2 seconds. The loss has been minimized from 1.9281 to 0.0241.
If CPU hardware is used instead, the time required for training each epoch will increase, because a GPU massively parallelizes the computation across thousands of threads/cores.

Finally, predictions are made on the train and test datasets to determine the accuracy, precision, and recall values:

    # Model Prediction
    >>> y_train_predclass = model.predict_classes(x_train_2, batch_size=batch_size)
    >>> y_test_predclass = model.predict_classes(x_test_2, batch_size=batch_size)
    >>> from sklearn.metrics import accuracy_score, classification_report
    >>> print ("\n\nDeep Neural Network - Train accuracy:"), (round(accuracy_score(y_train, y_train_predclass), 3))
    >>> print ("\nDeep Neural Network - Test accuracy:"), (round(accuracy_score(y_test, y_test_predclass), 3))
    >>> print ("\nDeep Neural Network - Train Classification Report")
    >>> print (classification_report(y_train, y_train_predclass))
    >>> print ("\nDeep Neural Network - Test Classification Report")
    >>> print (classification_report(y_test, y_test_predclass))

It appears that the classifier is giving a good 99.9% accuracy on the train dataset and 80.7% on the test dataset.

We learned how to classify emails using deep neural networks (DNNs) after generating TF-IDF vectors. If you found this post useful, do check out the book Natural Language Processing with Python Cookbook to further analyze sentence structures and apply various deep learning techniques.

Building Motion Charts with Tableau

Ashwin Nair
31 Oct 2017
4 min read
The following is an excerpt from the book Tableau 10 Bootcamp, Chapter 2, Interactivity, written by Joshua N. Milligan and Donabel Santos. It offers intensive training on data visualization and dashboarding with Tableau 10. In this article, we will learn how to build motion charts with Tableau.

Tableau is an amazing platform for achieving incredible data discovery, analysis, and storytelling. It allows you to build fully interactive dashboards and stories with your visualizations and insights so that you can share the data story with others.

Creating Motion Charts with Tableau

Let's learn how to build motion charts with Tableau. A motion chart, as its name suggests, is a chart that displays the entire trail of changes in data over time by showing movement using the X and Y axes. It is very similar to the doodles in our notebooks, which seem to come to life when we flip through the pages. It is amazing to see the same kind of movement in action in Tableau using the Pages shelf. It is work that feels like play.

On the Pages shelf, when you drop a field, Tableau creates a sequence of pages that filters the view for each value in that field. Tableau's page control allows us to flip pages, enabling us to see our view come to life. With three predefined speed settings, from slowest to fastest, we can control the speed of the flip. We can also format the marks and show the marks, or trails, or both, using the page control.

In our viz, we have used a circle for marking each year. The circle that moves to a new position each year represents the specific country's new value for that year. These circles are all connected by trail lines, which enable us to simulate a moving time series graph by setting both the mark and trail histories to show in the page control.

Let's create an animated motion chart showing the change in CO2 emissions over the years for a selected few countries:

1. Open the Motion Chart worksheet and connect to the CO2 (Worldbank) data source.
2. Open Dimensions and drag Year to the Columns shelf.
3. Open Measures and drag CO2 Emission to the Rows shelf.
4. Right-click on the CO2 Emission axis, and change the title to CO2 Emission (metric tons per capita).
5. In the Marks card, click on the dropdown to change the mark from Automatic to Circle.
6. Open Dimensions and drag Country Name to Color in the Marks card. Also, drag Country Name to the Filter shelf from Dimensions.
7. Under the General tab of the Filter window, while the Select from list radio button is selected, select None.
8. Select the Custom value list radio button, still under the General tab, and add China, Trinidad and Tobago, and United States.
9. Click OK when done. This should close the Filter window.
10. Open Dimensions and drag Year to Pages to add a page control to the view.
11. Click on the Show history checkbox to select it.
12. Click on the drop-down beside Show history and perform the following steps:
    - Select All for "Marks to show history for".
    - Select Both for "Show".
13. Using the Year page control, click on the forward arrow to play. This shows the change for the three selected countries over the years.
Tip: In case you ever want to loop the animation, you can click on the dropdown at the top-right of your page control card and select Loop Playback.

Note that Tableau Server does not support the animation effect that you see when working on motion charts with Tableau Desktop. Tableau strives for a zero footprint when serving charts and dashboards on the server, so that no additional download is needed to enable the functionality; as a result, the play control does not work the same way. No need to fret, though: you can click manually on the slider and get a similar effect.

If you liked the above excerpt from the book Tableau 10 Bootcamp, check out the book to learn more data visualization techniques.

How to handle missing data in IBM SPSS Modeler

Amey Varangaonkar
21 Feb 2018
8 min read
The following excerpt is taken from the book IBM SPSS Modeler Essentials, written by Keith McCormick and Jesus Salcedo. This book gets you up and running with the fundamentals of SPSS Modeler, a premium tool for data mining and predictive analytics.

In today's tutorial we will demonstrate how easy it is to work with missing values in a dataset using SPSS Modeler. Missing data is different from other topics in data modeling in that you cannot simply choose to ignore it: failing to make a choice just means you are using the default option for a procedure, which most of the time is not optimal. In fact, it is important to remember that every model deals with missing data in a certain way, and some modeling techniques handle missing data better than others. In SPSS Modeler, there are four types of missing data:

- $Null$ value: Applies only to numeric fields. This is a cell that is empty or has an illegal value.
- White space: Applies only to string fields. This is a cell that is empty or has spaces.
- Empty string: Applies only to string fields. This is a cell that is empty. Empty string is a subset of white space.
- Blank value: This is a predefined code, and it applies to any type of field.

The first step in dealing with missing data is to assess the type and amount of missing data for each field. Consider whether there is a pattern as to why data might be missing. This can help determine if missing values could have affected responses. Only then can we decide how to handle it. There are two problems associated with missing data, and these affect the quantity and quality of the data:

- Missing data reduces sample size (quantity)
- Responders may be different from non-responders (quality; there could be biased results)

Ways to address missing data

There are three ways to address missing data:

- Remove fields
- Remove cases
- Impute missing values

It can be necessary at times to remove fields with a large proportion of missing values. The easiest way to remove fields is to use a Filter node (discussed later in the book); however, you can also use the Data Audit node to do this.

Note that in some cases missing data can be predictive of behavior, so it is important to assess the importance of a variable before removing a field.

In some situations, it may be necessary to remove cases instead of fields. For example, you may be developing a predictive model to predict customers' purchasing behavior and you simply do not have enough information concerning new customers. The easiest way to remove cases would be to use a Select node (discussed in the next chapter); however, you can also use the Data Audit node to do this.

Imputing missing values implies replacing values for fields. However, some people do not estimate values for categorical fields because it does not seem right. In general, it is easier to estimate missing values for numeric fields, such as age, where analysts will often use the mean, median, or mode.

Note that it is not a good idea to estimate missing data if you are missing a large percentage of information for that field, because estimates will not be accurate. Typically, we try not to impute more than 5% of values.

To close out of the Data Audit node, click OK to return to the stream canvas.

Defining missing values in the Type node

When working with missing data, the first thing you need to do is define the missing data so that Modeler knows there is missing data; otherwise Modeler will think that the missing data is just another value for a field (which, in some situations, it is, as in our dataset, but quite often this is not the case). Although the Data Audit node provides a report of missing values, blank values need to be defined within a Type node (or the Type tab of a source node) in order for these to be identified by the Data Audit node. The Type tab (or node) is the only place where users can define missing values (the Missing column).

Note that in the Type node, blank values and $null$ values are not shown; however, empty strings and white space are depicted by "" or " ".

To define blank values:

1. Edit the Var.File node.
2. Click on the Types tab.
3. Click on the Missing cell for the field Region.
4. Select Specify in the Missing column.
5. Click Define blanks.

Selecting Define blanks chooses Null and White space (remember, Empty String is a subset of White space, so it is also selected), and in this way these types of missing data are specified. To specify a predefined code, or blank value, you can add each individual value to a separate cell in the Missing values area, or you can enter a range of numeric values if they are consecutive.

6. Type "Not applicable" in the first Missing values cell.
7. Hit Enter.

We have now specified that "Not applicable" is a code for missing data for the field Region.

8. Click OK. In our dataset, we will only define one field as having missing data.
9. Click on the Clear Values button.
10. Click on the Read Values button.

The asterisk indicates that missing values have been defined for the field Region. Now Not applicable is no longer considered a valid value for the field Region, but it will still be shown in graphs and other output. However, models will now treat the category Not applicable as a missing value.

11. Click OK.

Imputing missing values with the Data Audit node

As we have seen, the Data Audit node allows you to identify missing values so that you can get a sense of how much missing data you have. However, the Data Audit node also allows you to remove fields or cases that have missing data, as well as providing several options for data imputation:

1. Rerun the Data Audit node. Note that the field Region only has 15,774 valid cases now, because we have correctly identified that the Not applicable category was a predefined code for missing data.
2. Click on the Quality tab.

We are not going to impute any missing values in this example because it is not necessary, but we are going to show you some of the options, since these will be useful in other situations. To impute missing values, you first need to specify when you want to impute missing values. For example:

3. Click in the Impute when cell for the field Region.
4. Select Blank & Null Values.

Now you need to specify how the missing values will be imputed.

5. Click in the Impute Method cell for the field Region.
6. Select Specify.

In this dialog box, you can specify which imputation method you want to use, and once you have chosen a method, you can then further specify details about the imputation. There are several imputation methods:

- Fixed uses the same value for all cases. This fixed value can be a constant, the mode, the mean, or a midpoint of the range (the options will vary depending on the measurement level of the field).
- Random uses a random (different) value based on a normal or uniform distribution. This allows for there to be variation for the field with imputed values.
- Expression allows you to create your own equation to specify missing values.
- Algorithm uses a value predicted by a C&R Tree model.

We are not going to impute any values now, so click Cancel. If we had selected an imputation method, we would then:

1. Click on the field Region to select it.
2. Click on the Generate menu.

The Generate menu of the Data Audit node allows you to remove fields, remove cases, or impute missing values, explained as follows:

- Missing Values Filter Node: This removes fields with too much missing data, or keeps fields with missing data so that you can investigate them further.
- Missing Values Select Node: This removes cases with missing data, or keeps cases with missing data so that you can investigate them further.
- Missing Values SuperNode: This imputes missing values.

If we were going to impute values, we would then click Missing Values SuperNode. In this way you can impute missing values using SPSS Modeler, and it makes your analysis a lot easier.

If you found our post useful, make sure to check out our book IBM SPSS Modeler Essentials, for more information on data mining and generating hidden insights using the popular SPSS Modeler tool.

How to query sharded data in MongoDB

Amey Varangaonkar
26 Feb 2018
6 min read
The following excerpt is taken from the book Mastering MongoDB 3.x, written by Alex Giamas. This book covers essential as well as advanced administration concepts in MongoDB.

Querying data in a sharded MongoDB cluster is different from querying a single-server deployment or a replica set. Instead of connecting to the single server or to the primary of the replica set, we connect to the mongos router, which decides which shard to ask for our data. In this article, we will explore how the MongoDB query router operates and show how the process is similar to working with a replica set, using Ruby.

The query router

The query router, also known as the mongos process, acts as the interface and entry point to our MongoDB cluster. Applications connect to it instead of connecting to the underlying shards and replica sets; mongos executes queries, gathers results, and passes them to our application. mongos doesn't hold any persistent state, is typically light on system resources, and is usually hosted on the same instance as the application server. It acts as a proxy for requests. When a query comes in, mongos will examine it, decide which shards need to execute the query, and establish a cursor in each one of them.

The find operation

If our query includes the shard key or a prefix of the shard key, mongos will perform a targeted operation, only querying the shards that hold the keys that we are looking for. For example, with a composite shard key of {_id, email, address} on our User collection, we can have a targeted operation with any of the following queries:

    > db.User.find({_id: 1})
    > db.User.find({_id: 1, email: '[email protected]'})
    > db.User.find({_id: 1, email: '[email protected]', address: 'Linwood Dunn'})

All three of them use either a prefix of the shard key (the first two) or the complete shard key.

On the other hand, a query on {email, address} or {address} will not be able to target the right shards, resulting in a broadcast operation. A broadcast operation is any operation that doesn't include the shard key or a prefix of the shard key, and it results in mongos querying every shard and gathering results from them. It's also known as a scatter-and-gather operation or a fanout query.

Sort/limit/skip operations

If we want to sort our results, there are two options:

- If we are using the shard key in our sort criteria, then mongos can determine the order in which it has to query the shard or shards. This results in an efficient and, again, targeted operation.
- If we are not using the shard key in our sort criteria, then, as with a query without a sort, it's going to be a fanout query. To sort the results when we are not using the shard key, the primary shard executes a distributed merge sort locally before passing the sorted result set on to mongos.

A limit on queries is enforced on each individual shard and then again at the mongos level, as there may be results from multiple shards. Skip, on the other hand, cannot be passed on to individual shards and will be applied by mongos after retrieving all the results locally.

Update/remove operations

In document modifier operations like update and remove, we have a similar situation to find. If we have the shard key in the find section of the modifier, then mongos can direct the query to the relevant shard. If we don't have the shard key in the find section, then it will again be a fanout operation.
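To make the targeted-versus-fanout distinction concrete from application code, here is a minimal sketch that is not part of the excerpt. It uses the Python driver (pymongo) rather than the Ruby driver shown later; the connection string, database name, and field values are assumptions:

    from pymongo import MongoClient

    # Connect to the mongos routers, never directly to the shards (hosts are assumptions).
    client = MongoClient("mongodb://mongos-server1:27017,mongos-server2:27017/")
    users = client["mydb"]["User"]

    # Includes a prefix of the {_id, email, address} shard key: targeted to the relevant shard(s).
    targeted = users.find({"_id": 1})

    # No shard key (or prefix) in the filter: mongos must scatter the query to every shard.
    fanout = users.find({"address": "Linwood Dunn"})

    print(list(targeted))
    print(list(fanout))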
In essence, we have the following cases for operations with sharding:

Type of operation                         Query topology
insert                                    Must have the shard key
update                                    Can have the shard key
Query with shard key                      Targeted operation
Query without shard key                   Scatter-gather/fanout query
Indexed/sorted query with shard key       Targeted operation
Indexed/sorted query without shard key    Distributed sort merge

Querying using Ruby

Connecting to a sharded cluster using Ruby is no different from connecting to a replica set. Using the official Ruby driver, we have to configure the client object to define the set of mongos servers:

client = Mongo::Client.new('mongodb://key:password@mongos-server1-host:mongos-server1-port,mongos-server2-host:mongos-server2-port/admin?ssl=true&authSource=admin')

The mongo-ruby-driver will then return a client object that is no different from the one we get when connecting to a replica set from the Mongo Ruby client. We can then use the client object as we did in previous chapters, with all the caveats around how sharding behaves differently from a standalone server or a replica set with regard to querying and performance.

Performance comparison with replica sets

Developers and architects are always looking for ways to compare performance between replica sets and sharded configurations. MongoDB implements sharding on top of replica sets: every shard in production should be a replica set. The main difference in performance comes from fanout queries. When we are querying without the shard key, MongoDB's execution time is limited by the worst-performing replica set. In addition, when sorting without the shard key, the primary server has to perform the distributed merge sort on the entire dataset. This means that it has to collect all data from the different shards, merge-sort it, and pass it on, sorted, to mongos. In both cases, network latency and bandwidth limitations can slow down operations compared to a replica set. On the flip side, by having three shards, we can distribute our working set requirements across different nodes, thus serving results from RAM instead of reaching out to the underlying storage, HDD or SSD. Writes can also be sped up significantly, since we are no longer bound by a single node's I/O capacity; we can write to as many nodes as there are shards. To sum up, in most cases, and especially when we are using the shard key, both queries and modification operations will be significantly sped up by sharding.

If you found this post useful, check out our book Mastering MongoDB 3.x for more tips and techniques on sharding, replication and other database administration tasks related to MongoDB.
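Finally, as a brief illustration of the Ruby workflow described in this article, the client object returned for a sharded cluster is used exactly like the one for a replica set. The snippet below is a hypothetical sketch: the hosts, credentials, database, and field values are made up, and it assumes the official mongo gem is installed.

require 'mongo'

# Point the client at the mongos routers (hosts and database are illustrative)
client = Mongo::Client.new('mongodb://user:password@mongos-host1:27017,mongos-host2:27017/mydb')
users  = client[:User]

users.find(_id: 1).first                  # targeted read when _id is a prefix of the shard key
users.find(address: 'Linwood Dunn').to_a  # scatter-gather read without the shard key
client.close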

article-image-5-reasons-government-should-regulate-technology
Richard Gall
17 Jul 2018
6 min read
Save for later

5 reasons government should regulate technology

Microsoft's Brad Smith made the unprecedented move last week of calling for government to regulate facial recognition technology. In an industry that has resisted government intervention, it was a bold yet humble step. It was a way of saying "we can't deal with this on our own." There will certainly be people who disagree with Brad Smith. For some, the entrepreneurial spirit that is central to tech and startup culture will only be stifled by regulation. But let's be realistic about where we are at the moment - the technology industry has never faced such a crisis of confidence or been met with such public cynicism. Perhaps government regulation is precisely what we need to move forward. Here are five reasons why government should regulate technology.

Regulation can restore accountability and rebuild trust in tech

We've said it a lot in 2018, but there really is a significant trust deficit in technology at the moment. From the Cambridge Analytica scandal to AI bias, software has been making headlines in a way it never has before. This only cultivates a culture of cynicism across the public. And with talk of automation and job losses, it paints a dark picture of the future. It's no wonder that TV series like Black Mirror have such a hold over the public imagination. Of course, when used properly, technology should simply help solve problems - whether that's better consumer tech or improved diagnoses in healthcare. The problem arises when we find that our problem-solving innovations have unintended consequences. By regulating, government can begin to think through some of these unintended consequences. But more importantly, trust can only be rebuilt once there is some degree of accountability within the industry. Think back to Zuckerberg's Congressional hearing earlier this year - while the Facebook chief may have been sweating, the real takeaway was that his power and influence were ultimately untouchable. Whatever mistakes he'd made were just part and parcel of moving fast and breaking things. An apology and a humble shrug might normally pass, but with regulation, things begin to get serious. Misusing user data? We've got a law for that. Potentially earning money from people who want to undermine western democracy? We've got a law for that.

Read next: Is Facebook planning to spy on you through your mobile’s microphones?

Government regulation will make the conversation around the uses and abuses of technology more public

Too much conversation about how and why we build technology is happening in the wrong places. Well, not the wrong places, just not enough places. The biggest decisions about technology are largely made by some of the biggest companies on the planet. All the dreams about a new democratized and open world are all but gone, as the innovations around which we build our lives come from a handful of organizations that have both financial and cultural clout. As Brad Smith argues, tech companies like Microsoft, Google, and Amazon are not the place to be having conversations about the ethical implications of certain technologies. He argues that while it's important for private companies to take more responsibility, it's an "inadequate substitute for decision making by the public and its representatives in a democratic republic." He notes that the commercial dynamics are always going to twist conversations. Companies, after all, are answerable to shareholders - only governments are accountable to the public.
By regulating, the decisions we make (or don't make) about technology immediately enter into public discourse about the kind of societies we want to live in.

Citizens can be better protected by tech regulation...

At present, technology often advances in spite of, not because of, people. For all the talk of human-centered design and putting the customer first, every company that builds software is interested in one thing: making money.

AI in particular can be dangerous for citizens

For example, according to a ProPublica investigation, AI has been used to predict future crimes in the justice system. That's frightening in itself, of course, but it's particularly terrifying when you consider that criminality was falsely predicted at twice the rate for black people as for white people. Even social media filters, in which machine learning serves content based on a user's behavior and profile, present dangers to citizens. They give rise to fake news and dubious political campaigning, making citizens more vulnerable to extreme - and false - ideas. By properly regulating this technology, we should immediately have more transparency over how these systems work. This transparency would not only lead to more accountability in how they are built, it would also ensure that changes can be made when necessary.

Read next: A quick look at E.U.’s pending antitrust case against Google’s Android

...Software engineers need protection too

One group hasn't really been talked about when it comes to government regulation - the people actually building the software. This is a big problem. If we're talking about the ethics of AI, the software engineers building it are left in a vulnerable position, because the lines of accountability are blurred. Without a government framework that supports ethical software decision making, engineers are left in limbo. With more support from government, software engineers can be more confident in challenging decisions from their employers. We need to have a debate about who's responsible for the ethics of code that's written into applications today - is it the engineer? The product manager? Or the organization itself? That isn't going to be easy to answer, but some government regulation or guidance would be a good place to begin.

Regulation can bridge the gap between entrepreneurs, engineers and lawmakers

Times change. Years ago, technology was deployed by lawmakers as a means of control, production or exploration. That's why the military was involved with many of the innovations of the mid-twentieth century. Today, the gap couldn't be bigger. Lawmakers barely understand encryption, let alone how algorithms work. But there is naivety in the business world too. With a little more political nous and even critical thinking, perhaps Mark Zuckerberg could have predicted the Cambridge Analytica scandal. Maybe Elon Musk would be a little more humble in the face of a coordinated rescue mission. There's clearly a problem - on the one hand, some people don't know what's already possible; for others, it's impossible to consider that something that is possible could have unintended consequences. By regulating technology, everyone will have to get to know one another. Government will need to delve deeper into the field, and entrepreneurs and engineers will need to learn more about how regulation may affect them. To some extent, this will have to be the first thing we do - develop a shared language. It might also be the hardest thing to do.

article-image-ms-access-queries-oracle-sql-developer-12-tool
Packt
17 Mar 2010
12 min read
Save for later

MS Access Queries with Oracle SQL Developer 1.2 Tool

In my previous article with the Oracle SQL Developer 1.1, I discussed the installation and features of this stand-alone GUI product which can be used to query several database products. Connecting to an Oracle Xe 10G was also described. The many things you do in Oracle 10G XE can also be carried out with the Oracle SQL Developer. It is expected to enhance productivity in your Oracle applications. You can use Oracle SQL Developer to connect, run, and debug SQL, SQL*Plus and PL/SQL. It can run on at least three different operating systems. Using this tool you can connect to Oracle, SQL Server, MySql and MS Access databases. In this article, you will learn how to install the Oracle SQL Developer 1.2 and connect to an MS Access database. The 1.2 version has several features that were not present in version 1.1 especially regarding Migration from other products. Downloading and installing the Oracle SQL Developer Go to the Oracle site (you need to be registered to download) and after accepting the license agreement you will be able to download sqldeveloper-1.2.2998.zip, a 77MB download if you do not have JDK1.5 already installed. You may place this in any directory. From the unzipped contents, double-click on the SQLDeveloper.exe. The User Interface On a Windows machine, you may get a security warning which you may safely override and click on Run. This opens up the splash window shown in the next picture followed by the Oracle SQL Developer interface shown in the picture that follows. Figure 1 The main window The main window of this tool is shown in the next picture. Figure 2 It has a main menu at the very top where you can access File, Edit, View, Navigate, Run, Debug, Source, Migration, Tools and Help menus. The menu item Migration has been added in this new version. Immediately below the main menu on the left, you have a tabbed window with two tabs, Connections, and Reports. This will be the item you have to contend with since most things start only after establishing a connection. The connection brings with it the various related objects in the databases. View Menu The next picture shows the drop-down of the View main menu, where you can see other details such as links to the debugger, reports, connections, and snippets. In this new version, many more items have been added such as Captured Objects, Converted Objects, and Find DB Object. Figure 3 Snippets are often-used SQL statements or clauses that you may want to insert. You may also save your snippets by clicking on the bright green plus sign in the window shown, which opens up the superposed Save Snippet window. Figure 4 In the Run menu item, you can run files as well as look at the Execution Profile. Debug Menu The debug menu item has all the necessary hooks to toggle break points: step into, step over, step out and step to End of Method, etc., including garbage collection and clean up as shown in the next picture. Figure 5 Tools Menu Tools give access to External Tools that can be launched, Exports both DDL and data, schema diff, etc. as shown in the next picture. Figure 6 Help gives you both full-text search and indexed search. This is an important area which you must visit; you can also update the help. Figure 7 About Menu The About drop-down menu item in the above figure opens up the following window where you have complete information about this product that includes version information, properties, and extensions. 
Figure 8 Migration Menu As mentioned earlier the Migration is a new item in the main menu and its drop-down menu elements are shown in the next picture. It even has a menu item to make a quick migration of all recent versions of MS Access (97, 2000, 2002, and 2003). The Repository Management item is another very useful feature. The MySQL and SQL Server Offline Capture menu item can capture database create scripts from several versions of MySQL and MS SQL Server by locating them on the machine. Figure 9 Connecting to a Microsoft Access Database If you are interested in Oracle 10G XE it will be helpful if you refresh your Oracle 10G XE knowledge or read the several Oracle 10G XE articles whose links are shown on the author’s blog. This is a good place for you to look at new developments, scripts, UI description, etc. This section, however, deals with connecting to an MS Access database. Click on the "Connections" icon with the bright green plus sign as shown in the next figure. Figure 10 This opens up the next window, New/Select Database Connection. This is where you must supply all the information. As you can see it has identified a resident (that is a local Oracle 10G XE server) Oracle 10G XE on the machine. Of course, you need to configure it further. In addition to Oracle, it can connect to MySQL, MS Access, and SQL Server as well. This interface has not changed much from version 1.1; you have the same control elements. Figure 11 On the left-hand side of this window, you will generate the Connection Name and Connection Details once you fill in the appropriate information on the right. Connection name is what you supply; to get connected you need to have a username and password as well. If you want, you can save the password to avoid providing it again and again. At the bottom of the screen, you can save the connection, test it and connect to it. There is also access to online help. In the above window, click on the tab, in the middle of the page, Access. The following window opens in which all you need to do is to use the Browse button to locate the Microsoft Access Database on your machine (windows default for mdb files is My Documents). Figure 12 Hitting the Browse button opens the window, Open with the default location, My Documents—the default directory for MDB files. Figure 13 Choosing a database Charts.mdb and clicking the Open button brings the file pointer to the New / Select Database Connection in the box to the left of the Browse button. When you click on the Test button if the connection is OK you should get an appropriate message. However for the Charts.mdb file you get the following error. Figure 14 The software is complaining about the lack of read access to the system tables. Providing read access to System tables. There are a couple of System tables in MS Access which are usually hidden but can be displayed using Tools option in MS Access. Figure 15 In the View tab if you place a check mark for System objects then you will see the following tables. The System tables are as shown in the red rectangle. Figure 16 If you need to modify the security settings for these tables you can do so as shown in the next figure by following the trail, Tools  Security User and Group permissions. Figure 17 Click on the User and Group Permissions menu item which opens the next window Users and Group Permissions shown here, Figure 18 For the user who is Admin, scroll through each of the system tables and place a check mark for Read Design and Read Data check boxes. 
Click on the OK button and close the application. Now you again use the Browse button to locate the Charts.mdb file after providing a name for the connection at the top of the New / Select Database Connection page. For this tutorial MyCharts was chosen as the name for the connection. Once this file name appears in the box to the left of the Browse button, click on the Test button. This message screen is very fast (appears and disappears). If there is a problem, it will bring up the message as before. Now click on the Connect button at the bottom of the screen in the New / Select Database Connection page window. This immediately adds the connection MyCharts to the Connections folder shown in the left. The + sign can be clicked to expand all the objects in the Charts.mdb database as shown in the next figure. Figure 19 You can further expand the Table nodes to show the data that the table contains as shown in the next figure for the Portfolio table. Figure 20 The Relationships tab in the above figure shows related and referenced objects as shown. This is just a lone table with no relationships established and therefore none showing. Figure 21 It may be noted that the Oracle SQL Developer can only connect to MDB files. It cannot connect to Microsoft Access projects (ADP files), or the new MS Access 2007 file types. Using the SQL Interface SQL Statements are run from the SQL Worksheet which can be displayed by right clicking the connection and choosing Open SQL Worksheet item from the drop-down list as shown. Figure 22 You will type in the SQL queries in area below Enter SQL Statement label in the above figure (now hidden behind the drop-down menu). Making a new connection with more tables and a relationship In order to run a few simple queries on the connected database, three more tables were imported into Charts.mdb after establishing a relationship between the new tables in the access database as shown in the following figure. Figure 23 Another connection named, NewCharts was created in Oracle SQL Developer. The connection string that SQL Developer will take for NewCharts is of the following format (some white spaces were introduced into the connection string shown to get rid of MS Word warnings). @jdbc:odbc: Driver= {Microsoft Access Driver (*.mdb)}; DBQ=C:Documents and SettingsJayMy DocumentsCharts.mdb; DriverID=22;READONLY=false} This string can be reviewed after a connection is established in the New / Select Database Connection window as shown in the next figure. Figure 24 A simple Query Let us look at a very simple query using the PrincetonTemp table. After entering the query you can execute the statement by clicking on the right pointing green arrowhead as shown. The result of running this query will appear directly below the Enter SQL Statement window as shown. Figure 25 Just above the Enter SQL Statement label is the SQL Toolbar displaying several icons (left to right) which are as follows with the associated key board access: Execute SQL Statement(F9) ->Green arrow head Run Script(F5) Commit(F11) Rollback(F12) Cancel(CTRL+Q) SQL History(F8) Execute Explain Plan(F6) Autotrace(F10) Clear(CTRL+D) It also displays time taken to execute ->0.04255061 seconds. The bottom pane is showing the result of the query in a table format. If you would run the same query with the Run Script (F5) button you would see the result in the Script Output tab of the bottom pane. 
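The query itself only appears in the screenshot above. As a purely illustrative sketch of the kind of statement you could type into the SQL Worksheet against the imported PrincetonTemp table (the AvgTemp column used in the filtered variant is hypothetical, so substitute your own column names):

SELECT *
FROM PrincetonTemp;

-- a filtered variant; the AvgTemp column is assumed purely for illustration
SELECT *
FROM PrincetonTemp
WHERE AvgTemp > 30;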
In addition to SQL, you can also use the SQL Worksheet to run SQL*Plus and PL/SQL statements, with some exceptions, as long as they are supported by the provider used in making the connection.

Viewing relationships between tables in the SQL Developer

Three tables, Orders, Order Details, and Products, were imported into Charts.mdb after enforcing referential integrity relationships in the Access database as seen earlier. Will the new connection NewCharts be able to see these relationships? This question is answered in the following figure. Click on any one of these tables and you will see the Data as well as Relationships tabs in the bottom pane as shown.

Figure 26

Now if you click on the Relationships tab for Products you will see the display just showing three empty columns, as seen in an earlier figure. However, if you click on Order Details, which really links the three tables, you will see the following displayed.

Figure 27

Query joining three tables

The Orders, Order Details, and Products tables are related by referential integrity as seen above. The following query, which chooses one or two columns from each table, can be run in a new SQL Worksheet:

Select Products.ProductName, Orders.ShipName, Orders.OrderDate, [Order Details].Quantity
from Products, Orders, [Order Details]
where Orders.OrderID = [Order Details].OrderID
and Products.ProductID = [Order Details].ProductID
and Orders.OrderDate > '12-31-1997'

The result of running this query (only four rows of data shown) can be seen in the next figure.

Figure 28

Note that the syntax must match the syntax required by the ODBC provider; the date has to be #12-31-1997# instead of '12-31-1997'.

Summary

The article described the new version of Oracle's stand-alone GUI SQL Developer tool. It can connect to a couple of databases such as MS Access, SQL Server, Oracle, and MySQL. Its utility could have been far greater had it provided connectivity to ODBC and OLE DB; I am disappointed that it does not, in this version as well. The connection to MS Access seems to bring in not only tables but the other objects except Data Access Pages, but the external applications that you can use are limited to Word, Notepad, IE, etc., but not a Report Viewer. These objects and their utility remain to be explored. Only a limited number of features were explored in this article, and it excluded new features like Migration and the Translation Scratch Editor, which translates MS Access, SQL Server, and MySQL syntaxes to PL/SQL. These will be considered in a future article.

article-image-learning-the-salesforce-analytics-query-language-saql
Amey Varangaonkar
09 Mar 2018
6 min read
Save for later

Learning the Salesforce Analytics Query Language (SAQL)

Salesforce Einstein offers its own query language to retrieve your data from various sources, called Salesforce Analytics Query Language (SAQL). The lenses and dashboards in Einstein use SAQL behind the scenes to manipulate data for meaningful visualizations. In this article, we see how to use Salesforce Analytics Query Language effectively. Using SAQL There are the following three ways to use SAQL in Einstein Analytics: Creating steps/lenses: We can use SAQL while creating a lens or step. It is the easiest way of using SAQL. While creating a step, Einstein Analytics provides the flexibility of switching between modes such as Chart Mode, Table Mode, and SAQL Mode. In this chapter, we will use this method for SAQL. Analytics REST API: Using this API, the user can access the datasets, lenses, dashboards, and so on. This is a programmatic approach and you can send the queries to the Einstein Analytics platform. Einstein Analytics uses the OAuth 2.0 protocol to securely access the platform data. The OAuth protocol is a way of securely authenticating the user without asking them for credentials. The first step to using the Analytics REST API to access Analytics is to authenticate the user using OAuth 2.0. Using Dashboard JSON: We can use SAQL while editing the Dashboard JSON. We have already seen the Dashboard JSON in previous chapters. To access Dashboard JSON, you can open the dashboard in the edit mode and press Ctrl + E. The simplest way of using SAQL is while creating a step or lens. A user can switch between the modes here. To use SAQL for lens, perform the following steps: Navigate to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens. Switch to SAQL Mode by clicking on the icon in the top-right corner, as shown in the following screenshot: In SAQL, the query is made up of multiple statements. In the first statement, the query loads the input data from the dataset, operates on it, and then finally gives the result. The user can use the Run Query button to see the results and errors after changing or adding statements. The user can see the errors at the bottom of the Query editor. SAQL is made up of statements that take the input dataset, and we build our logic on that. We can add filters, groups, orders, and so on, to this dataset to get the desired output. There are certain order rules that need to be followed while creating these statements and those rules are as follows: There can be only one offset in the foreach statement The limit statement must be after offset The offset statement must be after filter and order The order and filter statements can be swapped as there is no rule for them In SAQL, we can perform all the mathematical calculations and comparisons. SAQL also supports arithmetic operators, comparison operators, string operators, and logical operators. Using foreach in SAQL The foreach statement applies the set of expressions to every row, which is called projection. The foreach statement is mandatory to get the output of the query. The following is the syntax for the foreach statement: q = foreach q generate expression as 'expresion name'; Let's look at one example of using the foreach statement: Go to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens. Switch to SAQL Mode by clicking on the icon in the top-right corner. 
In the Query editor you will see the following code: q = load "opportunity"; q = group q by all; q = foreach q generate count() as 'count'; q = limit q 2000; You can see the result of this query just below the Query editor: 4. Now replace the third statement with the following statement: q = foreach q generate sum('Amount') as 'Sum Amount'; 5. Click on the Run Query button and observe the result as shown in the following screenshot: Using grouping in SAQL The user can group records of the same value in one group by using the group statements. Use the following syntax: q = group rows by fieldName Let's see how to use grouping in SAQL by performing the following steps: Replace the second and third statement with the following statement: q = group q by 'StageName'; q = foreach q generate 'StageName' as 'StageName', sum('Amount') as 'Sum Amount'; 2. Click on the Run Query button and you should see the following result: Using filters in SAQL Filters in SAQL behave just like a where clause in SOQL and SQL, filtering the data as per the condition or clause. In Einstein Analytics, it selects the row from the dataset that satisfies the condition added. The syntax for the filter is as follows: q = filter q by fieldName 'Operator' value Click on Run Query and view the result as shown in the following screenshot: Using functions in SAQL The beauty of a function is in its reusability. Once the function is created it can be used multiple times. In SAQL, we can use different types of functions, such as string  functions, math functions, aggregate functions, windowing functions, and so on. These functions are predefined and saved quite a few times. Let's use a math function power. The syntax for the power is power(m, n). The function returns the value of m raised to the nth power. Replace the following statement with the fourth statement: q = foreach q generate 'StageName' as 'StageName', power(sum('Amount'), 1/2) as 'Amount Squareroot', sum('Amount') as 'Sum Amount'; Click on the Run Query button. We saw how to apply different kinds of case-specific functions in Salesforce Einstein to play with data in order to get the desired outcome. [box type="note" align="" class="" width=""]The above excerpt is taken from the book Learning Einstein Analytics, written by Santosh Chitalkar. It covers techniques to set-up and create apps, lenses, and dashboards using Salesforce Einstein Analytics for effective business insights. If you want to know more about these techniques, check out the book Learning Einstein Analytics.[/box]     
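As a closing sketch, the individual statements above can be combined into one query that respects the statement-order rules listed earlier. The dataset and field names are the ones used in this walkthrough, but the filter condition and the exact results are illustrative only:

q = load "opportunity";
q = filter q by 'Amount' > 0;
q = group q by 'StageName';
q = foreach q generate 'StageName' as 'StageName', sum('Amount') as 'Sum Amount';
q = order q by 'Sum Amount' desc;
q = limit q 10;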
Packt
12 Jul 2010
10 min read
Save for later

Understanding ShapeSheet™ in Microsoft Visio 2010

In this article by David J. Parker, author of Microsoft Visio 2010 Business Process Diagramming and Validation, we will discuss Microsoft Visio ShapeSheet™ and the key sections, rows, and cells, along with the functions available for writing ShapeSheet™ formulae, where relevant for structured diagrams. Microsoft Visio is a unique data diagramming system, and most of that uniqueness is due to the power of the ShapeSheet, which is a window on the Visio object model. It is the ShapeSheet that enables you to encapsulate complex behavior into apparently simple shapes by adding formulae to the cells using functions. The ShapeSheet was modeled on a spreadsheet, and formulae are entered in a similar manner to cells in an Excel worksheet. Validation rules are written as quasi-ShapeSheet formulae so you will need to understand how they are written. Validation rules can check the contents of ShapeSheet cells, in addition to verifying the structure of a diagram. Therefore, in this article you will learn about the structure of the ShapeSheet and how to write formulae. Where is the ShapeSheet? There is a ShapeSheet behind every single Document, Page, and Shape, and the easiest way to access the ShapeSheet window is to run Visio in Developer mode. This mode adds the Developer tab to the Fluent UI, which has a Show ShapeSheet button. The drop-down list on the button allows you to choose which ShapeSheet window to open. Alternatively, you can use the right-mouse menu of a shape or page, or on the relevant level within the Drawing Explorer window as shown in the following screenshot: The ShapeSheet window, opened by the Show ShapeSheet menu option, displays the requested sections, rows, and cells of the item selected when the window was opened. It does not automatically change to display the contents of any subsequently selected shape in the Visio drawing page—you must open the ShapeSheet window again to do that. The ShapeSheet Tools tab, which is displayed when the ShapeSheet window is active, has a Sections button on the View group to allow you to vary the requested sections on display. You can also open the View Sections dialog from the right-mouse menu within the ShapeSheet window. You cannot alter the display order of sections in the ShapeSheet window, but you can expand/collapse them by clicking the section header. The syntax for referencing the shape, page, and document objects in ShapeSheet formula is listed in the following table. Object ShapeSheet formula Comment Shape Sheet.n! Where n is the ID of the shape Can be omitted when referring to cells in the same shape. Page.PageSheet ThePage! Used in the ShapeSheet formula of shapes within the page.   Pages[page name]! Used in the ShapeSheet formula of shapes in other pages. Document.DocumentSheet TheDoc! Used in the ShapeSheet formula in pages or shapes of the document. What are sections, rows, and cells? There are a finite number of sections in a ShapeSheet, and some sections are mandatory for the type of element they are, whilst others are optional. For example, the Shape Transform section, which specifies the shape's size (that is, angle and position) exists for all types of shapes. However, the 1-D Endpoints section, which specifies the co-ordinates of either end of the line, is only relevant, and thus displayed for OneD shapes. Neither of these sections is optional, because they are required for the specific type of shape. 
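Before moving on to the optional sections, note that the object prefixes in the reference table above can be combined into ordinary cell formulas. The two lines below are illustrative sketches only (PageWidth is a real cell of every page and Width a real cell of every 2-D shape, but the calculations themselves are invented):

=ThePage!PageWidth*0.25
=Sheet.1!Width*2

The first ties a cell to a quarter of the page width; the second refers to the width of the shape whose ID is 1, following the Sheet.n! syntax from the table.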
Sections like User-defined Cells and Shape Data are optional and they may be added to the ShapeSheet if they do not exist already. If you press the Insert button on the ShapeSheet Tools tab, under the Sections group, then you can see a list of the sections that you may insert into the selected ShapeSheet. In the above example, User-defined Cells option is grayed out because this optional section already exists. It is possible for a shape to have multiple Geometry, Ellipse, or Infinite line sections. In fact, a shape can have a total of 139 of them. Reading a cell's properties If you select a cell in the ShapeSheet, then you will see the formula in the formula edit bar immediately below the ribbon. Move the mouse over the image to enlarge it. You can view the ShapeSheet Formulas (and I thought the plural was formulae!) or Values by clicking the relevant button in the View group on the ShapeSheet Tools ribbon. Notice that Visio provides IntelliSense when editing formulae. This is new in Visio 2010, and is a great help to all ShapeSheet developers. Also notice that the contents of some of the cells are shown in blue text, whilst others are black. This is because the blue text denotes that the values are stored locally with this shape instance, whilst the black text refers to values that are stored in the Master shape. Usually, the more black text you see, the more memory efficient the shape is, since less is needed to be stored with the shape instance. Of course, there are times when you cannot avoid storing values locally, such as the PinX and PinY values in the above screenshot, since these define where the shape instance is in the page. The following VBA code returns 0 (False): ActivePage.Shapes("Task").Cells("PinX").IsInherited But the following code returns -1 (True) : ActivePage.Shapes("Task").Cells("Width").IsInherited The Edit Formula button opens a dialog to enable you to edit multiple lines, since the edit formula bar only displays a single line, and some formulae can be quite large. You can display the Formula Tracing window using the Show Window button in the Formula Tracing group on the ShapeSheet Tools present in Design tab. You can decide whether to Trace Dependents, which displays other cells that have a formula that refers to the selected cell or Trace Precedents, which displays other cells that the formula in this cell refers to. Of course, this can be done in code too. 
For example, the following VBA code will print out the selected cell in a ShapeSheet into the Immediate Window:

Public Sub DebugPrintCellProperties()
    'Abort if ShapeSheet not selected in the Visio UI
    If Not Visio.ActiveWindow.Type = Visio.VisWinTypes.visSheet Then
        Exit Sub
    End If
    Dim cel As Visio.Cell
    Set cel = Visio.ActiveWindow.SelectedCell
    'Print out some of the cell properties
    Debug.Print "Section", cel.Section
    Debug.Print "Row", cel.Row
    Debug.Print "Column", cel.Column
    Debug.Print "Name", cel.Name
    Debug.Print "FormulaU", cel.FormulaU
    Debug.Print "ResultIU", cel.ResultIU
    Debug.Print "ResultStr("""")", cel.ResultStr("")
    Debug.Print "Dependents", UBound(cel.Dependents)
    'cel.Precedents may cause an error
    On Error Resume Next
    Debug.Print "Precedents", UBound(cel.Precedents)
End Sub

In the previous screenshot, where the Actions.SetDefaultSize.Action cell is selected in the Task shape from the BPMN Basic Shapes stencil, the DebugPrintCellProperties macro outputs the following:

Section        240
Row            2
Column         3
Name           Actions.SetDefaultSize.Action
FormulaU       SETF(GetRef(Width),User.DefaultWidth)+SETF(GetRef(Height),User.DefaultHeight)
ResultIU       0
ResultStr("")  0.0000
Dependents     0
Precedents     4

Firstly, any cell can be referred to by either its name or its section/row/column indices, commonly referred to as SRC. Secondly, the FormulaU should produce a ResultIU of 0 if the formula is correctly formed and there is no numerical output from it. Thirdly, the Precedents and Dependents are actually an array of referenced cells.

Can I print out the ShapeSheet settings?

You can download and install the Microsoft Visio SDK from the Visio Developer Center (visit http://msdn.microsoft.com/en-us/office/aa905478.aspx). This will install an extra group, Visio SDK, on the Developer ribbon and one extra button, Print ShapeSheet. I have chosen the Clipboard option and pasted the report into an Excel worksheet, as in the following screenshot:

The output displays the cell name, value, and formula in each section, in an extremely verbose manner. This makes for many rows in the worksheet, and a varying number of columns in each section.

What is a function?

A function defines a discrete action, and most functions take a number of arguments as input. Some functions produce an output as a value in the cell that contains the formula, whilst others redirect the output to another cell, and some do not produce a useful output at all. The Developer ShapeSheet Reference in the Visio SDK contains a description of each of the 197 functions available in Visio 2010, and there are some more that are reserved for use by Visio itself. Formulae can be entered into any cell, but some cells will be updated by the Visio engine or by specific add-ons, thus overwriting any formula that may be within the cell. Formulae are entered starting with the = (equals) sign, just as in Excel cells, so that Visio can understand that a formula is being entered rather than just text. Some cells have been primed to expect text (strings) and will automatically prefix what you type with =" (equals double-quote) and close with " (double-quote) if you do not start typing with an equals sign. For example, the function NOW() returns the current date-time value, which you can modify by applying a format, say, =FORMAT(NOW(),"dd/MM/YYYY"). In fact, the NOW() function will evaluate every minute unless you specify that it only updates at a specific event.
You could, for example, cause the formula to be evaluated only when the shape is moved, by adding the DEPENDSON() function: =DEPENDSON(PinX,PinY)+NOW() The normal user will not see the result of any values unless there is something changing in the UI. This could be a value in the Shape Data that could cause linked Data Graphics to change. Or there could be something more subtle, such as the display of some geometry within the shape, like the Compensation symbol in the BPMN Task shape. In the above example, you can see that the Compensation right-mouse menu option is checked, and the IsForCompensation Shape Data value is TRUE. These values are linked, and the Task shape itself displays the two triangles at the bottom edge. The custom right-mouse menu options are defined in the Actions section of the shape's ShapeSheet, and one of the cells, Checked, holds a formula to determine if a tick should be displayed or not. In this case, the Actions.Compensation.Checked cell contains the following formula, which is merely a cell reference: =Prop.BpmnIsForCompensation Prop is the prefix used for all cells in the Shape Data section because this section used to be known as Custom Properties. The Prop.BpmnIsForCompensation row is defined as a Boolean (True/False) Type, so the returned value is going to be 1 or 0 (True or False). Thus, if you were to build a validation rule that required a Task to be for Compensation, then you would have to check this value. You will often need to branch expressions using the following: IF(logical_expression, value_if_true, value_if_false)
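For instance, a branching formula could turn the Boolean Shape Data row discussed above into a human-readable label, or into a 1/0 flag for use elsewhere in the ShapeSheet. These two formulas are illustrative sketches only (the target cell they would live in, such as a User-defined row, is up to you; Prop.BpmnIsForCompensation is the real row discussed above):

=IF(Prop.BpmnIsForCompensation, "Compensation task", "Standard task")
=IF(Prop.BpmnIsForCompensation, 1, 0)

A validation rule that requires a Task to be for Compensation performs essentially the same test, checking that Prop.BpmnIsForCompensation evaluates to 1 (True).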

article-image-implementing-autoencoders-using-h2o
Amey Varangaonkar
27 Oct 2017
4 min read
Save for later

Implementing Autoencoders using H2O

[box type="note" align="" class="" width=""]This excerpt is taken from the book Neural Networks with R, Chapter 7, Use Cases of Neural Networks - Advanced Topics, written by Giuseppe Ciaburro and Balaji Venkateswaran. In this article, we see how R is an effective tool for neural network modelling, by implementing autoencoders using the popular H2O library.[/box]

An autoencoder is an ANN used for the unsupervised learning of efficient codings. The purpose of an autoencoder is to learn a coding for a set of data, typically to reduce dimensionality. Architecturally, the simplest form of autoencoder is a feedforward, non-recurrent neural network very similar to the MLP, with an input layer, an output layer, and one or more hidden layers connecting them, but with the output layer having the same number of nodes as the input layer so that it can rebuild its inputs.

In this section, we present an example of implementing autoencoders using H2O on a movie dataset. The dataset used in this example is a set of movies and genres taken from https://grouplens.org/datasets/movielens. We use the movies.csv file, which has three columns:

movieId
title
genres

There are 164,979 rows of data for clustering. We will use h2o.deeplearning with the autoencoder parameter to form the clusters. The objective of the exercise is to cluster the movies based on genre, which can then be used to recommend similar movies or same-genre movies to the users. The program uses h2o.deeplearning, with the autoencoder parameter set to T:

library("h2o")
setwd ("c://R")
#Load the training dataset of movies
movies=read.csv( "movies.csv", header=TRUE)
head(movies)
model=h2o.deeplearning(2:3, training_frame=as.h2o(movies), hidden=c(2), autoencoder = T, activation="Tanh")
summary(model)
features=h2o.deepfeatures(model, as.h2o(movies), layer=1)
d=as.matrix(features[1:10,])
labels=as.vector(movies[1:10,2])
plot(d,pch=17)
text(d,labels,pos=3)

Now, let's go through the code:

library("h2o")
setwd ("c://R")

These commands load the library into the R environment and set the working directory where we will have placed the dataset for the next step. Then we load the data:

movies=read.csv( "movies.csv", header=TRUE)

To visualize the type of data contained in the dataset, we analyze a preview of one of these variables:

head(movies)

The following figure shows the first 20 rows of the movie dataset. Now we build and train the model:

model=h2o.deeplearning(2:3, training_frame=as.h2o(movies), hidden=c(2), autoencoder = T, activation="Tanh")

Let's analyze some of the information contained in model:

summary(model)

This is an extract from the results of the summary() function. In the next command, we use the h2o.deepfeatures() function to extract the nonlinear features from an h2o dataset using an H2O deep learning model:

features=h2o.deepfeatures(model, as.h2o(movies), layer=1)

In the following output, the first six rows of the features extracted from the model are shown:

> features
    DF.L1.C1   DF.L1.C2
1  0.2569208 -0.2837829
2  0.3437048 -0.2670669
3  0.2969089 -0.4235294
4  0.3214868 -0.3093819
5  0.5586608  0.5829145
6  0.2479671 -0.2757966

[9125 rows x 2 columns]

Finally, we plot a diagram where we want to see how the model grouped the movies through the results obtained from the analysis:

d=as.matrix(features[1:10,])
labels=as.vector(movies[1:10,2])
plot(d,pch=17)
text(d,labels,pos=3)

The plot of the movies, once clustering is done, is shown next. We have plotted only 100 movie titles due to space issues.
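One practical note before reading the plot: the listing above assumes an H2O instance is already running, since as.h2o() and h2o.deeplearning() need a live cluster to talk to. If you start from a fresh R session, something like the following would typically come first (a minimal sketch; the nthreads value is just a common choice):

library("h2o")
h2o.init(nthreads = -1)   # start or connect to a local H2O cluster, using all available cores

With the cluster up, the walkthrough above runs as shown and the resulting plot can be interpreted as follows.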
We can see some movies being closely placed, meaning they are of the same genre. The titles are clustered based on distances between them, based on genre. Given a large number of titles, the movie names cannot be distinguished, but what appears to be clear is that the model has grouped the movies into three distinct groups. If you found this excerpt useful, make sure you check out the book Neural Networks with R, containing an interesting coverage of many such useful and insightful topics.

article-image-rendering-stereoscopic-3d-models-using-opengl
Packt
24 Aug 2015
8 min read
Save for later

Rendering Stereoscopic 3D Models using OpenGL

In this article, by Raymond C. H. Lo and William C. Y. Lo, authors of the book OpenGL Data Visualization Cookbook, we will demonstrate how to visualize data with stunning stereoscopic 3D technology using OpenGL. Stereoscopic 3D devices are becoming increasingly popular, and the latest generation's wearable computing devices (such as the 3D vision glasses from NVIDIA, Epson, and more recently, the augmented reality 3D glasses from Meta) can now support this feature natively. The ability to visualize data in a stereoscopic 3D environment provides a powerful and highly intuitive platform for the interactive display of data in many applications. For example, we may acquire data from the 3D scan of a model (such as in architecture, engineering, and dentistry or medicine) and would like to visualize or manipulate 3D objects in real time. Unfortunately, OpenGL does not provide any mechanism to load, save, or manipulate 3D models. Thus, to support this, we will integrate a new library named Open Asset Import Library (Assimp) into our code. The main dependencies include the GLFW library that requires OpenGL version 3.2 and higher. (For more resources related to this topic, see here.) Stereoscopic 3D rendering 3D television and 3D glasses are becoming much more prevalent with the latest trends in consumer electronics and technological advances in wearable computing. In the market, there are currently many hardware options that allow us to visualize information with stereoscopic 3D technology. One common format is side-by-side 3D, which is supported by many 3D glasses as each eye sees an image of the same scene from a different perspective. In OpenGL, creating side-by-side 3D rendering requires asymmetric adjustment as well as viewport adjustment (that is, the area to be rendered) – asymmetric frustum parallel projection or equivalently to lens-shift in photography. This technique introduces no vertical parallax and widely adopted in the stereoscopic rendering. To illustrate this concept, the following diagram shows the geometry of the scene that a user sees from the right eye: The intraocular distance (IOD) is the distance between two eyes. As we can see from the diagram, the Frustum Shift represents the amount of skew/shift for asymmetric frustrum adjustment. Similarly, for the left eye image, we perform the transformation with a mirrored setting. The implementation of this setup is described in the next section. How to do it... The following code illustrates the steps to construct the projection and view matrices for stereoscopic 3D visualization. The code uses the intraocular distance, the distance of the image plane, and the distance of the near clipping plane to compute the appropriate frustum shifts value. 
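In compact form, the geometry above gives frustum shift = (IOD / 2) * nearZ / screenZ, where screenZ is the distance to the zero-parallax (image) plane, and the left/right frustum bounds of each eye are shifted by that amount in opposite directions. The following is a minimal, self-contained sketch of just this calculation; the names are chosen to mirror the full listing in the next section, where screenZ corresponds to the depthZ parameter:

#include <cmath>

// Asymmetric frustum bounds for one eye (sketch only).
// IOD: intraocular distance, screenZ: distance to the zero-parallax plane,
// fov: vertical field of view in radians, nearZ: near clipping plane, aspect: width/height.
struct FrustumBounds { float left, right, bottom, top; };

FrustumBounds stereoFrustumBounds(bool left_eye, float IOD, float screenZ,
                                  float fov, float nearZ, float aspect) {
    float dir   = left_eye ? 1.0f : -1.0f;            // mirror the shift for the other eye
    float top   = std::tan(fov / 2.0f) * nearZ;
    float shift = (IOD / 2.0f) * nearZ / screenZ;     // the frustum shift from the diagram
    return { -aspect * top + shift * dir,             // left
              aspect * top + shift * dir,             // right
             -top,                                    // bottom
              top };                                  // top
}

These bounds are then passed to glm::frustum (or glFrustum) together with nearZ and farZ, exactly as the full implementation below does.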
In the source file, common/controls.cpp, we add the implementation for the stereo 3D matrix setup:

void computeStereoViewProjectionMatrices(GLFWwindow* window, float IOD, float depthZ, bool left_eye){
  int width, height;
  glfwGetWindowSize(window, &width, &height);
  //up vector
  glm::vec3 up = glm::vec3(0,-1,0);
  glm::vec3 direction_z(0, 0, -1);
  //mirror the parameters with the right eye
  float left_right_direction = -1.0f;
  if(left_eye)
    left_right_direction = 1.0f;
  float aspect_ratio = (float)width/(float)height;
  float nearZ = 1.0f;
  float farZ = 100.0f;
  double frustumshift = (IOD/2)*nearZ/depthZ;
  float top = tan(g_initial_fov/2)*nearZ;
  float right = aspect_ratio*top+frustumshift*left_right_direction; //half screen
  float left = -aspect_ratio*top+frustumshift*left_right_direction;
  float bottom = -top;
  g_projection_matrix = glm::frustum(left, right, bottom, top, nearZ, farZ);
  // update the view matrix
  g_view_matrix = glm::lookAt(
    g_position-direction_z+glm::vec3(left_right_direction*IOD/2, 0, 0), //eye position
    g_position+glm::vec3(left_right_direction*IOD/2, 0, 0), //centre position
    up //up direction
  );
}

In the rendering loop in main.cpp, we define the viewports for each eye (left and right) and set up the projection and view matrices accordingly. For each eye, we translate our camera position by half of the intraocular distance, as illustrated in the previous figure:

if(stereo){
  //draw the LEFT eye, left half of the screen
  glViewport(0, 0, width/2, height);
  //computes the MVP matrix from the IOD and virtual image plane distance
  computeStereoViewProjectionMatrices(g_window, IOD, depthZ, true);
  //gets the View and Model Matrix and apply to the rendering
  glm::mat4 projection_matrix = getProjectionMatrix();
  glm::mat4 view_matrix = getViewMatrix();
  glm::mat4 model_matrix = glm::mat4(1.0);
  model_matrix = glm::translate(model_matrix, glm::vec3(0.0f, 0.0f, -depthZ));
  model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateY, glm::vec3(0.0f, 1.0f, 0.0f));
  model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateX, glm::vec3(1.0f, 0.0f, 0.0f));
  glm::mat4 mvp = projection_matrix * view_matrix * model_matrix;
  //sends our transformation to the currently bound shader,
  //in the "MVP" uniform variable
  glUniformMatrix4fv(matrix_id, 1, GL_FALSE, &mvp[0][0]);
  //render scene, with different drawing modes
  if(drawTriangles) obj_loader->draw(GL_TRIANGLES);
  if(drawPoints) obj_loader->draw(GL_POINTS);
  if(drawLines) obj_loader->draw(GL_LINES);

  //Draw the RIGHT eye, right half of the screen
  glViewport(width/2, 0, width/2, height);
  computeStereoViewProjectionMatrices(g_window, IOD, depthZ, false);
  projection_matrix = getProjectionMatrix();
  view_matrix = getViewMatrix();
  model_matrix = glm::mat4(1.0);
  model_matrix = glm::translate(model_matrix, glm::vec3(0.0f, 0.0f, -depthZ));
  model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateY, glm::vec3(0.0f, 1.0f, 0.0f));
  model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateX, glm::vec3(1.0f, 0.0f, 0.0f));
  mvp = projection_matrix * view_matrix * model_matrix;
  glUniformMatrix4fv(matrix_id, 1, GL_FALSE, &mvp[0][0]);
  if(drawTriangles) obj_loader->draw(GL_TRIANGLES);
  if(drawPoints) obj_loader->draw(GL_POINTS);
  if(drawLines) obj_loader->draw(GL_LINES);
}

The final rendering result consists of two separate images, one on each side of the display; note that each image is compressed horizontally by a scaling factor of two.
For some display systems, each side of the display is required to preserve the same aspect ratio depending on the specifications of the display. Here are the final screenshots of the same models in true 3D using stereoscopic 3D rendering: Here's the rendering of the architectural model in stereoscopic 3D: How it works... The stereoscopic 3D rendering technique is based on the parallel axis and asymmetric frustum perspective projection principle. In simpler terms, we rendered a separate image for each eye as if the object was seen at a different eye position but viewed on the same plane. Parameters such as the intraocular distance and frustum shift can be dynamically adjusted to provide the desired 3D stereo effects. For example, by increasing or decreasing the frustum asymmetry parameter, the object will appear to be moved in front or behind the plane of the screen. By default, the zero parallax plane is set to the middle of the view volume. That is, the object is set up so that the center position of the object is positioned at the screen level, and some parts of the object will appear in front of or behind the screen. By increasing the frustum asymmetry (that is, positive parallax), the scene will appear to be pushed behind the screen. Likewise, by decreasing the frustum asymmetry (that is, negative parallax), the scene will appear to be pulled in front of the screen. The glm::frustum function sets up the projection matrix, and we implemented the asymmetric frustum projection concept illustrated in the drawing. Then, we use the glm::lookAt function to adjust the eye position based on the IOP value we have selected. To project the images side by side, we use the glViewport function to constrain the area within which the graphics can be rendered. The function basically performs an affine transformation (that is, scale and translation) which maps the normalized device coordinate to the window coordinate. Note that the final result is a side-by-side image in which the graphic is scaled by a factor of two vertically (or compressed horizontally). Depending on the hardware configuration, we may need to adjust the aspect ratio. The current implementation supports side-by-side 3D, which is commonly used in most wearable Augmented Reality (AR) or Virtual Reality (VR) glasses. Fundamentally, the rendering technique, namely the asymmetric frustum perspective projection described in our article, is platform-independent. For example, we have successfully tested our implementation on the Meta 1 Developer Kit (https://www.getameta.com/products) and rendered the final results on the optical see-through stereoscopic 3D display: Here is the front view of the Meta 1 Developer Kit, showing the optical see-through stereoscopic 3D display and 3D range-sensing camera: The result is shown as follows, with the stereoscopic 3D graphics rendered onto the real world (which forms the basis of augmented reality): See also In addition, we can easily extend our code to support shutter glasses-based 3D monitors by utilizing the Quad Buffered OpenGL APIs (refer to the GL_BACK_RIGHT and GL_BACK_LEFT flags in the glDrawBuffer function). Unfortunately, such 3D formats require specific hardware synchronization and often require higher frame rate display (for example, 120Hz) as well as a professional graphics card. Further information on how to implement stereoscopic 3D in your application can be found at http://www.nvidia.com/content/GTC-2010/pdfs/2010_GTC2010.pdf. 
Summary In this article, we covered how to visualize data with stunning stereoscopic 3D technology using OpenGL. OpenGL does not provide any mechanism to load, save, or manipulate 3D models. Thus, to support this, we have integrated a new library named Assimp into the code. Resources for Article: Further resources on this subject: Organizing a Virtual Filesystem [article] Using OpenCL [article] Introduction to Modern OpenGL [article]
article-image-using-logistic-regression-predict-market-direction-algorithmic-trading
Richa Tripathi
14 Feb 2018
9 min read
Save for later

Using Logistic regression to predict market direction in algorithmic trading

[box type="note" align="" class="" width=""]This article is an excerpt from a book by Dr. Param Jeet and Prashant Vats titled Learning Quantitative Finance with R. This book will help you learn about various algorithmic trading techniques and ways to optimize them using the tools available in R.[/box] In this tutorial we will learn how logistic regression is used to forecast market direction. Market direction is very important for investors or traders. Predicting market direction is quite a challenging task as market data involves lots of noise. The market moves either upward or downward and the nature of market movement is binary. A logistic regression model help us to fit a model using binary behavior and forecast market direction. Logistic regression is one of the probabilistic models which assigns probability to each event. We are going to use the quantmod package. The next three commands are used for loading the package into the workspace, importing data into R from the yahoo repository and extracting only the closing price from the data: >library("quantmod") >getSymbols("^DJI",src="yahoo") >dji<- DJI[,"DJI.Close"] The input data to the logistic regression is constructed using different indicators, such as moving average, standard deviation, RSI, MACD, Bollinger Bands, and so on, which has some predictive power in market direction, that is, Up or Down. These indicators can be constructed using the following commands: >avg10<- rollapply(dji,10,mean) >avg20<- rollapply(dji,20,mean) >std10<- rollapply(dji,10,sd) >std20<- rollapply(dji,20,sd) >rsi5<- RSI(dji,5,"SMA") >rsi14<- RSI(dji,14,"SMA") >macd12269<- MACD(dji,12,26,9,"SMA") >macd7205<- MACD(dji,7,20,5,"SMA") >bbands<- BBands(dji,20,"SMA",2) The following commands are to create variable direction with either Up direction (1) or Down direction (0). Up direction is created when the current price is greater than the 20 days previous price and Down direction is created when the current price is less than the 20 days previous price: >direction<- NULL >direction[dji> Lag(dji,20)] <- 1 >direction[dji< Lag(dji,20)] <- 0 Now we have to bind all columns consisting of price and indicators, which is shown in the following command: >dji<- cbind(dji,avg10,avg20,std10,std20,rsi5,rsi14,macd12269,macd7205,bbands,dire ction) The dimension of the dji object can be calculated using dim(). I used dim() over dji and saved the output in dm(). dm() has two values stored: the first value is the number of rows and the second value is the number of columns in dji. Column names can be extracted using colnames(). The third command is used to extract the name for the last column. Next I replaced the column name with a particular name, Direction: >dm<- dim(dji) >dm [1] 2493   16 >colnames(dji)[dm[2]] [1] "..11" >colnames(dji)[dm[2]] <- "Direction" >colnames(dji)[dm[2]] [1] "Direction" We have extracted the Dow Jones Index (DJI) data into the R workspace. Now, to implement logistic regression, we should divide the data into two parts. The first part is in- sample data and the second part is out-sample data. In-sample data is used for the model building process and out-sample data is used for evaluation purposes. This process also helps to control the variance and bias in the model. 
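One small, optional sanity check before splitting the data, which is not part of the original walkthrough: the rolling indicators (rollapply, RSI, MACD, BBands) leave leading NA rows in the merged object, so it can be worth inspecting how many missing values there are:

colSums(is.na(dji))   # count of missing values per column, mostly from indicator warm-up periods
head(dji, 25)         # the first rows show where the leading NAs end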
The next four lines define the in-sample start, in-sample end, out-sample start, and out-sample end dates:

>issd<- "2010-01-01"
>ised<- "2014-12-31"
>ossd<- "2015-01-01"
>osed<- "2015-12-31"

The following two commands get the row numbers for these dates: the variable isrow holds the row numbers for the in-sample date range and osrow holds the row numbers for the out-sample date range:

>isrow<- which(index(dji) >= issd & index(dji) <= ised)
>osrow<- which(index(dji) >= ossd & index(dji) <= osed)

The variables isdji and osdji are the in-sample and out-sample datasets respectively:

>isdji<- dji[isrow,]
>osdji<- dji[osrow,]

If you look at the in-sample data, isdji, you will notice that the scale of each column differs: a few columns are on a scale of 100, a few on a scale of 10,000, and a few on a scale of 1. This difference in scale can cause problems, as higher weights end up being assigned to variables on larger scales. So before moving ahead, you should standardize the dataset using the following formula:

standardized data = (data - mean) / standard deviation          (6.1)

The mean and standard deviation of each column can be computed using apply():

>isme<- apply(isdji,2,mean)
>isstd<- apply(isdji,2,sd)

A matrix of ones with the same dimensions as the in-sample data is generated using the following command; it is used for the normalization:

>isidn<- matrix(1,dim(isdji)[1],dim(isdji)[2])

Use formula (6.1) to standardize the data:

>norm_isdji<-  (isdji - t(isme*t(isidn))) / t(isstd*t(isidn))

The preceding line also standardizes the Direction column, that is, the last column. We don't want direction to be standardized, so I replace the last column again with the variable direction over the in-sample date range:

>dm<- dim(isdji)
>norm_isdji[,dm[2]] <- direction[isrow]

Now we have created all the data required for model building. Next, you should build a logistic regression model, which will help you predict market direction based on the in-sample data. First, I created a formula which has Direction as the dependent variable and all other columns as independent variables. Then I used a generalized linear model, glm(), to fit the model, supplying the formula, the family, and the dataset:

>formula<- paste("Direction ~ .",sep="")
>model<- glm(formula,family="binomial",norm_isdji)

A summary of the model can be viewed using the following command:

>summary(model)

Next, use predict() to obtain the fitted values on the same dataset:

>pred<- predict(model,norm_isdji)

Once you have the fitted values, you can convert them to probabilities using the following command. This transforms the output into probabilistic form, in the range [0, 1]:

>prob<- 1 / (1+exp(-(pred)))

The figure shown below is plotted using the following commands.
The first line of code divides the figure into two rows and one column, where the first plot shows the model prediction and the second shows the probability:

>par(mfrow=c(2,1))
>plot(pred,type="l")
>plot(prob,type="l")

head() can be used to look at the first few values of the variable:

>head(prob)
2010-01-04 2010-01-05 2010-01-06 2010-01-07
 0.8019197  0.4610468  0.7397603  0.9821293

The following figure shows the variable pred defined above, which is a real number, and its conversion to a value between 0 and 1, which represents the probability prob, using the preceding transformation:

Figure 6.1: Prediction and probability distribution of DJI

As the probabilities are in the range (0, 1), so is our vector prob. Now, to classify them into one of the two classes, I assign Up direction (1) when prob is greater than 0.5 and Down direction (0) when prob is less than or equal to 0.5. This assignment can be done using the following commands: prob > 0.5 generates TRUE for the points where it holds, and pred_direction[prob > 0.5] assigns 1 to all such points. Similarly, the next statement assigns 0 when the probability is less than or equal to 0.5:

>pred_direction<- NULL
>pred_direction[prob> 0.5] <- 1
>pred_direction[prob<= 0.5] <- 0

Once we have the predicted direction, we should check the model accuracy: how often the model predicted Up as Up and Down as Down. There might be cases where it predicted the opposite of the actual direction, such as predicting Down when the direction is actually Up and vice versa. We can use the confusionMatrix() function from the caret package, which returns a matrix as output. All the diagonal elements are correct predictions and the off-diagonal elements are errors, or wrong predictions. One should aim to reduce the off-diagonal elements of a confusion matrix:

>install.packages('caret')
>library(caret)
>matrix<- confusionMatrix(pred_direction,norm_isdji$Direction)
>matrix
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 362  35
         1  42 819

               Accuracy : 0.9388
                 95% CI : (0.9241, 0.9514)
    No Information Rate : 0.6789
    P-Value [Acc > NIR] : <2e-16
                  Kappa : 0.859
 Mcnemar's Test P-Value : 0.4941
            Sensitivity : 0.8960
            Specificity : 0.9590
         Pos Pred Value : 0.9118
         Neg Pred Value : 0.9512
             Prevalence : 0.3211
         Detection Rate : 0.2878
   Detection Prevalence : 0.3156
      Balanced Accuracy : 0.9275

The preceding output shows that we got almost 94% correct predictions, as 362 + 819 = 1181 out of 1258 (the sum of all four values) are correct. A prediction accuracy above 80% on in-sample data is generally considered good; however, 80% is not a fixed threshold, and one has to judge an appropriate value based on the dataset and the industry. Now that you have implemented the logistic regression model, which predicted about 94% of directions correctly, you need to test it for generalization power. One should test the model on out-sample data and check its accuracy. The first step is to standardize the out-sample data using formula (6.1).
Here the mean and standard deviation must be the same as those used for the in-sample normalization:

>osidn<- matrix(1,dim(osdji)[1],dim(osdji)[2])
>norm_osdji<-  (osdji - t(isme*t(osidn))) / t(isstd*t(osidn))
>norm_osdji[,dm[2]] <- direction[osrow]

Next we use predict() on the out-sample data and use its output to calculate the probabilities:

>ospred<- predict(model,norm_osdji)
>osprob<- 1 / (1+exp(-(ospred)))

Once the probabilities are determined for the out-sample data, you should assign them to either the Up or the Down class using the following commands. confusionMatrix() then generates a matrix for the out-sample data:

>ospred_direction<- NULL
>ospred_direction[osprob> 0.5] <- 1
>ospred_direction[osprob<= 0.5] <- 0
>osmatrix<- confusionMatrix(ospred_direction,norm_osdji$Direction)
>osmatrix
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 115  26
         1  12  99

               Accuracy : 0.8492
                 95% CI : (0.7989, 0.891)

This shows 85% accuracy on the out-sample data. A realistic trading model would also account for trading costs and market slippage, which decrease the winning odds significantly.

We presented techniques used in capital markets and learned how a logistic regression model of the market's binary up/down behavior can be used to forecast its direction. If you enjoyed this excerpt, check out the book Learning Quantitative Finance with R to dive deeper into the vast world of algorithmic and machine-learning-based trading.
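To make the out-sample accuracy more tangible, the short sketch below (illustrative only, not from the book) turns the predicted directions into a naive long/flat signal and compares its total log return with buy-and-hold over the out-sample period. It assumes that the first column of osdji is the closing price, it pairs each day's prediction with the following day's return even though the model actually targets the 20-day direction, and it ignores transaction costs and slippage, so it overstates real performance:

>ret <- diff(log(as.numeric(osdji[,1])))       # daily log returns, one fewer than the number of days
>signal <- head(ospred_direction, -1)          # the prediction available before each return
>strat_ret <- ret * signal                     # earn the return only when the prediction is Up (1)
>c(strategy = sum(strat_ret), buy_and_hold = sum(ret))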

Introduction to Clustering and Unsupervised Learning

Packt
23 Feb 2016
16 min read
The act of clustering, or spotting patterns in data, is not much different from spotting patterns in groups of people. In this article, you will learn:

The ways in which clustering tasks differ from classification tasks
How clustering defines a group, and how such groups are identified by k-means, a classic and easy-to-understand clustering algorithm
The steps needed to apply clustering to a real-world task of identifying marketing segments among teenage social media users

Before jumping into action, we'll begin by taking an in-depth look at exactly what clustering entails.

Understanding clustering

Clustering is an unsupervised machine learning task that automatically divides data into clusters, or groups of similar items. It does this without having been told ahead of time how the groups should look. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction. It provides insight into the natural groupings found within data.

Without advance knowledge of what comprises a cluster, how can a computer possibly know where one group ends and another begins? The answer is simple. Clustering is guided by the principle that items inside a cluster should be very similar to each other, but very different from those outside. The definition of similarity might vary across applications, but the basic idea is always the same: group the data so that related elements are placed together.

The resulting clusters can then be used for action. For instance, you might find clustering methods employed in the following applications:

Segmenting customers into groups with similar demographics or buying patterns for targeted marketing campaigns
Detecting anomalous behavior, such as unauthorized network intrusions, by identifying patterns of use falling outside the known clusters
Simplifying extremely large datasets by grouping features with similar values into a smaller number of homogeneous categories

Overall, clustering is useful whenever diverse and varied data can be exemplified by a much smaller number of groups. It results in meaningful and actionable data structures that reduce complexity and provide insight into patterns of relationships.

Clustering as a machine learning task

Clustering is somewhat different from the classification, numeric prediction, and pattern detection tasks we have examined so far. In each of those cases, the result is a model that relates features to an outcome or features to other features; conceptually, the model describes the existing patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label that is inferred entirely from the relationships within the data. For this reason, you will sometimes see the clustering task referred to as unsupervised classification because, in a sense, it classifies unlabeled examples.

The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related (for instance, it might return the groups A, B, and C), but it's up to you to apply actionable and meaningful labels. To see how this impacts the clustering task, let's consider a hypothetical example. Suppose you were organizing a conference on the topic of data science.
To facilitate professional networking and collaboration, you planned to seat people in groups according to one of three research specialties: computer and/or database science, math and statistics, and machine learning. Unfortunately, after sending out the conference invitations, you realize that you forgot to include a survey asking which discipline the attendee would prefer to be seated with. In a stroke of brilliance, you realize that you might be able to infer each scholar's research specialty by examining his or her publication history. To this end, you begin collecting data on the number of articles each attendee published in computer science-related journals and the number of articles published in math or statistics-related journals. Using the data collected for several scholars, you create a scatterplot:

As expected, there seems to be a pattern. We might guess that the upper-left corner, which represents people with many computer science publications but few articles on math, could be a cluster of computer scientists. Following this logic, the lower-right corner might be a group of mathematicians. Similarly, the upper-right corner, those with both math and computer science experience, may be machine learning experts.

Our groupings were formed visually; we simply identified clusters as closely grouped data points. Yet in spite of the seemingly obvious groupings, we unfortunately have no way of knowing whether they are truly homogeneous without personally asking each scholar about his or her academic specialty. The labels we applied required us to make qualitative, presumptive judgments about the types of people that would fall into each group. For this reason, you might imagine the cluster labels in uncertain terms, as follows:

Rather than defining the group boundaries subjectively, it would be nice to use machine learning to define them objectively. This might provide us with a rule of the form "if a scholar has few math publications, then he or she is a computer science expert". Unfortunately, there's a problem with this plan. As we do not have data on the true class value for each point, a supervised learning algorithm would have no ability to learn such a pattern, as it would have no way of knowing what splits would result in homogeneous groups.

On the other hand, clustering algorithms use a process very similar to what we did by visually inspecting the scatterplot. Using a measure of how closely the examples are related, homogeneous groups can be identified. In the next section, we'll start looking at how clustering algorithms are implemented.

This example highlights an interesting application of clustering. If you begin with unlabeled data, you can use clustering to create class labels. From there, you could apply a supervised learner such as decision trees to find the most important predictors of these classes. This is called semi-supervised learning.

The k-means clustering algorithm

The k-means algorithm is perhaps the most commonly used clustering method. Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques. If you understand the simple principles it uses, you will have the knowledge needed to understand nearly any clustering algorithm in use today. Many such methods are listed in the CRAN Task View for clustering at http://cran.r-project.org/web/views/Cluster.html. As k-means has evolved over time, there are many implementations of the algorithm.
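As a side note that is not part of the original text, base R's kmeans() function (in the stats package) exposes several of these classic implementations through its algorithm argument. A minimal sketch, using a small hypothetical matrix of publication counts:

>x <- cbind(compsci = c(9, 8, 1, 2, 6, 7), math = c(1, 2, 8, 9, 7, 8))   # hypothetical publication counts
>kmeans(x, centers = 3, algorithm = "Hartigan-Wong")    # the default; "Lloyd", "Forgy", and "MacQueen" are also accepted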
One popular approach is described in Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979; 28:100-108.

Even though clustering methods have advanced since the inception of k-means, this is not to imply that k-means is obsolete. In fact, the method may be more popular now than ever. The following lists summarize the strengths and weaknesses that explain why k-means is still widely used:

Strengths:
Uses simple principles that can be explained in non-statistical terms
Highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings
Performs well enough under many real-world use cases

Weaknesses:
Not as sophisticated as more modern clustering algorithms
Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters
Requires a reasonable guess as to how many clusters naturally exist in the data
Not ideal for non-spherical clusters or clusters of widely varying density

The k-means algorithm assigns each of the n examples to one of the k clusters, where k is a number that has been determined ahead of time. The goal is to minimize the differences within each cluster and maximize the differences between clusters. Unless k and n are extremely small, it is not feasible to compute the optimal clusters across all possible combinations of examples. Instead, the algorithm uses a heuristic process that finds locally optimal solutions. Put simply, this means that it starts with an initial guess for the cluster assignments, and then modifies the assignments slightly to see whether the changes improve the homogeneity within the clusters.

We will cover the process in depth shortly, but the algorithm essentially involves two phases. First, it assigns examples to an initial set of k clusters. Then, it updates the assignments by adjusting the cluster boundaries according to the examples that currently fall into each cluster. The process of updating and assigning repeats until changes no longer improve the cluster fit. At that point, the process stops and the clusters are finalized.

Due to the heuristic nature of k-means, you may end up with somewhat different final results by making only slight changes to the starting conditions. If the results vary dramatically, this could indicate a problem. For instance, the data may not have natural groupings, or the value of k may have been poorly chosen. With this in mind, it's a good idea to run a cluster analysis more than once to test the robustness of your findings.

To see how the process of assigning and updating works in practice, let's revisit the case of the hypothetical data science conference. Though this is a simple example, it will illustrate the basics of how k-means operates under the hood.

Using distance to assign and update clusters

As with k-NN, k-means treats feature values as coordinates in a multidimensional feature space. For the conference data, there are only two features, so we can represent the feature space as a two-dimensional scatterplot, as depicted previously. The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers. These centers are the catalyst that spurs the remaining examples to fall into place. Often, the points are chosen by selecting k random examples from the training dataset. As we hope to identify three clusters, according to this method, k = 3 points will be selected at random.
These points are indicated by the star, triangle, and diamond in the following diagram:

It's worth noting that although the three cluster centers in the preceding diagram happen to be widely spaced apart, this is not always the case. Since they are selected at random, the three centers could just as easily have been three adjacent points. As the k-means algorithm is highly sensitive to the starting position of the cluster centers, random chance may have a substantial impact on the final set of clusters. To address this problem, k-means can be modified to use different methods for choosing the initial centers. For example, one variant chooses random values occurring anywhere in the feature space (rather than only selecting among the values observed in the data). Another option is to skip this step altogether; by randomly assigning each example to a cluster, the algorithm can jump ahead immediately to the update phase. Each of these approaches adds a particular bias to the final set of clusters, which you may be able to use to improve your results.

In 2007, an algorithm called k-means++ was introduced, which proposes an alternative method for selecting the initial cluster centers. It purports to be an efficient way to get much closer to the optimal clustering solution while reducing the impact of random chance. For more information, refer to Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. 2007:1027-1035.

After choosing the initial cluster centers, the other examples are assigned to the cluster center that is nearest according to the distance function. You will remember that we studied distance functions while learning about k-Nearest Neighbors. Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used. Recall that if n indicates the number of features, the Euclidean distance between example x and example y is:

dist(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

For instance, if we are comparing a guest with five computer science publications and one math publication to a guest with zero computer science papers and two math papers, we can compute this distance in R as follows:

> sqrt((5 - 0)^2 + (1 - 2)^2)
[1] 5.09902

Using this distance function, we find the distance between each example and each cluster center, and each example is then assigned to the nearest cluster center. Keep in mind that because we are using distance calculations, all the features need to be numeric, and the values should be normalized to a standard range ahead of time.

As shown in the following diagram, the three cluster centers partition the examples into three segments labeled Cluster A, Cluster B, and Cluster C. The dashed lines indicate the boundaries of the Voronoi diagram created by the cluster centers. The Voronoi diagram indicates the areas that are closer to one cluster center than any other; the vertex where all three boundaries meet is at the maximal distance from all three cluster centers. Using these boundaries, we can easily see the regions claimed by each of the initial k-means seeds:

Now that the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase. The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the average position of the points currently assigned to that cluster.
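To make the assignment and update phases concrete, here is a small illustrative sketch that is not from the original article. It uses hypothetical two-feature points, assigns each one to the nearest of three initial centers by Euclidean distance, and then recomputes each center as the mean of its assigned points:

># hypothetical data: rows are examples, columns are the two features
>pts <- rbind(c(9, 1), c(8, 2), c(1, 8), c(2, 9), c(6, 7), c(7, 8))
>centers <- pts[c(1, 3, 5), ]                                  # three examples chosen as initial centers
>dists <- as.matrix(dist(rbind(centers, pts)))[-(1:3), 1:3]    # distance from every point to every center
>assignment <- apply(dists, 1, which.min)                      # assignment phase: nearest center wins
>new_centers <- t(sapply(1:3, function(k) colMeans(pts[assignment == k, , drop = FALSE])))  # update phase
>new_centers                                                   # updated centers (centroids), one row per cluster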
The following diagram illustrates how, as the cluster centers shift to the new centroids, the boundaries in the Voronoi diagram also shift, and a point that was once in Cluster B (indicated by an arrow) is added to Cluster A:

As a result of this reassignment, the k-means algorithm continues with another update phase. After shifting the cluster centroids, updating the cluster boundaries, and reassigning points to new clusters (as indicated by arrows), the figure looks like this:

Because two more points were reassigned, another update must occur, which moves the centroids and updates the cluster boundaries. However, because these changes result in no further reassignments, the k-means algorithm stops. The cluster assignments are now final:

The final clusters can be reported in one of two ways. First, you might simply report the cluster assignment, such as A, B, or C, for each example. Alternatively, you could report the coordinates of the cluster centroids after the final update. Given either reporting method, you are able to define the cluster boundaries by calculating the centroids or assigning each example to its nearest cluster.

Choosing the appropriate number of clusters

In the introduction to k-means, we learned that the algorithm is sensitive to the randomly chosen cluster centers. Indeed, if we had selected a different combination of three starting points in the previous example, we might have found clusters that split the data differently from what we expected. Similarly, k-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting k to be very large will improve the homogeneity of the clusters, but at the same time it risks overfitting the data.

Ideally, you will have a priori knowledge (a prior belief) about the true groupings, and you can apply this information to choosing the number of clusters. For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards. In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited. Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example, the number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list. Extending this idea to another business case, if the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals.

Without any prior knowledge, one rule of thumb suggests setting k equal to the square root of n / 2, where n is the number of examples in the dataset. However, this rule of thumb is likely to result in an unwieldy number of clusters for large datasets. Luckily, there are other statistical methods that can assist in finding a suitable k-means cluster set. A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k. As illustrated in the following diagrams, the homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will continue to decrease with more clusters.
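This trade-off can be visualized directly. The sketch below is illustrative and not from the original article: it runs base R's kmeans() for a range of k values on a hypothetical two-feature dataset and records the total within-cluster sum of squares, a heterogeneity measure that shrinks as k grows; the bend in the resulting curve is the elbow discussed next:

>set.seed(123)                                   # k-means starts from random centers
>x <- matrix(rnorm(200), ncol = 2)               # hypothetical two-feature dataset
>wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
>plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")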
As you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find the k beyond which there are diminishing returns. This value of k is known as the elbow point because the bend in the curve looks like an elbow.

There are numerous statistics for measuring homogeneity and heterogeneity within clusters that can be used with the elbow method (the following information box provides a citation for more detail). Still, in practice, it is not always feasible to iteratively test a large number of k values. This is in part because clustering large datasets can be fairly time consuming; clustering the data repeatedly is even worse. Regardless, applications requiring the exact optimal set of clusters are fairly rare. In most clustering applications, it suffices to choose a k value based on convenience rather than strict performance requirements.

For a very thorough review of the vast assortment of cluster performance measures, refer to Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of Intelligent Information Systems. 2001; 17:107-145.

The process of setting k can itself lead to interesting insights. By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries. Groups that are more tightly clustered will change very little, while less homogeneous groups will form and disband over time. In general, it may be wise to spend little time worrying about getting k exactly right. The next example will demonstrate how even a tiny bit of subject-matter knowledge borrowed from a Hollywood film can be used to set k such that actionable and interesting clusters are found. As clustering is unsupervised, the task is really about what you make of it; the value is in the insights you take away from the algorithm's findings.

Summary

This article covered only the fundamentals of clustering. As a very mature machine learning method, there are many variants of the k-means algorithm, as well as many other clustering algorithms that bring unique biases and heuristics to the task. Based on the foundation in this article, you will be able to understand and apply other clustering methods to new problems.

To learn more about different machine learning techniques, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r)
Mastering Scientific Computing with R (https://www.packtpub.com/application-development/mastering-scientific-computing-r)
R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science)

Resources for Article:

Further resources on this subject:

Displaying SQL Server Data using a Linq Data Source [article]
Probability of R? [article]
Working with Commands and Plugins [article]