The Storage - Apache Cassandra

Packt
23 Jan 2017
42 min read
In this article by Raúl Estrada, the author of the book Fast Data Processing Systems with SMACK Stack, we will learn about Apache Cassandra. We have reached the part where we talk about storage: the C in the SMACK stack refers to Cassandra. The reader may wonder: why not use a conventional database? The answer is that Cassandra is the database that propels giants like Walmart, CERN, Cisco, Facebook, Netflix, and Twitter. Spark uses a lot of Cassandra's power, and application efficiency is greatly increased by using the Spark Cassandra Connector. This article has the following sections:

A bit of history
NoSQL
Apache Cassandra installation
Authentication and authorization (roles)
Backup and recovery
Spark + a connector

A bit of history

In Greek mythology, there was a priestess who was chastised for her treason against the god Apollo. She asked for the power of prophecy in exchange for a carnal meeting; however, she failed to fulfill her part of the deal. So, she received a punishment: she would have the power of prophecy, but no one would ever believe her forecasts. This priestess's name was Cassandra.

Moving to more recent times, let's say 50 years ago, there have been big changes in the world of computing. In 1960, the HDD (Hard Disk Drive) took precedence over magnetic tapes, which facilitated data handling. In 1966, IBM created the Information Management System (IMS) for the Apollo space program, from whose hierarchical model IBM DB2 was later developed. In the 1970s, a model appeared that fundamentally changed the existing data storage methods: the relational data model, devised by Codd as an alternative to IBM's IMS and its way of organizing and storing data. In 1985, his work presented 12 rules that a database should meet in order to be considered a relational database.

Then the Web (especially social networks) appeared and demanded the storage of large amounts of data. With a Relational Database Management System (RDBMS), the real costs scale with the number of users, the amount of data, and the response time, that is, the time it takes to run a specific query on the database. In the beginning, it was possible to solve this through vertical scaling: the server machine is upgraded with more RAM, faster processors, and larger and faster HDDs. This mitigates the problem, but it does not make it disappear. When the same problem occurs again and the server cannot be upgraded further, the only solution is to add a new server, which itself may hide unplanned costs: OS licenses, the Database Management System (DBMS), and so on, not to mention data replication, transactions, and data consistency under normal use.

One solution to such problems is the use of NoSQL databases. NoSQL was born from the need to process large amounts of data on large hardware platforms built by clustering servers. The term NoSQL is perhaps not precise; a more appropriate term would be Not Only SQL. It is used for several non-relational databases such as Apache Cassandra, MongoDB, Riak, Neo4J, and so on, which have become more widespread in recent years.

NoSQL

We will read NoSQL as Not Only SQL (SQL: Structured Query Language). NoSQL databases are distributed databases with an emphasis on scalability, high availability, and ease of administration; the opposite of established relational databases. Don't think of NoSQL as a direct replacement for an RDBMS; rather, it is an alternative or a complement.
The focus is on avoiding unnecessary complexity: a solution for data storage according to today's needs, without a fixed schema. Due to its distributed nature, cloud computing is a great NoSQL sponsor. A NoSQL database model can be:

Key-value/tuple based: for example, Redis, Oracle NoSQL (ACID compliant), Riak, Tokyo Cabinet/Tyrant, Voldemort, Amazon Dynamo, and Memcached; used by LinkedIn, Amazon, BestBuy, GitHub, and AOL.

Wide row/column-oriented: for example, Google BigTable, Apache Cassandra, HBase/Hypertable, and Amazon SimpleDB; used by Amazon, Google, Facebook, and RealNetworks.

Document-based: for example, CouchDB (ACID compliant), MongoDB, TerraStore, and Lotus Notes (possibly the oldest); used in various financial and other relevant institutions: the US Army, SAP, MTV, and SourceForge.

Object-based: for example, db4o, Versant, Objectivity, and NEO; used by Siemens, China Telecom, and the European Space Agency.

Graph-based: for example, Neo4J, InfiniteGraph, VertexDB, and FlockDB; used by Twitter, Nortler, Ericsson, Qualcomm, and Siemens.

XML, multivalue, and others.

In Table 4-1, we have a comparison of the mentioned data models:

Model      Performance  Scalability  Flexibility  Complexity  Functionality
key-value  high         high         high         low         depends
column     high         high         high         low         depends
document   high         high         high         low         depends
graph      depends      depends      high         high        graph theory
RDBMS      depends      depends      low          moderate    relational algebra

Table 4-1: Categorization and comparison of NoSQL data models, after Scofield and Popescu

NoSQL or SQL? This is the wrong question. It would be better to ask: what do we need? Basically, it all depends on the application's needs; nothing is black and white. If consistency is essential, use an RDBMS. If we need high availability, fault tolerance, and scalability, then use NoSQL. The recommendation for a new project is to evaluate the best of each world. It doesn't make sense to force NoSQL where it doesn't fit, because its benefits (scalability, read/write speeds an entire order of magnitude higher, a soft data model) are conditional advantages, achieved only on the set of problems they were designed to solve. It is necessary to carefully weigh, beyond the marketing, what exactly is needed, what kind of strategy is needed, and how it will be applied to solve our problem. Consider using a NoSQL database only when you decide that it is a better solution than SQL. The challenges for NoSQL databases are elastic scaling, cost effectiveness, simplicity, and flexibility. In Table 4-2, we compare the two models:

NoSQL                   RDBMS
Schema-less             Relational schema
Scalable read/write     Scalable read
Auto high availability  Custom high availability
Limited queries         Flexible queries
Eventual consistency    Consistency
BASE                    ACID

Table 4-2: Comparison of NoSQL and RDBMS

CAP: Brewer's theorem

In 2000, the nineteenth international Symposium on Principles of Distributed Computing was held in Portland, Oregon, with Eric Brewer, a professor at UC Berkeley, as keynote speaker. In his presentation, among other things, he said that there are three basic system requirements which have a special relationship when designing and implementing applications in a distributed environment, and that a distributed system can have a maximum of two of the three properties (which is the basis of his theorem).
The three properties are:

Consistency: the data on one node must be the same when read from a second node; the second node must show exactly the same data (there could be a delay if someone in between is performing an update, but the data cannot be different).

Availability: a failure on one node does not mean the loss of its data; the system must still be able to return the requested data.

Partition tolerance: in the event of a breakdown in communication between two nodes, the system should still work, meaning the data will still be available.

In Figure 4-1, we show CAP, Brewer's theorem, with some examples.

Figure 4-1: CAP, Brewer's theorem

Apache Cassandra installation

In the Facebook laboratories, new software is developed that is not always visible to the public; Cassandra, for example, is the junction of two concepts coming from the development departments of Google and Amazon. In short, Cassandra is defined as a distributed database. From the beginning, the authors took on the task of creating a massively scalable, decentralized database, optimized for read operations where possible, allowing data structures to be modified painlessly and, with all this, not difficult to manage. The solution was found by combining two existing technologies: Google's BigTable and Amazon's Dynamo. One of the two authors, A. Lakshman, had earlier worked on BigTable and borrowed its data model layout, while Dynamo contributed the overall distributed architecture.

Cassandra is written in Java, and for good performance it requires the latest possible JDK version. Cassandra 1.0 used another open source project, Thrift, for client access, which also came from Facebook and is currently an Apache Software Foundation project. In Cassandra 2.0, Thrift was removed in favor of CQL. Initially, Thrift was not made just for Cassandra; it is a software library tool and code generator for accessing backend services.

Cassandra administration is done with the command-line tools or via the JMX console; the default installation also allows us to use additional client tools. Since this is a server cluster, it has different administration rules, and it is always good to review the documentation to take advantage of other people's experiences. Cassandra manages very demanding tasks successfully. It is often used on sites serving a huge number of users (such as Twitter, Digg, Facebook, and Cisco) that relatively often change their complex data models to meet the challenges that come later, and usually without having to deal with expensive hardware or licenses. At the time of writing, the Cassandra homepage (http://cassandra.apache.org) says that Apple Inc., for example, has a 75,000 node cluster storing 10 petabytes.

Data model

The storage model of Cassandra can be seen as a sorted HashMap of sorted HashMaps. Cassandra is a database that stores rows in key-value form. In this model, the number of columns is not predefined in advance as in standard relational databases; a single row can contain several columns. The column (Figure 4-2: Column) is the smallest atomic unit of the model. Each element in a column consists of a triplet: a name, a value (stored as a series of bytes without regard to the source type), and a timestamp (the time used to determine the most recent record).

Figure 4-2: Column

All data triplets are obtained from the client, even the timestamp.
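CQL makes this per-column timestamp visible to the client through the WRITETIME() function, and an INSERT can even supply its own timestamp with USING TIMESTAMP. The following short sketch uses the employee table created later in this article; the timestamp value and the output layout are only illustrative:

cqlsh:hr> insert into employee (sid, name, last_name)
      ... values (2, 'Jane', 'Doe')
      ... using timestamp 1484916200000000;
cqlsh:hr> select name, writetime(name) from employee where sid = 2;

 name | writetime(name)
------+------------------
 Jane | 1484916200000000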
A row, thus, consists of a key and a set of such data triplets (Figure 4-3). Here is how the super column will look:

Figure 4-3: Super column

In addition, columns can be grouped into so-called column families (Figure 4-4: Column family), which are somewhat equivalent to tables and can be indexed:

Figure 4-4: Column family

A higher logical unit is the super column family (as shown in Figure 4-5: Super column family), in which columns contain other columns:

Figure 4-5: Super column family

Above all of this is the key space (as shown in Figure 4-6: Cluster with key spaces), which is equivalent to a relational schema and is typically used by one application. The data model is simple but at the same time very flexible, and it takes some time to become accustomed to the new way of thinking while letting go of all of SQL's syntactic luxury. The replication factor is unique per keyspace. Moreover, a keyspace can span multiple data centers and have a different replication factor for each of them; this is used in geo-distributed deployments.

Figure 4-6: Cluster with key spaces

Data storage

Apache Cassandra is designed to process large amounts of data in a short time; this way of storing data is taken from its big brother, Google's BigTable. Cassandra has a commit log file in which all new data is recorded in order to ensure its durability. When data is successfully written to the commit log, the freshest data is stored in a memory structure called a memtable (Cassandra considers a write successful only when the same information is in both the commit log and the memtable). Data within a memtable is sorted by row key. When the memtable is full, its contents are copied to the hard drive into a structure called a Sorted String Table (SSTable). The process of copying content from the memtable into an SSTable is called a flush. Flushes are performed periodically, although they can also be carried out manually (for example, before restarting a node) through the nodetool flush command.

The SSTable provides a fixed, sorted map of row keys and values. Data entered into one SSTable cannot be changed, but it is possible to enter new data. The internal structure of an SSTable consists of a series of 64 KB blocks (the block size can be changed); internally, an SSTable has a block index used to locate the blocks. One data row is usually stored within several SSTables, so reading a single data row is performed in the background by combining the SSTables and the memtable (which has not yet been flushed). To optimize this combining process, Cassandra uses a memory structure called a Bloom filter. Every SSTable has a Bloom filter, which checks whether the requested row key is in the SSTable before looking it up on disk.

To reduce row fragmentation across several SSTables, Cassandra performs another background process: compaction, a merge of several SSTables into a single SSTable. Fragmented data is combined based on the row key values. After the new SSTable is created, the old SSTables are labeled as outdated and marked for deletion by the garbage collection process. Compaction has different strategies, size-tiered compaction and leveled compaction, and each has its own benefits for different scenarios.

Installation

To install Cassandra, go to http://www.planetcassandra.org/cassandra/. Installation is simple: after downloading the compressed files, extract them and change a couple of settings in the configuration files (set the new directory paths). Run the startup script to activate a single node, the database server.
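As a rough sketch of what this looks like on a Linux machine (the version number and paths below are placeholders, not a specific recommendation; adapt them to the release you actually download):

$ tar -xzf apache-cassandra-3.0.10-bin.tar.gz
$ cd apache-cassandra-3.0.10
# edit conf/cassandra.yaml so that data_file_directories, commitlog_directory,
# and saved_caches_directory point to writable locations
$ bin/cassandra -f
# in a second terminal: confirm the node is up and force a manual memtable flush
$ bin/nodetool status
$ bin/nodetool flush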
Of course, it is possible to use Cassandra on only one node, but then we lose its main power: distribution. The process of adding new servers to the cluster is called bootstrap and is generally not a difficult operation. Once all the servers are active, they form a ring of nodes, none of which is central, meaning there is no main server. Within the ring, information is propagated to all servers through a gossip protocol. In short, one node transmits information about new instances to only some of its known colleagues, and if one of them already knows about the new node from other sources, the first node's propagation stops. Thus, information about a node is propagated through the network in an efficient and rapid way.

For a new node to be activated, it needs to seed its information to at least one existing server in the cluster so that the gossip protocol works. The server receives its numeric identifier, and each of the ring nodes stores its data. Which nodes store a piece of information depends on the MD5 hash of the key (a key-value combination), as shown in Figure 4-7: Nodes within a cluster.

Figure 4-7: Nodes within a cluster

The nodes are in a circular stack, that is, a ring, and each record is stored on multiple nodes, so in case one of them fails, the data is still available. Nodes are occupied according to their identifier's integer range; that is, if the calculated value falls into a node's range, then the data is saved there. Saving is not performed on only one node; more is better, and an operation is considered a success if the data is correctly stored on as many nodes as possible. All of this is parameterized. In this way, Cassandra achieves sufficient data consistency and provides greater robustness for the entire system: if one node in the ring fails, it is always possible to retrieve valid information from the other nodes. In the event that a node comes back online, it is necessary to synchronize its data, which is achieved through the read operation: the data is read from all the ring servers, and a node keeps just the data accepted as valid, that is, the most recent data; the comparison is made according to the timestamps of the records. The nodes that don't have the latest information refresh their data in a low-priority background process. Although this brief description of the architecture makes it sound like it is full of holes, in reality everything works flawlessly. Indeed, more servers in the game implies a better general situation.

DataStax OpsCenter

In this section, we install Cassandra on a computer with a Windows operating system (to prove that nobody is excluded). Installing software under the Apache open license can be complicated on a Windows computer, especially if it is new software, such as Cassandra. To make things simpler, we will use a distribution package for easy installation, start-up, and work with Cassandra on a Windows computer. The distribution used in this example is called DataStax Community Edition. DataStax contains Apache Cassandra, along with the Cassandra Query Language (CQL) tool and the free version of DataStax OpsCenter for managing and monitoring the Cassandra cluster. We can say that OpsCenter is a kind of DBMS for NoSQL databases.
After downloading the installer from DataStax's official site, the installation process is quite simple; just keep in mind that DataStax supports Windows 7 and Windows Server 2008, and that DataStax on a Windows computer requires the Chrome or Firefox web browser (Internet Explorer is not supported). When starting DataStax on a Windows computer, DataStax will open as in Figure 4-8: DataStax OpsCenter.

Figure 4-8: DataStax OpsCenter

DataStax consists of a control panel (dashboard), in which we review the events, performance, and capacity of the cluster, and also see how many nodes belong to our cluster (in this case, a single node). In cluster control, we can see the different types of views (ring, physical, list). Adding a new key space (the equivalent of creating a database in a classic DBMS) is done through the CQL shell using CQL, or using DataStax data modeling. Also, using the data explorer, we can view the column families and the database.

Creating a key space

The main tool for managing Cassandra, CQL, runs in a console interface, and this tool is used to add new key spaces in which we will create column families. A key space is created as follows:

cqlsh> create keyspace hr with strategy_class = 'SimpleStrategy'
   ... and strategy_options:replication_factor = 1;

After opening the CQL shell, the create keyspace command makes a new key space; the strategy_class = 'SimpleStrategy' parameter sets the replication strategy class used for the new key space. Optionally, strategy_options:replication_factor = 1 determines how many copies of each row the cluster keeps: a replication factor of 1 keeps a single copy of each row in the cluster (if we set it to 2, two replicas of each row are kept, on different nodes).

cqlsh> use hr;
cqlsh:hr> create columnfamily employee (sid int primary key,
      ... name varchar,
      ... last_name varchar);

There are two replication strategies, SimpleStrategy and NetworkTopologyStrategy, whose syntax is as follows:

{ 'class' : 'SimpleStrategy', 'replication_factor' : <integer> };
{ 'class' : 'NetworkTopologyStrategy'[, '<data center>' : <integer>, '<data center>' : <integer>] . . . };

When NetworkTopologyStrategy is configured as the replication strategy, we set up one or more virtual data centers. To create a new column family, we use the create command: select the desired key space and, with the command create columnfamily employee, we create a new table in which we define sid, an integer, as the primary key, and other attributes such as name and last_name. To enter data into a column family, we use the insert command:

insert into <table name> (<column_1>, <column_2>, ... <column_n>) values (<value_1>, <value_2>, ... <value_n>);

When filling data tables, we use the common SQL syntax:

cqlsh:hr> insert into employee (sid, name, last_name) values (1, 'Raul', 'Estrada');

So we enter data values. With the select command, we can review our insert:

cqlsh:hr> select * from employee;

 sid | name | last_name
-----+------+-----------
   1 | Raul | Estrada
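Note that the strategy_class form above is the older CQL syntax; in CQL 3 (Cassandra 2.x and later) the replication options are passed as a map and tables are normally created with CREATE TABLE. A rough equivalent of the statements above under that newer syntax would be the following (check the CQL reference for your exact version):

cqlsh> CREATE KEYSPACE hr
   ... WITH replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh> USE hr;
cqlsh:hr> CREATE TABLE employee (
      ...   sid int PRIMARY KEY,
      ...   name varchar,
      ...   last_name varchar
      ... );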
Authentication and authorization (roles)

In Cassandra, authentication and authorization must be configured in the cassandra.yaml file and in two additional files. The first file is used to assign users rights over key spaces and column families, while the second is used to assign passwords to users. These files are called access.properties and passwd.properties, and they are located in the Cassandra installation directory. They can be opened with your favorite text editor in order to be configured successfully.

Setting up simple authentication and authorization

The steps are as follows:

1. In the access.properties file, we add the access rights of users and the permissions to read and write certain key spaces and column families.
Syntax: keyspace.columnfamily.permits = users
Example 1: hr <rw> = restrada
Example 2: hr.cars <ro> = restrada, raparicio
In example 1, we give full rights over the key space hr to restrada, while in example 2 we give the users read-only rights to the column family cars.

2. In the passwd.properties file, user names are matched to passwords; on the left side of the equals sign we write the username and on the right side the password.
Example: restrada = Swordfish01

3. After we change the files, and before restarting Cassandra, it is necessary to type the following command in the terminal in order for the changes to be reflected in the database:

$ cd <installation_directory>
$ sh bin/cassandra -f -Dpasswd.properties=conf/passwd.properties -Daccess.properties=conf/access.properties

Note: The third step of setting up authentication and authorization doesn't work on Windows computers and is only needed on Linux distributions. Also, note that user authentication and authorization should not be solved through Cassandra; for safety reasons, this function is not included in the latest Cassandra versions.

Backup

Because Cassandra is a distributed NoSQL database, when we write data to one node, copies are made on other nodes; how the database is copied to other nodes, and the exact number of copies, depends on the replication factor established when we create a key space. But like any other standard SQL database, Cassandra can also create a backup on the local computer. Cassandra creates a copy of the database using snapshots. It is possible to make a snapshot of all the key spaces, or of just one column family. It is also possible to make a snapshot of the entire cluster using the parallel SSH tool (pssh). If the user decides to snapshot the entire cluster, it can be reinitiated and an incremental backup used on each node. Incremental backups provide a way to back up each node separately, by setting the incremental_backups flag to true in cassandra.yaml. When incremental backups are enabled, Cassandra hard-links each flushed SSTable to a backups directory under the keyspace data directory. This allows storing backups offsite without transferring entire snapshots.

To snapshot a key space, we use the nodetool command:

Syntax: nodetool snapshot -cf <ColumnFamily> <keyspace> -t <snapshot_name>
Example: nodetool snapshot -cf cars hr -t snapshot1

The snapshot is stored in the Cassandra installation directory:

C:\Program Files\DataStax Community\data\data\en\example\snapshots
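Putting those pieces together, a minimal backup routine for a single node might look like the following sketch (keyspace, table, and tag names are the ones used above; exact nodetool flags vary slightly between Cassandra versions, so treat this as illustrative):

# in cassandra.yaml: keep incremental backups of every flushed SSTable
incremental_backups: true

# flush memtables, then take a named snapshot of the cars column family in the hr key space
$ nodetool flush hr cars
$ nodetool snapshot -cf cars -t snapshot1 hr

# list existing snapshots and remove the one we no longer need
$ nodetool listsnapshots
$ nodetool clearsnapshot -t snapshot1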
Compression

Compression increases the capacity of the cluster nodes by reducing the size of the data on disk. With this function, compression also enhances the server's disk performance. Compression in Cassandra works best when compressing a column family with a lot of columns, when each row has the same columns, or when we have a lot of common columns with the same data. A good example of this is a column family that contains user information, such as user name and password, because it is likely that the same data is repeated. The more the same data is repeated across rows, the higher the compression ratio. Column family compression is done with the Cassandra-CLI tool.

It is possible to update existing column families or to create a new column family with specific compression settings, for example, the compression shown here:

CREATE COLUMN FAMILY users
  WITH comparator = 'UTF8Type'
  AND key_validation_class = 'UTF8Type'
  AND column_metadata = [
    {column_name: name, validation_class: UTF8Type},
    {column_name: email, validation_class: UTF8Type},
    {column_name: country, validation_class: UTF8Type},
    {column_name: birth_date, validation_class: LongType}
  ]
  AND compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};

We will see this output:

Waiting for schema agreement....
... schemas agree across the cluster

After opening the Cassandra-CLI, we need to choose the key space where the new column family will live. When creating the column family, it is necessary to state that the comparator (UTF8Type) and key_validation_class are of the same type; with this, we ensure that when executing the command we won't get an exception (generated by a bug). After listing the column names, we set compression_options, which has two possible classes: SnappyCompressor, which provides faster data compression, or DeflateCompressor, which provides a higher compression ratio. The chunk_length_kb option adjusts the compression chunk size in kilobytes.
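For reference, the same settings can also be expressed in CQL instead of the legacy Cassandra-CLI. The following sketch uses the CQL 3 compression map as documented for Cassandra 2.x (later versions rename the options to 'class' and 'chunk_length_in_kb'), so verify the option names against your version:

CREATE TABLE users (
  name text PRIMARY KEY,
  email text,
  country text,
  birth_date bigint
) WITH compression = { 'sstable_compression' : 'SnappyCompressor', 'chunk_length_kb' : 64 };

ALTER TABLE users
  WITH compression = { 'sstable_compression' : 'DeflateCompressor' };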
Recovery

Recovering a key space from a snapshot requires all the snapshot files made for a certain column family and, if incremental backups are used, all the incremental backup files created after the snapshot. There are multiple ways to perform a recovery from a snapshot: we can use the SSTable loader tool (used exclusively on Linux distributions) or we can recreate the installation method.

Restart node

If the recovery is being done for one node, we must first shut down that node. If the recovery is for the entire cluster, it is necessary to restart each node in the cluster. Here is the procedure:

1. Shut down the node.
2. Delete all the log files in: C:\Program Files\DataStax Community\logs
3. Delete all .db files within the specified key space and column family: C:\Program Files\DataStax Community\data\data\en\cars
4. Locate all snapshots related to the column family: C:\Program Files\DataStax Community\data\data\en\cars\snapshots\1351279613842
5. Copy them to: C:\Program Files\DataStax Community\data\data\en\cars
6. Restart the node.

Printing schema

Through DataStax OpsCenter or the Apache Cassandra CLI, we can obtain the schemas (key spaces) with their associated column families, but there is no way to export or print the data. Apache Cassandra is not an RDBMS, and it is not possible to obtain a relational model schema from the key space database.

Logs

Apache Cassandra and DataStax OpsCenter both use the Apache log4j logging service API. In the directory where DataStax is installed, under apache-cassandra and opscenter, there is a conf directory containing the files log4j-server.properties and log4j-tools.properties for apache-cassandra, and log4j.properties for OpsCenter. The parameters of the log4j files can be modified using a text editor. Log files are stored in plain text in the ...\DataStax Community\logs directory, and it is possible to change the directory location where the log files are stored.

Configuring log4j

The log4j configuration files are divided into several parts, where parameters are set that specify how the collected data is processed and written into the log files. For the root logger:

# RootLogger level
log4j.rootLogger = INFO, stdout, R

This section defines the logging level, that is, which events are recorded in the log file. As we can see in Table 4-3, the log level can be:

Level  Record
ALL    The lowest level; all events are recorded in the log file
DEBUG  Detailed information about events
ERROR  Information about runtime errors or unexpected events
FATAL  Critical error information
INFO   Information about the state of the system
OFF    The highest level; logging is turned off
TRACE  Detailed debug information
WARN   Information about potentially adverse events (unwanted/unexpected runtime errors)

Table 4-3: log4j log levels

For standard output (stdout):

# stdout
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = %5p %d{HH:mm:ss,SSS} %m%n

Through the standard output appender, we define the appearance of the data in the log. The ConsoleAppender class is used to write the data, and the ConversionPattern property defines the layout of the data written into the log file. In the diagram, we can see what the data stored in a log file looks like, as defined by the previous configuration.

Log file rotation

In this example, we rotate the log when it reaches 20 MB and we retain just 50 log files:

# rolling log file
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.maxFileSize=20MB
log4j.appender.R.maxBackupIndex=50
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%5p [%t] %d{ISO8601} %F (line %L) %m%n

This part sets up the log files. The RollingFileAppender class inherits from FileAppender, and its role is to back up a log file when it reaches a given size (in this case, 20 MB). The RollingFileAppender class has several methods; these two are the most used:

public void setMaxFileSize( String value )

This method defines the file size and can take a value from 0 to 2^63 using the abbreviations KB, MB, or GB. The value is automatically converted (in the example, the file size is limited to 20 MB).

public void setMaxBackupIndex( int maxBackups )

This method defines how many backup files are kept before the oldest log file is deleted (in this case, 50 log files are retained).

To set the location where the log files will be stored, use:

# Edit the next line to point to your logs directory
log4j.appender.R.File=C:/Program Files (x86)/DataStax Community/logs/cassandra.log

User activity log

The log4j API has the ability to store user activity logs. In production, it is not recommended to use the DEBUG or TRACE log levels.

Transaction log

As mentioned earlier, any new data is first stored in the commit log file. Within the cassandra.yaml configuration file, we can set the location where the commit log files will be stored:

# commit log
commitlog_directory: "C:/Program Files (x86)/DataStax Community/data/commitlog"

SQL dump

It is not possible to make an SQL dump of the database; we can only snapshot it.

CQL

CQL stands for Cassandra Query Language and is a language like SQL; with this language, we make queries on a key space. There are several ways to interact with a key space; in the previous sections, we showed how to do it using the CQL shell. Since CQL is the first way to interact with Cassandra, Table 4-4, Shell command summary, lists the main commands that can be used in the CQL shell:

Command      Description
cqlsh        Starts the CQL interactive shell.
CAPTURE      Captures command output and appends it to a file.
CONSISTENCY  Shows the current consistency level, or given a level, sets it.
COPY         Imports and exports CSV (comma-separated values) data to and from Cassandra.
DESCRIBE     Provides information about the connected Cassandra cluster, or about the data objects stored in the cluster.
EXPAND       Formats the output of a query vertically.
EXIT         Terminates cqlsh.
PAGING       Enables or disables query paging.
SHOW         Shows the Cassandra version, host, or tracing information for the current cqlsh client session.
SOURCE       Executes a file containing CQL statements.
TRACING      Enables or disables request tracing.

Table 4-4: Shell command summary

For more detailed information on shell commands, visit: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cqlshCommandsTOC.html

CQL commands

CQL is very similar to SQL, as we have already seen in this article. Table 4-5, CQL command summary, lists the language commands. CQL, like SQL, is based on sentences/statements. These statements are for data manipulation and work with their logical container, the key space. Like SQL statements, they must end with a semicolon (;).

Command           Description
ALTER KEYSPACE    Change property values of a keyspace.
ALTER TABLE       Modify the column metadata of a table.
ALTER TYPE        Modify a user-defined type. Cassandra 2.1 and later.
ALTER USER        Alter existing user options.
BATCH             Write multiple DML statements.
CREATE INDEX      Define a new index on a single column of a table.
CREATE KEYSPACE   Define a new keyspace and its replica placement strategy.
CREATE TABLE      Define a new table.
CREATE TRIGGER    Registers a trigger on a table.
CREATE TYPE       Create a user-defined type. Cassandra 2.1 and later.
CREATE USER       Create a new user.
DELETE            Removes entire rows or one or more columns from one or more rows.
DESCRIBE          Provides information about the connected Cassandra cluster, or about the data objects stored in the cluster.
DROP INDEX        Drop the named index.
DROP KEYSPACE     Remove the keyspace.
DROP TABLE        Remove the named table.
DROP TRIGGER      Removes registration of a trigger.
DROP TYPE         Drop a user-defined type. Cassandra 2.1 and later.
DROP USER         Remove a user.
GRANT             Provide access to database objects.
INSERT            Add or update columns.
LIST PERMISSIONS  List permissions granted to a user.
LIST USERS        List existing users and their superuser status.
REVOKE            Revoke user permissions.
SELECT            Retrieve data from a Cassandra table.
TRUNCATE          Remove all data from a table.
UPDATE            Update columns in a row.
USE               Connect the client session to a keyspace.

Table 4-5: CQL command summary

For more detailed information on CQL commands, visit: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cqlCommandsTOC.html

DBMS cluster

The idea of Cassandra is a database working in a cluster, that is, databases on multiple nodes. Although Cassandra is primarily intended for Linux distributions and clusters are usually built on Linux servers, Cassandra also offers the possibility of building clusters on Windows computers. The first task that must be done prior to setting up the cluster on Windows computers is opening the firewall for the Cassandra DBMS and DataStax OpsCenter. The ports that must be open for Cassandra are 7000 and 9160; for OpsCenter, the ports are 7199, 8888, 61620, and 61621. These are the default ports when we install Cassandra and OpsCenter; however, if necessary, we can specify new ports. Immediately after installing Cassandra and OpsCenter on a Windows computer, it is necessary to stop the DataStax OpsCenter service and the DataStax OpsCenter agent, as in Figure 4-9: Microsoft Windows display services.
Figure 4-9: Microsoft Windows display services

One of Cassandra's advantages is that it automatically distributes the data among the computers of the cluster, using an algorithm over the incoming data. To do this successfully, it is necessary to assign a token to each computer in the cluster. The token is a numeric identifier that indicates the computer's position in the cluster and the range of data in the cluster that the computer is responsible for. Tokens can be generated with the Python interpreter that ships with the Cassandra installation, located in DataStax's installation directory. In the code for generating tokens, the variable num=2 refers to the number of computers in the cluster:

$ python -c "num=2; print \"\n\".join([(\"token %d: %d\" % (i, (i*(2**127)/num))) for i in range(0, num)])"

We will see an output like this:

token 0: 0
token 1: 88743298547982745894789547895490438209

It is necessary to keep the token values, because they will be required in the following steps. We now need to configure the cassandra.yaml file, which we already met in the authentication and authorization section. The cassandra.yaml file must be configured separately on each computer in the cluster. After opening the file, you need to make the following changes (a sketch of these settings appears at the end of this section):

initial_token: On each computer in the cluster, copy one of the generated tokens. Start from token 0 and assign each computer a unique token.
listen_address: Enter the IP address of the computer being configured.
seeds: Enter the IP address of the primary (main) node in the cluster.

Once the file is modified and saved, you must restart the DataStax Community Server, as we have already seen. This should be done only on the primary node. After that, it is possible to check whether the cluster nodes can communicate with each other using nodetool. In nodetool, enter the following command:

nodetool -h localhost ring

If the cluster works, we will see the following result:

Address  DC           Rack   Status  State   Load      Owns   Token
         datacenter1  rack1  Up      Normal  13.41 Kb  50.0%  88743298547982745894789547895490438209
         datacenter1  rack1  Up      Normal  6.68 Kb   50.0%  88743298547982745894789547895490438209

If the cluster is operating normally, select which computer will be the primary OpsCenter (it does not have to be the primary node). Then, on that computer, open opscenter.conf, which can be found in DataStax's installation directory. In that file, you need to find the webserver interface section and set the parameter to the value 0.0.0.0. After that, in the agent section, change the incoming_interface parameter to your computer's IP address. In DataStax's installation directory (on each computer in the cluster), we must also configure the address.yaml file; within these files, set the stomp_interface and local_interface parameters to the IP address of the computer where the file is configured. Now the primary computer should run the DataStax OpsCenter Community and DataStax OpsCenter agent services. After that, run the DataStax OpsCenter agent service on all the other nodes. At this point, it is possible to open DataStax OpsCenter in an Internet browser, and OpsCenter should look like Figure 4-10: Display cluster in OpsCenter.

Figure 4-10: Display cluster in OpsCenter
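Returning to the cassandra.yaml settings listed above, the per-node edits might look like the following sketch for the first node; the IP addresses are placeholders, and note that newer Cassandra versions nest the seed list under a seed_provider section, while older ones accept a plain seeds entry:

# cassandra.yaml on node 1 (illustrative values)
initial_token: 0
listen_address: 192.168.1.10
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.1.10"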
Deleting the database

In Apache Cassandra, there are several ways to delete the database (key space) or parts of the database (a column family, individual rows within a column family, and so on). Although the easiest way to delete is by using the DataStax OpsCenter data modeling tool, there are also commands that can be executed through the Cassandra-CLI or the CQL shell.

CLI delete commands

In Table 4-6, we have the CLI delete commands:

CLI command        Function
del                Used to delete a super column, a column from a column family, or rows within certain columns
drop columnfamily  Deletes a column family and all the data contained in it
drop keyspace      Deletes the key space, all its column families, and the data contained in them
truncate           Deletes all the data from the selected column family

Table 4-6: CLI delete commands

CQL shell delete commands

In Table 4-7, we have the CQL shell delete commands:

CQL shell command   Function
alter_drop          Deletes the specified column from the column family
delete              Deletes one or more columns from one or more rows of the selected column family
delete_columns      Deletes columns from the column family
delete_where        Deletes individual rows
drop_table          Deletes the selected column family and all the data contained in it
drop_columnfamily   Deletes the column family and all the data contained in it
drop_keyspace       Deletes the key space, all its column families, and all the data contained in them
truncate            Deletes all data from the selected column family

Table 4-7: CQL shell delete commands

DB and DBMS optimization

Cassandra optimization is specified in the cassandra.yaml file, and these properties are used to adjust performance and to specify the use of system resources such as disk I/O, memory, and CPU usage.

column_index_size_in_kb
Initial value: 64 KB. Range of values: -.
Column indices are added to each row after the data reaches the default size of 64 kilobytes.

commitlog_segment_size_in_mb
Initial value: 32 MB. Range of values: 8-1024 MB.
Determines the size of a commit log segment. A commit log segment is archived, deleted, or recycled after its data has been transferred to SSTables.

commitlog_sync
Initial value: -. Range of values: -.
The method Cassandra uses to acknowledge writes. It is closely correlated with commitlog_sync_period_in_ms, which controls how often the log is synchronized with the disk.

commitlog_sync_period_in_ms
Initial value: 1000 ms. Range of values: -.
Decides how often the commit log is synced to disk when commitlog_sync is in periodic mode.

commitlog_total_space_in_mb
Initial value: 4096 MB. Range of values: -.
When the size of the commit log reaches this value, Cassandra removes the oldest parts of the commit log. This reduces the amount of data and speeds up node startup.

compaction_preheat_key_cache
Initial value: true. Range of values: true/false.
When this value is set to true, cached row keys are tracked during compaction and re-saved to their new location in the compacted SSTable.

compaction_throughput_mb_per_sec
Initial value: 16. Range of values: 0-32.
Throttles compaction to the given total throughput across the system. The faster data is inserted, the faster compaction needs to be.

concurrent_compactors
Initial value: 1 per CPU core. Range of values: depends on the number of CPU cores.
Adjusts the number of simultaneous compaction processes on the node.

concurrent_reads
Initial value: 32. Range of values: -.
When there is more data than can fit in memory, a bottleneck occurs in reading data from disk.

concurrent_writes
Initial value: 32. Range of values: -.
Writes in Cassandra do not usually depend on I/O limitations; concurrent writes depend on the number of CPU cores. The recommended value is 8 per CPU core.
flush_largest_memtables_at
Initial value: 0.75. Range of values: -.
When Java heap usage reaches this fraction, Cassandra flushes the largest memtable to free memory. This parameter can be used as an emergency measure to prevent out-of-memory errors.

in_memory_compaction_limit_in_mb
Initial value: 64. Range of values: -.
Limits the size of a row that can be compacted in memory. Larger rows use a slower compaction method.

index_interval
Initial value: 128. Range of values: 128-512.
Controls the sampling of entries from the primary row index, trading space against time: the larger the sampling interval, the less effective the index, but the less memory it uses. In technical terms, the interval corresponds to the number of index samples skipped between taking each sample.

memtable_flush_queue_size
Initial value: 4. Range of values: at a minimum, set it to the maximum number of secondary indexes created on a single column family.
Indicates the total number of full memtables allowed to wait for a flush, that is, waiting for a writer thread.

memtable_flush_writers
Initial value: 1 (per data directory). Range of values: -.
Number of memtable flush writer threads. These threads are blocked by disk I/O, and each thread holds a memtable in memory until it is flushed.

memtable_total_space_in_mb
Initial value: 1/3 of the Java heap. Range of values: -.
Total amount of memory used for all the column family memtables on the node.

multithreaded_compaction
Initial value: false. Range of values: true/false.
Useful only on nodes using solid-state disks.

reduce_cache_capacity_to
Initial value: 0.6. Range of values: -.
Used in combination with reduce_cache_sizes_at. When the Java heap reaches the value of reduce_cache_sizes_at, this value is the fraction of the total cache size to reduce to (in this case, the size of the cache is reduced to 60 percent). Used to avoid unexpected out-of-memory errors.

reduce_cache_sizes_at
Initial value: 0.85. Range of values: 1.0 (disabled).
When the Java heap occupancy reported by the garbage collector after a full sweep reaches the percentage stated in this variable (85 percent), Cassandra reduces the size of the caches to the value of reduce_cache_capacity_to.

stream_throughput_outbound_megabits_per_sec
Initial value: off, that is, 400 Mbps (50 MB/s). Range of values: -.
Throttles outbound file streaming from a node to the given throughput in Mbps. This is necessary because Cassandra mostly does sequential I/O when it streams data during bootstrap or repair, which can saturate the network and affect Remote Procedure Call performance.

Bloom filter

Every SSTable has a Bloom filter. On data requests, the Bloom filter checks whether the requested row exists in the SSTable before performing any disk I/O. If the Bloom filter false-positive chance is set too low, it may consume a large amount of memory; a higher value means less memory use. The Bloom filter value ranges from 0.000744 to 1.0. It is recommended to keep the value below 0.1. The Bloom filter value of a column family is adjusted through the CQL shell as follows:

ALTER TABLE <column_family> WITH bloom_filter_fp_chance = 0.01;

Data cache

Apache Cassandra has two caches with which it achieves highly efficient data caching:

Key cache (default: enabled): caches the index of the primary key columns of column families
Row cache (default: disabled): holds entire rows in memory so that reads can be served without using the disk

If both the key cache and the row cache are set, data queries proceed in the way shown in Figure 4-11: Apache Cassandra cache.
Figure 4-11: Apache Cassandra cache

When information is requested, it is first looked up in the row cache; if the information is available, the row cache returns the result without reading from the disk. If the request cannot be served from the row cache, Cassandra checks whether the data can be retrieved through the key cache, which is more efficient than reading from the disk; the retrieved data is finally written to the row cache. Since the key cache stores the key locations of an individual column family, any increase in the key cache has a positive impact on reads for that column family. If the situation permits, a combination of key cache and row cache increases efficiency. It is recommended that the size of the key cache be set in relation to the size of the Java heap. The row cache is used in situations where the data access pattern follows a normal (Gaussian) distribution, rows contain often-read data, and queries often return data from most or all of the columns.

Within the cassandra.yaml file, we have the following options to configure the data cache:

key_cache_size_in_mb
Initial value: empty, meaning "auto" (the minimum of 5% of the heap in MB and 100 MB). Range of values: blank or 0 (key cache disabled).
Defines the key cache size per node.

row_cache_size_in_mb
Initial value: 0 (disabled). Range of values: -.
Defines the row cache size per node.

key_cache_save_period
Initial value: 14400 (that is, 4 hours). Range of values: -.
Defines how often the key cache is saved to disk.

row_cache_save_period
Initial value: 0 (disabled). Range of values: -.
Defines how often the row cache is saved to disk.

row_cache_provider
Initial value: SerializingCacheProvider. Range of values: ConcurrentLinkedHashCacheProvider or SerializingCacheProvider.
Defines the implementation of the row cache.

Java heap tune up

Apache Cassandra interacts with the operating system through the Java virtual machine, so the Java heap size plays an important role. When Cassandra starts, the size of the Java heap is set automatically based on the total amount of RAM (Table 4-8: Determination of the Java heap relative to the amount of RAM). The Java heap size can be adjusted manually by changing the values of the following variables in the cassandra-env.sh file, located in the ...\apache-cassandra\conf directory:

# MAX_HEAP_SIZE="4G"
# HEAP_NEWSIZE="800M"

Total system memory  Java heap size
< 2 GB               Half of the system memory
2 GB - 4 GB          1 GB
> 4 GB               One quarter of the system memory, but no more than 8 GB

Table 4-8: Determination of the Java heap relative to the amount of RAM

Java garbage collection tune up

Apache Cassandra has a GC inspector that is responsible for collecting information on every garbage collection pause longer than 200 ms. Garbage collection processes that occur frequently and take a lot of time (such as concurrent mark-sweep, which can take several seconds) indicate that there is great pressure on garbage collection and on the JVM. The recommendations to address these issues include:

Add new nodes
Reduce the cache size
Adjust the JVM options related to garbage collection
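Those JVM garbage collection options live in cassandra-env.sh alongside the heap settings. The lines below are only illustrative of the kind of CMS-related flags that this generation of Cassandra sets by default; the exact list varies between versions, so treat it as a sketch rather than a recommended configuration:

# cassandra-env.sh -- illustrative GC settings, not an exact default list
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"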
Views, triggers, and stored procedures

By definition, in an RDBMS a view represents a virtual table that acts like a real (created) table but does not actually contain any data; the data obtained is the result of a SELECT query, and a view consists of a combination of rows and columns from one or more different tables. In Cassandra, as in NoSQL in general, all the data for a row key is placed in one column family. As there are no JOIN commands in NoSQL and there is no possibility of flexible queries, the SELECT command lists the actual data, but there is no option to display a virtual table, that is, a view. Since Cassandra does not belong to the RDBMS group, there is no possibility of creating triggers or stored procedures; referential integrity restrictions can be set only in the application code. Also, as Cassandra does not belong to the RDBMS group, we cannot apply Codd's rules.

Client-server architecture

At this point, we have probably already noticed that Apache Cassandra runs on a client-server architecture. By definition, the client-server architecture allows distributed applications, since the tasks are divided into two main parts:

On one hand, the service providers: the servers
On the other hand, the service petitioners: the clients

In this architecture, several clients are allowed to access the server; the server is responsible for meeting the requests and handling each one according to its own rules. So far, we have only used one client, managed from the same machine, that is, from the same data network. cqlsh allows us to connect to Cassandra, access a key space, and send CQL statements to the Cassandra server. This is the most immediate method, but in daily practice it is common to access key spaces from different execution contexts (other systems and other programming languages). Thus, we require clients other than cqlsh; to build them in the Apache Cassandra context, we require connection drivers.

Drivers

A driver is just a software component that allows access to a key space in order to run CQL statements. Fortunately, there are already a lot of drivers to create clients for Cassandra in almost any modern programming language; you can see an extensive list at this URL: http://wiki.apache.org/cassandra/ClientOptions. Typically, in a client-server architecture, different clients access the server from different machines distributed across different networks. Our implementation needs will dictate the required clients.

Summary

NoSQL is not just hype, nor a young technology; it is an alternative, with known limitations and capabilities. It is not an RDBMS killer; it's more like a younger brother who is slowly growing up and taking on some of the burden. Acceptance is increasing, and it will get even better as NoSQL solutions mature. Skepticism may be justified, but only for concrete reasons. Since Cassandra is an easy and free working environment, suitable for application development, it is recommended, especially with the additional utilities that ease and accelerate database administration. Cassandra has some faults (for example, user authentication and authorization are still insufficiently supported in Windows environments) and is best used when there is a need to store large amounts of data. For start-up companies that need to manipulate large amounts of data with the aim of reducing costs, implementing Cassandra in a Linux environment is a must-have.

Installing QuickSight Application

Packt
20 Jan 2017
4 min read
In this article by Rajesh Nadipalli, the author of the book Effective Business Intelligence with QuickSight, we will see how you can install the Amazon QuickSight app from the Apple iTunes store at no cost. You can search for the app in the iTunes store and then proceed to download and install it, or alternatively you can follow this link to download the app.

The Amazon QuickSight app is certified to work with iOS devices running iOS v9.0 and above. Once you have the app installed, you can then proceed to log in to your QuickSight account as shown in the following screenshot:

Figure 1.1: QuickSight sign in

The Amazon QuickSight app is designed to access dashboards and analyses on your mobile device. All interactions in the app are read-only, and changes you make on your device are not applied to the original visuals, so you can explore without any worry.

Dashboards on the go

After you log in to the QuickSight app, you will first see the list of dashboards associated with your QuickSight account for easy access. If you don't see dashboards, then click on the Dashboards icon in the menu at the bottom of your mobile device, as shown in the following screenshot:

Figure 1.2: Accessing dashboards

You will now see the list of dashboards associated with your user ID.

Dashboard detailed view

From the dashboard listing, select the USA Census Dashboard, which will redirect you to the detailed dashboard view. In the detailed dashboard view, you will see all the visuals that are part of that dashboard. You can click on the arrow at the extreme top right of each visual to open the specific chart in full-screen mode, as shown in the following screenshot. In the scatter plot analysis shown in the following screenshot, you can further click on any of the dots to get specific values about that bubble. In the following screenshot, the selected circle is for zip code 94027, which has a PopulationCount of 7,089, a MedianIncome of $216,905, and a MeanIncome of $336,888:

Figure 1.3: Dashboard visual

Dashboard search

The QuickSight mobile app also provides a search feature, which is handy if you know only part of a dashboard's name. Follow these steps to search for a dashboard:

First, ensure you are in the dashboards tab by clicking on the Dashboards icon in the bottom menu.
Next, click on the search icon in the top-right corner.
Next, type the partial name. In the following example, I have typed Usa. QuickSight now searches for all dashboards that have the word Usa in them and lists them. You can then click on a dashboard to get details about that specific dashboard, as shown in the following screenshot:

Figure 1.4: Dashboard search

Favorite a dashboard

QuickSight provides a convenient way to bookmark your dashboards by setting them as favorites. To use this feature, first identify which dashboards you often use and click on the star icon to their right, as shown in the following screenshot. Next, to access all of your favorites, click on the Favorites tab, and the list is then refined to only those dashboards you had previously marked as favorites:

Figure 1.5: Dashboard favorites

Limitations of the mobile app

While dashboards are fairly easy to interact with in the mobile app, there are key limitations compared to the standard browser version, which I am listing as follows:

You cannot create or share dashboards with others using the mobile app.
You cannot zoom in/out of a visual, which would be really useful in scenarios where the charts are dense.
Chart legends are not shown.

Summary

We have seen how to install the Amazon QuickSight app, and that using this app you can browse, search, and view dashboards. We have covered how to access dashboards, search for them, mark favorites, and use the detailed view. We have also seen some limitations of the mobile app.

Clustering Model with Spark

Packt
19 Jan 2017
7 min read
In this article by Manpreet Singh Ghotra and Rajdeep Dua, coauthors of the book Machine Learning with Spark, Second Edition, we will analyze the case where we do not have labeled data available. Supervised learning methods are those where the training data is labeled with the true outcome that we would like to predict (for example, a rating for recommendations, class assignments for classification, or a real target variable in the case of regression).

In unsupervised learning, the model is not supervised with the true target label. The unsupervised case is very common in practice, since obtaining labeled training data can be very difficult or expensive in many real-world scenarios (for example, having humans label training data with class labels for classification). However, we would still like to learn some underlying structure in the data and use it to make predictions. This is where unsupervised learning approaches can be useful. Unsupervised learning models are also often combined with supervised models, for example, applying unsupervised techniques to create new input features for supervised models.

Clustering models are, in many ways, the unsupervised equivalent of classification models. With classification, we try to learn a model that predicts which class a given training example belongs to; the model is essentially a mapping from a set of features to the class. In clustering, we would like to segment the data in such a way that each training example is assigned to a segment called a cluster. The clusters act much like classes, except that the true class assignments are unknown. Clustering models have many use cases that are the same as classification; these include the following:

Segmenting users or customers into different groups based on behavior characteristics and metadata
Grouping content on a website or products in a retail business
Finding clusters of similar genes
Segmenting communities in ecology
Creating image segments for use in image analysis applications such as object detection

Types of clustering models

There are many different forms of clustering models available, ranging from simple to extremely complex ones. The Spark ML library currently provides K-means clustering, which is among the simplest approaches available. However, it is often very effective, its simplicity makes it relatively easy to understand, and it is scalable.

K-means clustering

K-means attempts to partition a set of data points into K distinct clusters (where K is an input parameter for the model). More formally, K-means tries to find clusters so as to minimize the sum of squared errors (or distances) within each cluster. This objective function is known as the within-cluster sum of squared errors (WCSS). It is the sum, over each cluster, of the squared errors between each point and the cluster center.
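Written as a formula (the symbols here are introduced for illustration and are not in the original text), with clusters C_1, ..., C_K and cluster centers mu_1, ..., mu_K, K-means minimizes:

\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2

The two alternating steps described next each reduce (or leave unchanged) this quantity.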
To illustrate the basics of K-means, we will use a simple dataset. We have five classes, which are shown in the following figure:

Figure: Multiclass dataset

However, assume that we don't actually know the true classes. If we use K-means with five clusters, then after the first step, the model's cluster assignments might look like this:

Figure: Cluster assignments after the first K-means iteration

We can see that K-means has already picked out the centers of each cluster fairly well. After the next iteration, the assignments might look like those shown in the following figure:

Figure: Cluster assignments after the second K-means iteration

Things are starting to stabilize, but the overall cluster assignments are broadly the same as they were after the first iteration. Once the model has converged, the final assignments could look like this:

Figure: Final cluster assignments for K-means

As we can see, the model has done a decent job of separating the five clusters. The leftmost three are fairly accurate (with a few incorrect points). However, the two clusters in the bottom-right corner are less accurate. This illustrates the following:

- The iterative nature of K-means
- The model's dependency on the method used to initially select the clusters' centers (here, a random approach)
- How the final cluster assignments can be very good for well-separated data but can be poor for data that is more difficult

Initialization methods

The standard initialization method for K-means, usually simply referred to as the random method, starts by randomly assigning each data point to a cluster before proceeding with the first update step. Spark provides a parallel variant of the K-means++ initialization method (known as K-means||), which is the default initialization method used. Refer to http://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods and http://en.wikipedia.org/wiki/K-means%2B%2B for more information.

The results of using K-means++ are shown here. Note that this time, the difficult bottom-right points have been mostly correctly clustered.

Figure: Final cluster assignments for K-means++
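For readers working in PySpark, the equivalent estimator is exposed in pyspark.ml.clustering. The following is a small, hedged sketch (the toy data is invented for illustration) showing how the number of clusters and the initialization mode are selected; initMode accepts "random" or "k-means||" (the parallel K-means++ variant, which is the default):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-init-sketch").getOrCreate()

# Toy data: a "features" vector column, as expected by pyspark.ml estimators
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([1.2, 0.8]),),
     (Vectors.dense([8.0, 8.0]),), (Vectors.dense([8.2, 7.9]),)],
    ["features"])

# Fit K-means with K = 2 using the default parallel initialization
kmeans = KMeans(k=2, maxIter=20, initMode="k-means||", seed=42)
model = kmeans.fit(df)

print(model.clusterCenters())
model.transform(df).show()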
Variants

There are many other variants of K-means; they focus on initialization methods or the core model. One of the more common variants is fuzzy K-means. This model does not assign each point to one cluster as K-means does (a so-called hard assignment). Instead, it is a soft version of K-means, where each point can belong to many clusters and is represented by the relative membership to each cluster. So, for K clusters, each point is represented as a K-dimensional membership vector, with each entry in this vector indicating the membership proportion in each cluster.

Mixture models

A mixture model is essentially an extension of the idea behind fuzzy K-means; however, it makes an assumption that there is an underlying probability distribution that generates the data. For example, we might assume that the data points are drawn from a set of K independent Gaussian (normal) probability distributions. The cluster assignments are also soft, so each point is represented by K membership weights in each of the K underlying probability distributions. Refer to http://en.wikipedia.org/wiki/Mixture_model for further details and for a mathematical treatment of mixture models.

Hierarchical clustering

Hierarchical clustering is a structured clustering approach that results in a multilevel hierarchy of clusters, where each cluster might contain many subclusters (or child clusters). Each child cluster is, thus, linked to the parent cluster. This form of clustering is often also called tree clustering.

Agglomerative clustering is a bottom-up approach where we have the following:

- Each data point begins in its own cluster
- The similarity (or distance) between each pair of clusters is evaluated
- The pair of clusters that are most similar are found; this pair is then merged to form a new cluster
- The process is repeated until only one top-level cluster remains

Divisive clustering is a top-down approach that works in reverse, starting with one cluster, and at each stage, splitting a cluster into two, until all data points are allocated to their own bottom-level cluster. You can find more information at http://en.wikipedia.org/wiki/Hierarchical_clustering.

Summary

In this article, we explored a new class of model that learns structure from unlabeled data: unsupervised learning. You learned about various clustering models like the K-means model, mixture models, and the hierarchical clustering model. We also considered a simple dataset to illustrate the basics of K-means.

Resources for Article:

Further resources on this subject:
- Spark for Beginners [article]
- Setting up Spark [article]
- Holistic View on Spark [article]


Using the Firebase Real-Time Database

Oliver Blumanski
18 Jan 2017
5 min read
In this post, we are going to look at how to use the Firebase real-time database, along with an example. Here we are writing and reading data from the database using multiple platforms. To do this, we first need a server script that adds data, and secondly we need a component that pulls the data from the Firebase database.

Step 1 - Server Script to collect data

Digest an XML feed and transfer the data into the Firebase real-time database. The script runs frequently as a cron job to refresh the data.

Step 2 - App Component

Subscribe to the data from a JavaScript component, in this case, React Native.

About Firebase

Now that those two steps are complete, let's take a step back and talk about Google Firebase. Firebase offers a range of services such as a real-time database, authentication, cloud notifications, storage, and much more. You can find the full feature list here. Firebase covers three platforms: iOS, Android, and Web. The server script uses Firebase's JavaScript Web API. Having data in this real-time database allows us to query the data from all three platforms (iOS, Android, Web), and in addition, the real-time database allows us to subscribe (listen) to a database path (query), or to query a path once.

Step 1 - Digest XML feed and transfer into Firebase

Firebase Set Up

The first thing you need to do is to set up a Google Firebase project here. In the app, click on "Add another App" and choose Web; a pop-up will show you the configuration. You can copy-paste your config into the example script.

Now you need to set the rules for your Firebase database. You should make yourself familiar with the database access rules. In my example, the path latestMarkets/ is open for write and read. In a real-world production app, you would have to secure this, having authentication for the write permissions. Here are the database rules to get started:

{
  "rules": {
    "users": {
      "$uid": {
        ".read": "$uid === auth.uid",
        ".write": "$uid === auth.uid"
      }
    },
    "latestMarkets": {
      ".read": true,
      ".write": true
    }
  }
}

The Server Script Code

The XML feed contains stock market data and is frequently changing, except on the weekend. To build the server script, some NPM packages are needed:

- Firebase
- Request
- xml2json
- babel-preset-es2015

Require modules and configure the Firebase web API:

const Firebase = require('firebase');
const request = require('request');
const parser = require('xml2json');

// firebase access config
const config = {
  apiKey: "apikey",
  authDomain: "authdomain",
  databaseURL: "dburl",
  storageBucket: "optional",
  messagingSenderId: "optional"
}

// init firebase
Firebase.initializeApp(config)

I write JavaScript code in ES6. It is much more fun. It is a simple script, so let's have a look at the code that is relevant to Firebase. The code below is inserting or overwriting data in the database.
For this script, I am happy to overwrite data:

Firebase.database().ref('latestMarkets/'+value.Symbol).set({
  Symbol: value.Symbol,
  Bid: value.Bid,
  Ask: value.Ask,
  High: value.High,
  Low: value.Low,
  Direction: value.Direction,
  Last: value.Last
})
.then((response) => {
  // callback
  callback(true)
})
.catch((error) => {
  // callback
  callback(error)
})

Firebase Db first references the path:

Firebase.database().ref('latestMarkets/'+value.Symbol)

And then the action you want to do:

// insert/overwrite (promise)
Firebase.database().ref('latestMarkets/'+value.Symbol).set({}).then((result))

// get data once (promise)
Firebase.database().ref('latestMarkets/'+value.Symbol).once('value').then((snapshot))

// listen to db path, get data on change (callback)
Firebase.database().ref('latestMarkets/'+value.Symbol).on('value', ((snapshot) => {})

// ......

Here is the Github repository:

Displaying the data in a React-Native app

This code below will listen to a database path, on data change, all connected devices will synchronise the data:

Firebase.database().ref('latestMarkets/').on('value', snapshot => {
  // do something with snapshot.val()
})

To close the listener, or unsubscribe the path, one can use "off":

Firebase.database().ref('latestMarkets/').off()

I’ve created an example react-native app to display the data:

The Github repository

Conclusion

In mobile app development, one big question is: "What database and cache solution can I use to provide online and offline capabilities?" One way to look at this question is like you are starting a project from scratch. If so, you can fit your data into Firebase, and then this would be a great solution for you. Additionally, you can use it for both web and mobile apps. The great thing is that you don't need to write a particular API, and you can access data straight from JavaScript.

On the other hand, if you have a project that uses MySQL for example, the Firebase real-time database won't help you much. You would need to have a remote API to connect to your database in this case. But even if using the Firebase database isn't a good fit for your project, there are still other features, such as Firebase Storage or Cloud Messaging, which are very easy to use, and even though they are beyond the scope of this post, they are worth checking out.

About the author

Oliver Blumanski is a developer based out of Townsville, Australia. He has been a software developer since 2000, and can be found on GitHub at @blumanski.


Basic Operations of Elasticsearch

Packt
16 Jan 2017
10 min read
In this article by Alberto Maria Angelo Paro, the author of the book ElasticSearch 5.0 Cookbook - Third Edition, you will learn the following recipes:

- Creating an index
- Deleting an index
- Opening/closing an index
- Putting a mapping in an index
- Getting a mapping

(For more resources related to this topic, see here.)

Creating an index

The first operation to do before starting to index data in Elasticsearch is to create an index--the main container of our data. An index is similar to the concept of a database in SQL; it is a container for types (tables in SQL) and documents (records in SQL).

Getting ready

To execute curl via the command line you need to install curl for your operating system.

How to do it...

The HTTP method to create an index is PUT (but POST also works); the REST URL contains the index name:

http://<server>/<index_name>

For creating an index, we will perform the following steps:

1. From the command line, we can execute a PUT call:

curl -XPUT http://127.0.0.1:9200/myindex -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 2,
      "number_of_replicas" : 1
    }
  }
}'

2. The result returned by Elasticsearch should be:

{"acknowledged":true,"shards_acknowledged":true}

3. If the index already exists, a 400 error is returned:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "index_already_exists_exception",
        "reason" : "index [myindex/YJRxuqvkQWOe3VuTaTbu7g] already exists",
        "index_uuid" : "YJRxuqvkQWOe3VuTaTbu7g",
        "index" : "myindex"
      }
    ],
    "type" : "index_already_exists_exception",
    "reason" : "index [myindex/YJRxuqvkQWOe3VuTaTbu7g] already exists",
    "index_uuid" : "YJRxuqvkQWOe3VuTaTbu7g",
    "index" : "myindex"
  },
  "status" : 400
}

How it works...

Because the index name will be mapped to a directory on your storage, there are some limitations to the index name, and the only accepted characters are:

- ASCII letters [a-z]
- Numbers [0-9]
- Point ".", minus "-", "&", and "_"

During index creation, the replication can be set with two parameters in the settings/index object:

- number_of_shards, which controls the number of shards that compose the index (every shard can store up to 2^32 documents)
- number_of_replicas, which controls the number of replicas (how many times your data is replicated in the cluster for high availability). A good practice is to set this value to at least 1.

The API call initializes a new index, which means:

- The index is created in a primary node first and then its status is propagated to all nodes at the cluster level
- A default mapping (empty) is created
- All the shards required by the index are initialized and ready to accept data

The index creation API allows defining the mapping during creation time. The parameter required to define a mapping is mappings and it accepts multiple mappings. So in a single call it is possible to create an index and put the required mappings.

There's more...

The create index command also allows passing the mappings section, which contains the mapping definitions.
It is a shortcut to create an index with mappings, without executing an extra PUT mapping call:

curl -XPOST localhost:9200/myindex -d '{
  "settings" : {
    "number_of_shards" : 2,
    "number_of_replicas" : 1
  },
  "mappings" : {
    "order" : {
      "properties" : {
        "id" : {"type" : "keyword", "store" : "yes"},
        "date" : {"type" : "date", "store" : "no", "index":"not_analyzed"},
        "customer_id" : {"type" : "keyword", "store" : "yes"},
        "sent" : {"type" : "boolean", "index":"not_analyzed"},
        "name" : {"type" : "text", "index":"analyzed"},
        "quantity" : {"type" : "integer", "index":"not_analyzed"},
        "vat" : {"type" : "double", "index":"no"}
      }
    }
  }
}'

Deleting an index

The counterpart of creating an index is deleting one. Deleting an index means deleting its shards, mappings, and data. There are many common scenarios when we need to delete an index, such as:

- Removing the index to clean unwanted/obsolete data (for example, old Logstash indices).
- Resetting an index for a scratch restart.
- Deleting an index that has some missing shard, mainly due to some failures, to bring the cluster back into a valid state. (If a node dies while it is storing a single replica shard of an index, this index is missing a shard, so the cluster state becomes red. In this case, you'll bring the cluster back to a green status, but you lose the data contained in the deleted index.)

Getting ready

To execute curl via the command line you need to install curl for your operating system. The index created in the previous recipe is required, as it will be deleted.

How to do it...

The HTTP method used to delete an index is DELETE. The following URL contains only the index name:

http://<server>/<index_name>

For deleting an index, we will perform the steps given as follows:

1. Execute a DELETE call, by writing the following command:

curl -XDELETE http://127.0.0.1:9200/myindex

2. We check the result returned by Elasticsearch. If everything is all right, it should be:

{"acknowledged":true}

3. If the index doesn't exist, a 404 error is returned:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "index_not_found_exception",
        "reason" : "no such index",
        "resource.type" : "index_or_alias",
        "resource.id" : "myindex",
        "index_uuid" : "_na_",
        "index" : "myindex"
      }
    ],
    "type" : "index_not_found_exception",
    "reason" : "no such index",
    "resource.type" : "index_or_alias",
    "resource.id" : "myindex",
    "index_uuid" : "_na_",
    "index" : "myindex"
  },
  "status" : 404
}

How it works...

When an index is deleted, all the data related to the index is removed from disk and is lost. During the delete processing, first the cluster is updated, and then the shards are deleted from the storage. This operation is very fast; in a traditional filesystem it is implemented as a recursive delete.

It's not possible to restore a deleted index if there is no backup. The special _all index name can also be used to remove all the indices. In production, it is good practice to disable the deletion of all indices by adding the following line to elasticsearch.yml:

action.destructive_requires_name: true
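The same REST endpoints can also be driven from a script instead of curl. The following is a small, hedged Python sketch using the requests library (not from the book); the host, index name, and settings simply mirror the curl examples above and are otherwise assumptions:

import requests

BASE_URL = "http://127.0.0.1:9200"   # assumed local Elasticsearch node
INDEX = "myindex"

# Create the index with the same settings used in the curl example
create_body = {
    "settings": {
        "index": {
            "number_of_shards": 2,
            "number_of_replicas": 1
        }
    }
}
resp = requests.put(f"{BASE_URL}/{INDEX}", json=create_body)
print(resp.status_code, resp.json())   # expect 200 and an acknowledged response

# Delete the index again (a 404 body is returned if it does not exist)
resp = requests.delete(f"{BASE_URL}/{INDEX}")
print(resp.status_code, resp.json())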
Opening/closing an index

If you want to keep your data but save resources (memory/CPU), a good alternative to deleting indices is to close them. Elasticsearch allows you to open/close an index to put it into online/offline mode.

Getting ready

To execute curl via the command line you need to install curl for your operating system.

How to do it...

For opening/closing an index, we will perform the following steps:

1. From the command line, we can execute a POST call to close an index using:

curl -XPOST http://127.0.0.1:9200/myindex/_close

2. If the call is successful, the result returned by Elasticsearch should be:

{"acknowledged":true}

3. To open an index, from the command line, type the following command:

curl -XPOST http://127.0.0.1:9200/myindex/_open

4. If the call is successful, the result returned by Elasticsearch should be:

{"acknowledged":true}

How it works...

When an index is closed, there is no overhead on the cluster (except for metadata state): the index shards are switched off and they don't use file descriptors, memory, or threads.

There are many use cases for closing an index:

- Disabling date-based indices (indices that store their records by date), for example, when you keep an index for a week, month, or day and you want to keep a fixed number of old indices online (that is, two months) and some offline (that is, from two months to six months).
- When you do searches on all the active indices of a cluster and don't want to search in some indices (in this case, using an alias is the best solution, but you can achieve the same concept of an alias with closed indices).

An alias cannot have the same name as an index.

When an index is closed, calling open restores its state.
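The close and open calls map directly onto the _close and _open endpoints, so the same sketch style used for index creation works here as well; this is a hedged Python example with the requests library (host and index name are assumptions matching the curl calls above):

import requests

BASE_URL = "http://127.0.0.1:9200"   # assumed local node
INDEX = "myindex"

# Close the index (offline mode): shards stop using file descriptors, memory, and threads
resp = requests.post(f"{BASE_URL}/{INDEX}/_close")
print("close:", resp.status_code, resp.json())

# Reopen the index (online mode): its previous state is restored
resp = requests.post(f"{BASE_URL}/{INDEX}/_open")
print("open:", resp.status_code, resp.json())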
Putting a mapping in an index

We saw how to build a mapping by indexing documents. This recipe shows how to put a type mapping in an index explicitly. This kind of operation can be considered the Elasticsearch version of the SQL CREATE TABLE command.

Getting ready

To execute curl via the command line you need to install curl for your operating system.

How to do it...

The HTTP method to put a mapping is PUT (POST also works). The URL format for putting a mapping is:

http://<server>/<index_name>/<type_name>/_mapping

For putting a mapping in an index, we will perform the steps given as follows:

1. If we consider the type order, the call will be:

curl -XPUT 'http://localhost:9200/myindex/order/_mapping' -d '{
  "order" : {
    "properties" : {
      "id" : {"type" : "keyword", "store" : "yes"},
      "date" : {"type" : "date", "store" : "no", "index":"not_analyzed"},
      "customer_id" : {"type" : "keyword", "store" : "yes"},
      "sent" : {"type" : "boolean", "index":"not_analyzed"},
      "name" : {"type" : "text", "index":"analyzed"},
      "quantity" : {"type" : "integer", "index":"not_analyzed"},
      "vat" : {"type" : "double", "index":"no"}
    }
  }
}'

2. In case of success, the result returned by Elasticsearch should be:

{"acknowledged":true}

How it works...

This call checks that the index exists and then creates one or more type mappings as described in the definition. If there is an existing mapping for this type, it is merged with the new one during the mapping insert. If a field has a different type and the type cannot be updated, an exception is raised. To prevent an exception during the merging mapping phase, it's possible to set the ignore_conflicts parameter to true (the default is false).

The put mapping call allows you to set the type for several indices in one shot; list the indices separated by commas, or apply it to all indices using the _all alias.

There's more...

There is no delete operation for a mapping. It's not possible to delete a single mapping from an index. To remove or change a mapping you need to manage the following steps:

1. Create a new index with the new/modified mapping
2. Reindex all the records
3. Delete the old index with the incorrect mapping

Getting a mapping

After having set our mappings for processing types, we sometimes need to control or analyze the mapping to prevent issues. The action to get the mapping for a type helps us to understand its structure or its evolution due to merges and implicit type guessing.

Getting ready

To execute curl via the command line you need to install curl for your operating system.

How to do it...

The HTTP method to get a mapping is GET. The URL formats for getting mappings are:

http://<server>/_mapping
http://<server>/<index_name>/_mapping
http://<server>/<index_name>/<type_name>/_mapping

To get a mapping from the type of an index, we will perform the following steps:

1. If we consider the type order of the previous recipe, the call will be:

curl -XGET 'http://localhost:9200/myindex/order/_mapping?pretty=true'

The pretty argument in the URL is optional, but very handy to pretty print the response output.

2. The result returned by Elasticsearch should be:

{
  "myindex" : {
    "mappings" : {
      "order" : {
        "properties" : {
          "customer_id" : {
            "type" : "keyword",
            "store" : true
          },
          … truncated
        }
      }
    }
  }
}

How it works...

The mapping is stored at the cluster level in Elasticsearch. The call checks both index and type existence and then returns the stored mapping. The returned mapping is in a reduced form, which means that the default values for a field are not returned; Elasticsearch stores only non-default field values to reduce network and memory consumption.

Retrieving a mapping is very useful for several purposes:

- Debugging template-level mappings
- Checking whether an implicit mapping was derived correctly by guessing fields
- Retrieving the mapping metadata, which can be used to store type-related information
- Simply checking whether the mapping is correct

If you need to fetch several mappings, it is better to do it at the index level or cluster level to reduce the number of API calls.

Summary

We learned how to manage indices and perform operations on documents. We discussed different operations on indices such as create, delete, update, open, and close. These operations are very important because they allow you to better define the container (index) that will store your documents. The index create/delete actions are similar to the SQL create/delete database commands.

Resources for Article:

Further resources on this subject:
- Elastic Stack Overview [article]
- Elasticsearch – Spicing Up a Search Using Geo [article]
- Downloading and Setting Up ElasticSearch [article]


Flink Complex Event Processing

Packt
16 Jan 2017
13 min read
In this article by Tanmay Deshpande, the author of the book Mastering Apache Flink, we will learn about the Table API provided by Apache Flink and how we can use it to process relational data structures. We will start learning more about the libraries provided by Apache Flink and how we can use them for specific use cases. To start with, let's try to understand a library called complex event processing (CEP). CEP is a very interesting but complex topic that has its value in various industries. Wherever a stream of events is expected, people naturally want to perform complex event processing. Let's try to understand what CEP is all about.

(For more resources related to this topic, see here.)

What is complex event processing?

CEP is a technique to analyze streams of disparate events occurring with high frequency and low latency. These days, streaming events can be found in various industries, for example:

- In the oil and gas domain, sensor data comes from various drilling tools or from upstream oil pipeline equipment
- In the security domain, activity data, malware information, and usage pattern data come from various end points
- In the wearable domain, data comes from various wrist bands with information about your heart rate, your activity, and so on
- In the banking domain, data comes from credit card usage, banking activities, and so on

It is very important to analyze variation patterns to get notified in real time about any change in the regular assembly. CEP is able to understand patterns across the streams of events, sub-events, and their sequences. CEP helps to identify meaningful patterns and complex relationships among unrelated events, and sends notifications in real and near real time to avoid any damage:

The preceding diagram shows how the CEP flow works. Even though the flow looks simple, CEP has various abilities, such as:

- The ability to produce results as soon as the input event stream is available
- The ability to provide computations like aggregation over time and timeouts between two events of interest
- The ability to provide real-time/near real-time alerts and notifications on detection of complex event patterns
- The ability to connect and correlate heterogeneous sources and analyze patterns in them
- The ability to achieve high-throughput, low-latency processing

There are various solutions available in the market. With big data technology advancements, we have multiple options like Apache Spark, Apache Samza, and Apache Beam, among others, but none of them have a dedicated library to fit all solutions. Now let us try to understand what we can achieve with Flink's CEP library.

Flink CEP

Apache Flink provides the Flink CEP library, which provides APIs to perform complex event processing. The library consists of the following core components:

- Event stream
- Pattern definition
- Pattern detection
- Alert generation

Flink CEP works on Flink's streaming API called DataStream. A programmer needs to define the pattern to be detected from the stream of events, and then Flink's CEP engine detects the pattern and takes appropriate action, such as generating alerts. In order to get started, we need to add the following Maven dependency:

<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-cep-scala_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-cep-scala_2.10</artifactId>
  <version>1.1.2</version>
</dependency>

Event stream

A very important component of CEP is its input event stream. We have seen details of the DataStream API.
Now let's use that knowledge to implement CEP. The very first thing we need to do is define a Java POJO for the event. Let's assume we need to monitor a temperature sensor event stream. First we define an abstract class and then extend this class. While defining the event POJOs we need to make sure that we implement the hashCode() and equals() methods, as while comparing the events, compile will make use of them. The following code snippets demonstrate this. First, we write an abstract class as shown here: package com.demo.chapter05; public abstract class MonitoringEvent { private String machineName; public String getMachineName() { return machineName; } public void setMachineName(String machineName) { this.machineName = machineName; } @Override public int hashCode() { final int prime = 31; int result = 1; result = prime * result + ((machineName == null) ? 0 : machineName.hashCode()); return result; } @Override public boolean equals(Object obj) { if (this == obj) return true; if (obj == null) return false; if (getClass() != obj.getClass()) return false; MonitoringEvent other = (MonitoringEvent) obj; if (machineName == null) { if (other.machineName != null) return false; } else if (!machineName.equals(other.machineName)) return false; return true; } public MonitoringEvent(String machineName) { super(); this.machineName = machineName; } } Then we write the actual temperature event: package com.demo.chapter05; public class TemperatureEvent extends MonitoringEvent { public TemperatureEvent(String machineName) { super(machineName); } private double temperature; public double getTemperature() { return temperature; } public void setTemperature(double temperature) { this.temperature = temperature; } @Override public int hashCode() { final int prime = 31; int result = super.hashCode(); long temp; temp = Double.doubleToLongBits(temperature); result = prime * result + (int) (temp ^ (temp >>> 32)); return result; } @Override public boolean equals(Object obj) { if (this == obj) return true; if (!super.equals(obj)) return false; if (getClass() != obj.getClass()) return false; TemperatureEvent other = (TemperatureEvent) obj; if (Double.doubleToLongBits(temperature) != Double.doubleToLongBits(other.temperature)) return false; return true; } public TemperatureEvent(String machineName, double temperature) { super(machineName); this.temperature = temperature; } @Override public String toString() { return "TemperatureEvent [getTemperature()=" + getTemperature() + ", getMachineName()=" + getMachineName() + "]"; } } Now we can define the event source as shown follows. 
In Java: StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); DataStream<TemperatureEvent> inputEventStream = env.fromElements(new TemperatureEvent("xyz", 22.0), new TemperatureEvent("xyz", 20.1), new TemperatureEvent("xyz", 21.1), new TemperatureEvent("xyz", 22.2), new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.3), new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.4), new TemperatureEvent("xyz", 22.7), new TemperatureEvent("xyz", 27.0)); In Scala: val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment val input: DataStream[TemperatureEvent] = env.fromElements(new TemperatureEvent("xyz", 22.0), new TemperatureEvent("xyz", 20.1), new TemperatureEvent("xyz", 21.1), new TemperatureEvent("xyz", 22.2), new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.3), new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.4), new TemperatureEvent("xyz", 22.7), new TemperatureEvent("xyz", 27.0)) Pattern API Pattern API allows you to define complex event patterns very easily. Each pattern consists of multiple states. To go from one state to another state, generally we need to define the conditions. The conditions could be continuity or filtered out events. Let's try to understand each pattern operation in detail. Begin The initial state can be defined as follows: In Java: Pattern<Event, ?> start = Pattern.<Event>begin("start"); In Scala: val start : Pattern[Event, _] = Pattern.begin("start") Filter We can also specify the filter condition for the initial state: In Java: start.where(new FilterFunction<Event>() { @Override public boolean filter(Event value) { return ... // condition } }); In Scala: start.where(event => ... /* condition */) Subtype We can also filter out events based on their sub-types, using the subtype() method. In Java: start.subtype(SubEvent.class).where(new FilterFunction<SubEvent>() { @Override public boolean filter(SubEvent value) { return ... // condition } }); In Scala: start.subtype(classOf[SubEvent]).where(subEvent => ... /* condition */) Or Pattern API also allows us define multiple conditions together. We can use OR and AND operators. In Java: pattern.where(new FilterFunction<Event>() { @Override public boolean filter(Event value) { return ... // condition } }).or(new FilterFunction<Event>() { @Override public boolean filter(Event value) { return ... // or condition } }); In Scala: pattern.where(event => ... /* condition */).or(event => ... /* or condition */) Continuity As stated earlier, we do not always need to filter out events. There can always be some pattern where we need continuity instead of filters. Continuity can be of two types – strict continuity and non-strict continuity. Strict continuity Strict continuity needs two events to succeed directly which means there should be no other event in between. This pattern can be defined by next(). In Java: Pattern<Event, ?> strictNext = start.next("middle"); In Scala: val strictNext: Pattern[Event, _] = start.next("middle") Non-strict continuity Non-strict continuity can be stated as other events are allowed to be in between the specific two events. This pattern can be defined by followedBy(). In Java: Pattern<Event, ?> nonStrictNext = start.followedBy("middle"); In Scala: val nonStrictNext : Pattern[Event, _] = start.followedBy("middle") Within Pattern API also allows us to do pattern matching based on time intervals. We can define a time-based temporal constraint as follows. 
In Java:

next.within(Time.seconds(30));

In Scala:

next.within(Time.seconds(10))

Detecting patterns

To detect patterns against the stream of events, we need to run the stream through the pattern. CEP.pattern() returns a PatternStream. The following code snippet shows how we can detect a pattern. First, the pattern is defined to check whether the temperature value is greater than 26.0 degrees within 10 seconds.

In Java:

Pattern<TemperatureEvent, ?> warningPattern = Pattern.<TemperatureEvent> begin("first")
    .subtype(TemperatureEvent.class).where(new FilterFunction<TemperatureEvent>() {
      public boolean filter(TemperatureEvent value) {
        if (value.getTemperature() >= 26.0) {
          return true;
        }
        return false;
      }
    }).within(Time.seconds(10));

PatternStream<TemperatureEvent> patternStream = CEP.pattern(inputEventStream, warningPattern);

In Scala:

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val input = // data
val pattern: Pattern[TempEvent, _] = Pattern.begin("start").where(event => event.temp >= 26.0)
val patternStream: PatternStream[TempEvent] = CEP.pattern(input, pattern)

Use case – complex event processing on temperature sensor

In the earlier sections, we learnt about the various features provided by the Flink CEP engine. Now it's time to understand how we can use it in real-world solutions. For that, let's assume we work for a mechanical company which produces some products. In the product factory, there is a need to constantly monitor certain machines. The factory has already set up sensors which keep sending the temperature of the machines at a given time. Now we will be setting up a system that constantly monitors the temperature value and generates an alert if the temperature exceeds a certain value. We can use the following architecture:

Here we will be using Kafka to collect events from sensors. In order to write a Java application, we first need to create a Maven project and add the following dependencies:

<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-cep-scala_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-cep-scala_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-scala_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka-0.9_2.10</artifactId>
  <version>1.0.0</version>
</dependency>

Next, we need to do the following things to use Kafka. First, we need to define a custom Kafka deserializer. This will read bytes from a Kafka topic and convert them into a TemperatureEvent. The following is the code to do this.
EventDeserializationSchema.java: package com.demo.chapter05; import java.io.IOException; import java.nio.charset.StandardCharsets; import org.apache.flink.api.common.typeinfo.TypeInformation; import org.apache.flink.api.java.typeutils.TypeExtractor; import org.apache.flink.streaming.util.serialization.DeserializationSchema; public class EventDeserializationSchema implements DeserializationSchema<TemperatureEvent> { public TypeInformation<TemperatureEvent> getProducedType() { return TypeExtractor.getForClass(TemperatureEvent.class); } public TemperatureEvent deserialize(byte[] arg0) throws IOException { String str = new String(arg0, StandardCharsets.UTF_8); String[] parts = str.split("="); return new TemperatureEvent(parts[0], Double.parseDouble(parts[1])); } public boolean isEndOfStream(TemperatureEvent arg0) { return false; } } Next we create topics in Kafka called temperature: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic temperature Now we move to Java code which would listen to these events in Flink streams: StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); Properties properties = new Properties(); properties.setProperty("bootstrap.servers", "localhost:9092"); properties.setProperty("group.id", "test"); DataStream<TemperatureEvent> inputEventStream = env.addSource( new FlinkKafkaConsumer09<TemperatureEvent>("temperature", new EventDeserializationSchema(), properties)); Next we will define the pattern to check if the temperature is greater than 26.0 degrees Celsius within 10 seconds: Pattern<TemperatureEvent, ?> warningPattern = Pattern.<TemperatureEvent> begin("first").subtype(TemperatureEvent.class).where(new FilterFunction<TemperatureEvent>() { private static final long serialVersionUID = 1L; public boolean filter(TemperatureEvent value) { if (value.getTemperature() >= 26.0) { return true; } return false; } }).within(Time.seconds(10)); Next match this pattern with the stream of events and select the event. We will also add up the alert messages into results stream as shown here: DataStream<Alert> patternStream = CEP.pattern(inputEventStream, warningPattern) .select(new PatternSelectFunction<TemperatureEvent, Alert>() { private static final long serialVersionUID = 1L; public Alert select(Map<String, TemperatureEvent> event) throws Exception { return new Alert("Temperature Rise Detected:" + event.get("first").getTemperature() + " on machine name:" + event.get("first").getMachineName()); } }); In order to know the alerts generated, we will print the results: patternStream.print(); And we execute the stream: env.execute("CEP on Temperature Sensor"); Now we are all set to execute the application. So as and when we get messages in Kafka topics, the CEP will keep on executing. The actual execution will looks like the following. Example input: xyz=21.0 xyz=30.0 LogShaft=29.3 Boiler=23.1 Boiler=24.2 Boiler=27.0 Boiler=29.0 Example output: Connected to JobManager at Actor[akka://flink/user/jobmanager_1#1010488393] 10/09/2016 18:15:55 Job execution switched to status RUNNING. 
10/09/2016 18:15:55 Source: Custom Source(1/4) switched to SCHEDULED 10/09/2016 18:15:55 Source: Custom Source(1/4) switched to DEPLOYING 10/09/2016 18:15:55 Source: Custom Source(2/4) switched to SCHEDULED 10/09/2016 18:15:55 Source: Custom Source(2/4) switched to DEPLOYING 10/09/2016 18:15:55 Source: Custom Source(3/4) switched to SCHEDULED 10/09/2016 18:15:55 Source: Custom Source(3/4) switched to DEPLOYING 10/09/2016 18:15:55 Source: Custom Source(4/4) switched to SCHEDULED 10/09/2016 18:15:55 Source: Custom Source(4/4) switched to DEPLOYING 10/09/2016 18:15:55 CEPPatternOperator(1/1) switched to SCHEDULED 10/09/2016 18:15:55 CEPPatternOperator(1/1) switched to DEPLOYING 10/09/2016 18:15:55 Map -> Sink: Unnamed(1/4) switched to SCHEDULED 10/09/2016 18:15:55 Map -> Sink: Unnamed(1/4) switched to DEPLOYING 10/09/2016 18:15:55 Map -> Sink: Unnamed(2/4) switched to SCHEDULED 10/09/2016 18:15:55 Map -> Sink: Unnamed(2/4) switched to DEPLOYING 10/09/2016 18:15:55 Map -> Sink: Unnamed(3/4) switched to SCHEDULED 10/09/2016 18:15:55 Map -> Sink: Unnamed(3/4) switched to DEPLOYING 10/09/2016 18:15:55 Map -> Sink: Unnamed(4/4) switched to SCHEDULED 10/09/2016 18:15:55 Map -> Sink: Unnamed(4/4) switched to DEPLOYING 10/09/2016 18:15:55 Source: Custom Source(2/4) switched to RUNNING 10/09/2016 18:15:55 Source: Custom Source(3/4) switched to RUNNING 10/09/2016 18:15:55 Map -> Sink: Unnamed(1/4) switched to RUNNING 10/09/2016 18:15:55 Map -> Sink: Unnamed(2/4) switched to RUNNING 10/09/2016 18:15:55 Map -> Sink: Unnamed(3/4) switched to RUNNING 10/09/2016 18:15:55 Source: Custom Source(4/4) switched to RUNNING 10/09/2016 18:15:55 Source: Custom Source(1/4) switched to RUNNING 10/09/2016 18:15:55 CEPPatternOperator(1/1) switched to RUNNING 10/09/2016 18:15:55 Map -> Sink: Unnamed(4/4) switched to RUNNING 1> Alert [message=Temperature Rise Detected:30.0 on machine name:xyz] 2> Alert [message=Temperature Rise Detected:29.3 on machine name:LogShaft] 3> Alert [message=Temperature Rise Detected:27.0 on machine name:Boiler] 4> Alert [message=Temperature Rise Detected:29.0 on machine name:Boiler] We can also configure a mail client and use some external web hook to send e-mail or messenger notifications. The code for the application can be found on GitHub: https://github.com/deshpandetanmay/mastering-flink. Summary We learnt about complex event processing (CEP). We discussed the challenges involved and how we can use the Flink CEP library to solve CEP problems. We also learnt about Pattern API and the various operators we can use to define the pattern. In the final section, we tried to connect the dots and see one complete use case. With some changes, this setup can be used as it is present in various other domains as well. We will see how to use Flink's built-in Machine Learning library to solve complex problems. Resources for Article: Further resources on this subject: Getting Started with Apache Spark DataFrames [article] Getting Started with Apache Hadoop and Apache Spark [article] Integrating Scala, Groovy, and Flex Development with Apache Maven [article]

Tabular Models

Packt
16 Jan 2017
15 min read
In this article by Derek Wilson, the author of the book Tabular Modeling with SQL Server 2016 Analysis Services Cookbook, you will learn the following recipes: Opening an existing model Importing data Modifying model relationships Modifying model measures Modifying model columns Modifying model hierarchies Creating a calculated table Creating key performance indicators (KPIs) Modifying key performance indicators (KPIs) Deploying a modified model (For more resources related to this topic, see here.) Once the new data is loaded into the model, we will modify various pieces of the model, including adding a new Key Performance Indicator. Next, we will perform calculations to see how to create and modify measures and columns. Opening an existing model We will open the model. To make modifications to your deployed models, we will need to open the model in the Visual Studio designer. How to do it… Open your solution, by navigating to File | Open | Project/Solution. Then select the folder and solution Chapter3_Model and select Open. Your solution is now open and ready for modification. How it works… Visual Studio stores the model as a project inside of a solution. In Chapter 3 we created a new project and saved it as Chapter3_Model. To make modifications to the model we open it in Visual Studio. Importing data The crash data has many columns that store the data in codes. In order to make this data useful for reporting, we need to add description columns. In this section, we will create four code tables by importing data into a SQL Server database. Then, we will add the tables to your existing model. Getting ready In the database on your SQL Server, run the following scripts to create the four tables and populate them with the reference data: Create the Major Cause of Accident Reference Data table: CREATE TABLE [dbo].[MAJCSE_T](   [MAJCSE] [int] NULL,   [MAJOR_CAUSE] [varchar](50) NULL ) ON [PRIMARY] Then, populate the table with data: INSERT INTO MAJCSE_T VALUES (20, 'Overall/rollover'), (21, 'Jackknife'), (31, 'Animal'), (32, 'Non-motorist'), (33, 'Vehicle in Traffic'), (35, 'Parked motor vehicle'), (37, 'Railway vehicle'), (40, 'Collision with bridge'), (41, 'Collision with bridge pier'), (43, 'Collision with curb'), (44, 'Collision with ditch'), (47, 'Collision culvert'), (48, 'Collision Guardrail - face'), (50, 'Collision traffic barrier'), (53, 'impact with Attenuator'), (54, 'Collision with utility pole'), (55, 'Collision with traffic sign'), (59, 'Collision with mailbox'), (60, 'Collision with Tree'), (70, 'Fire'), (71, 'Immersion'), (72, 'Hit and Run'), (99, 'Unknown') Create the table to store the lighting conditions at the time of the crash: CREATE TABLE [dbo].[LIGHT_T](   [LIGHT] [int] NULL,   [LIGHT_CONDITION] [varchar](30) NULL ) ON [PRIMARY] Now, populate the data that shows the descriptions for the codes: INSERT INTO LIGHT_T VALUES (1, 'Daylight'), (2, 'Dusk'), (3, 'Dawn'), (4, 'Dark, roadway lighted'), (5, 'Dark, roadway not lighted'), (6, 'Dark, unknown lighting'), (9, 'Unknown') Create the table to store the road conditions: CREATE TABLE [dbo].[CSRFCND_T](   [CSRFCND] [int] NULL,   [SURFACE_CONDITION] [varchar](50) NULL ) ON [PRIMARY] Now populate the road condition descriptions: INSERT INTO CSRFCND_T VALUES (1, 'Dry'), (2, 'Wet'), (3, 'Ice'), (4, 'Snow'), (5, 'Slush'), (6, 'Sand, Mud'), (7, 'Water'), (99, 'Unknown') Finally, create the weather table: CREATE TABLE [dbo].[WEATHER_T](   [WEATHER] [int] NULL,   [WEATHER_CONDITION] [varchar](30) NULL ) ON [PRIMARY] Then populate the 
weather condition descriptions:

INSERT INTO WEATHER_T VALUES (1, 'Clear'), (2, 'Partly Cloudy'), (3, 'Cloudy'), (5, 'Mist'), (6, 'Rain'), (7, 'Sleet, hail, freezing rain'), (9, 'Severe winds'), (10, 'Blowing Sand'), (99, 'Unknown')

You now have the tables and data required to complete the recipes in this chapter.

How to do it…

1. From your open model, change to the Diagram view in model.bim.
2. Navigate to Model | Import from Data Source, then select Microsoft SQL Server on the Table Import Wizard and click on Next.
3. Set your Server Name to Localhost, change the Database name to Chapter3, and click on Next.
4. Enter your admin account username and password and click on Next.
5. Select the four tables that were created at the beginning from the list of tables.
6. Click on Finish to import the data.

How it works…

This recipe opens the Table Import Wizard and allows us to select the four new tables that are to be added to the existing model. The data is then imported into your Tabular Model workspace. Once imported, the data is ready to be used to enhance the model.

Modifying model relationships

We will create the necessary relationships for the new tables. These relationships will be used in the model in order for the SSAS engine to perform correct calculations.

How to do it…

1. Open your model in the diagram view and you will see the four tables that you imported in the previous recipe.
2. Select the CSRFCND field in the CSRFCND_T table and drag it to the CSRFCND column in the Crash_Data table.
3. Select the LIGHT field in the LIGHT_T table and drag it to the LIGHT column in the Crash_Data table.
4. Select the MAJCSE field in the MAJCSE_T table and drag it to the MAJCSE column in the Crash_Data table.
5. Select the WEATHER field in the WEATHER_T table and drag it to the WEATHER column in the Crash_Data table.

How it works…

Each table in this section has a relationship built between its code column and the corresponding column in the Crash_Data table. These relationships allow DAX calculations to be applied across the data tables.

Modifying model measures

Now that there are more tables in the model, we are going to add an additional measure to perform quick calculations on data. The measure will use a simple DAX calculation, since the focus is on how to add or modify model measures.

How to do it…

1. Open the Chapter 3 model project to the Model.bim folder and make sure you are in grid view.
2. Select the cell under Count_of_Crashes and in the fx bar add the following DAX formula to create Sum_of_Fatalities:

Sum_of_Fatalities:=SUM(Crash_Data[FATALITIES])

Then, hit Enter to create the calculation.

3. In the properties window, enter Injury_Calculations in the Display Folder. Then, change the Format to Whole Number and change Show Thousand Separator to True. Finally, add the Description "Total Number of Fatalities Recorded".

How it works…

In this recipe, we added a new measure to the existing model that calculates the total number of fatalities on the Crash_Data table. Then we added a new folder for the users to see the calculation. We also modified the default behavior of the calculation to display as a whole number and show commas to make the numbers easier to interpret. Finally, we added a description to the calculation that users will be able to see in the reporting tools. If we did not make these changes in the model, each user would be required to make the changes each time they accessed the model. By placing the changes in the model, everyone will see the data in the same format.
Modifying model columns

We will modify the properties of the columns on the WEATHER table. Modifications to the columns in a table make the information easier for your users to understand in the reporting tools. Some properties determine how the SSAS engine uses the fields when creating the model on the server.

How to do it…

1. In Model.bim, make sure you are in the grid view and change to the WEATHER_T tab.
2. Select the WEATHER column to view the available Properties and make the following changes:
   - Hidden property to True
   - Unique property to True
   - Sort By Column: select WEATHER_CONDITION
   - Summarize By to Count
3. Next, select the WEATHER_CONDITION column and modify the following properties:
   - Description: add "Weather at time of crash"
   - Default Label property to True

How it works…

This recipe modified the properties of the columns to make it easier for your report users to access the data. The WEATHER code column was hidden so it will not be visible in the reporting tools, and the WEATHER_CONDITION column was sorted in alphabetical order. You set the default aggregation to Count and then added a description for the column. Now, when this dimension is added to a report, only the WEATHER_CONDITION column will be seen, pre-sorted based on the WEATHER_CONDITION field. It will also use count as the aggregation type to provide the number of each type of weather condition. If you were to add another new description to the table, it would automatically be sorted correctly.

Modifying model hierarchies

Once you have created a hierarchy, you may want to remove or modify the hierarchy in your model. We will make modifications to the Calendar_YQMD hierarchy.

How to do it…

1. Open Model.bim in the diagram view and find the Master_Calendar_T table.
2. Review the Calendar_YQMD hierarchy and its included columns.
3. Select the Quarter_Name column and right-click on it to bring up the menu.
4. Select Remove from Hierarchy to delete Quarter_Name from the hierarchy and confirm on the next screen by selecting Remove from Hierarchy.
5. Select the Calendar_YQMD hierarchy, right-click on it, and select Rename. Change the name to Calendar_YMD and hit Enter.

How it works…

In this recipe, we opened the diagram view and selected the Master_Calendar_T table to find the existing hierarchy. After selecting the Quarter_Name column in the hierarchy, we used the menus to view the available options for modifications. Then we selected the option to remove the column from the hierarchy. Finally, we updated the name of the hierarchy to let users know that the quarter column is not included.

There's more…

Another option to remove fields from the hierarchy is to select the column and then press the Delete key. Likewise, you can double-click on the Calendar_YQMD hierarchy to bring up the edit window for the name. Then edit the name and hit Enter to save the change in the designer.

Creating a calculated table

Calculated tables are created dynamically using functions or DAX queries. They are very useful if you need to create a new table based on information in another table. For example, you could have a date table with 30 years of data; however, most of your users only look at the last five years of information when running most of their analysis. Instead of creating a new table, you can dynamically make a new table that only stores the last five years of dates. You will use a single DAX query to filter the Master_Calendar_T table to the last 5 years of data.

How to do it…

1. Open Model.bim to the grid view and then select the Table menu and New Calculated Table.
2. A new data tab is created. In the function box, enter this DAX formula to create a date calendar for the last 5 years:

FILTER(MasterCalendar_T, MasterCalendar_T[Date]>=DATEADD(MasterCalendar_T[Date],6,YEAR))

3. Double-click on the CalculatedTable 1 tab and rename it to Last_5_Years_T.

How it works…

It works by creating a new table in the model that is built from a DAX formula. In order to limit the number of years shown, the DAX formula reduces the dates available to the last 5 years.

There's more…

After you create a calculated table, you will need to create the necessary relationships and hierarchies just like a regular table:

1. Switch to the diagram view in the model.bim and you will be able to see the new table.
2. Create a new hierarchy, name it Last_5_Years_YQM, and include Year, Quarter_Name, Month_Name, and Date.
3. Replace the Master_Calendar_T relationship with a relationship from the Last_5_Years_T Date column to the Crash_Data.Crash_Date column.

Now, the model will only display the last 5 years of crash data when using the Last_5_Years_T table in the reporting tools. The Crash_Data table still contains all of the records if you need to view more than 5 years of data.

Creating key performance indicators (KPIs)

Key performance indicators are business metrics that show the effectiveness of a business objective. They are used to track actual performance against a budgeted or planned value, such as Service Level Agreements or on-time performance. The advantage of creating a KPI is the ability to quickly see the actual value compared to the target value. To add a KPI, you will need to have a measure to use as the actual and another measure that returns the target value. In this recipe, we will create a KPI that tracks the number of fatalities and compares them to the prior year, with the goal of having fewer fatalities each year.

How to do it…

1. Open Model.bim to the grid view, select an empty cell, and create a new measure named Last_Year_Fatalities:

Last_Year_Fatalities:=CALCULATE(SUM(Crash_Data[FATALITIES]),DATEADD(MasterCalendar_T[Date],-1, YEAR))

2. Select the already existing Sum_of_Fatalities measure, then right-click and select Create KPI….
3. On the Key Performance Indicator (KPI) window, select Last_Year_Fatalities as the Target Measure.
4. Then, select the second set of icons that have red, yellow, and green with symbols.
5. Finally, change the KPI color scheme to green, yellow, and red, make the scores 90 and 97, and then click on OK.

The Sum_of_Fatalities measure will now have a small graph next to it in the measure grid to show that there is a KPI on that measure.

How it works…

You created a new calculation that compares the actual count of fatalities to the same number for the prior year. Then you created a new KPI that used the actual and the Last_Year_Fatalities measure. In the KPI window, you set up thresholds to determine when a KPI is red, yellow, or green. For this example, you want to show that having fewer fatalities year over year is better. Therefore, when the KPI is 97% or higher, the KPI will show red. For values that are in the range of 90% to 97%, the KPI is yellow, and anything below 90% is green. By selecting the icons with both color and symbols, users that are color-blind can still determine the appropriate state of the KPI.

Modifying key performance indicators (KPIs)

Once you have created a KPI, you may want to remove or modify the KPI in your model. You will make modifications to the KPI that uses the Last_Year_Fatalities measure.
How to do it…

1. Open Model.bim to the Grid view, select the Sum_of_Fatalities measure, then right-click to bring up Edit KPI settings….
2. Edit the appropriate settings to modify the existing KPI.

How it works…

Just like models, KPIs will need to be modified after being initially designed. The icon next to a measure denotes that a KPI is defined on the measure. Right-clicking on the measure brings up the menu that allows you to enter the Edit KPI settings.

Deploying a modified model

Once you have completed the changes to your model, you have two options for deployment. First, you can deploy the model and replace the existing model. Alternatively, you can change the name of your model and deploy it as a new model. This is often useful when you need to test changes and maintain the existing model as is.

How to do it…

1. Open the Chapter3_model project in Visual Studio.
2. Select the Project menu and select Chapter3_Model Properties… to bring up the Properties menu and review the Server and Database properties. To overwrite an existing model, make no changes and click on OK.
3. Select the Build menu from the Chapter3_Model project and select the Deploy Chapter3_Model option.
4. On the following screens, enter the impersonation credentials for your data and hit OK to deploy the changes.

How it works…

Deployment takes the model that is on your local machine and submits the changes to the server. By not making any changes to the existing model properties, a new deployment will overwrite the old model. All of your changes are now published on the server and users can begin to leverage the changes.

There's more…

Sometimes you might want to deploy your model to a different database without overwriting the existing environment. This could be to try out a new model or to test different functionality with users before you implement it. You can modify the properties of the project to deploy to a different server, such as development, UAT, or production. Likewise, you can also change the database name to deploy the model to the same server or to different servers for testing.

1. Open the Project menu and then select Chapter3_Model Properties. Change the name of the Database to Chapter4_Model and click on OK.
2. Next, on the Build menu, select Deploy Chapter3_Model to deploy the model to the same server under the new name of Chapter4_Model.

When you review the Analysis Services databases in SQL Server Management Studio, you will now see a database for Chapter3_Model and one for Chapter4_Model.

Summary

After building a model, we need to maintain and enhance it as the business users update or change their requirements. We began by adding additional tables to the model that contain the descriptive data columns for several code columns. Then we created relationships between these new tables and the existing data tables.

Resources for Article:

Further resources on this subject:
- Say Hi to Tableau [article]
- Data Tables and DataTables Plugin in jQuery 1.3 with PHP [article]
- Data Science with R [article]
ML Package

Packt
11 Jan 2017
18 min read
In this article, Denny Lee, the author of the book Learning PySpark, provides a brief implementation and theory of the ML package. So, let's get to it! In this article, we will reuse a portion of the dataset. The data can be downloaded from http://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz. (For more resources related to this topic, see here.) Overview of the package At the top level, the package exposes three main abstract classes: a Transformer, an Estimator, and a Pipeline. We will shortly explain each with some short examples. Transformer The Transformer class, like the name suggests, transforms your data by (normally) appending a new column to your DataFrame. At the high level, when deriving from the Transformer abstract class, each and every new Transformer needs to implement a .transform(...) method. The method, as a first and normally the only obligatory parameter, requires passing a DataFrame to be transformed. This, of course, varies method-by-method in the ML package: other popular parameters are inputCol and outputCol; these, however, frequently default to some predefined values, such as 'features' for the inputCol parameter. There are many Transformers offered in the spark.ml.feature module and we will briefly describe them here: Binarizer: Given a threshold, the method takes a continuous variable and transforms it into a binary one. Bucketizer: Similar to the Binarizer, this method takes a list of thresholds (the splits parameter) and transforms a continuous variable into a multinomial one. ChiSqSelector: For the categorical target variables (think, classification models), the feature allows you to select a predefined number of features (parameterized by the numTopFeatures parameter) that explain the variance in the target the best. The selection is done, as the name of the method suggests, using a Chi-Square test. It is one of the two-step methods: first, you need to .fit(...) your data (so the method can calculate the Chi-square tests). Calling the .fit(...) method (you pass your DataFrame as a parameter) returns a ChiSqSelectorModel object that you can then use to transform your DataFrame using the .transform(...) method. More information on Chi-square can be found here: http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.html. CountVectorizer: Useful for a tokenized text (such as [['Learning', 'PySpark', 'with', 'us'],['us', 'us', 'us']]). It is another of the two-step methods: first, you need to .fit(...), that is, learn the patterns from your dataset, before you can .transform(...) with the CountVectorizerModel returned by the .fit(...) method. The output from this transformer, for the tokenized text presented previously, would look similar to this: [(4, [0, 1, 2, 3], [1.0, 1.0, 1.0, 1.0]),(4, [3], [3.0])]. DCT: The Discrete Cosine Transform takes a vector of real values and returns a vector of the same length, but with the sum of cosine functions oscillating at different frequencies. Such transformations are useful to extract some underlying frequencies in your data or in data compression. ElementwiseProduct: A method that returns a vector with elements that are products of the vector passed to the method, and a vector passed as the scalingVec parameter. For example, if you had a [10.0, 3.0, 15.0] vector and your scalingVec was [0.99, 3.30, 0.66], then the vector you would get would look as follows: [9.9, 9.9, 9.9]. A quick sketch showing the Binarizer and Bucketizer in action follows below; the remaining transformers from spark.ml.feature are described right after it.
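The following is a minimal sketch of how such transformers are used in practice, applying the Binarizer and Bucketizer to a tiny DataFrame. It assumes a SparkSession is already available as spark (as in the rest of this article); the column names and threshold values are invented for the example:

import pyspark.ml.feature as ft

# A toy DataFrame with a single continuous column.
df = spark.createDataFrame(
    [(12.0,), (25.0,), (44.0,), (67.0,)], ['age'])

# Binarizer: values above the threshold become 1.0, the rest become 0.0.
binarizer = ft.Binarizer(
    threshold=40.0, inputCol='age', outputCol='age_over_40')

# Bucketizer: each value is assigned to the bucket defined by the splits.
bucketizer = ft.Bucketizer(
    splits=[0.0, 18.0, 35.0, 60.0, float('inf')],
    inputCol='age', outputCol='age_bucket')

# Each transformer appends a new column and leaves the original data intact.
bucketizer.transform(binarizer.transform(df)).show()

Note that both transformers only add columns; the original age column is untouched, which is the general pattern for all of the Transformers listed here.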
HashingTF: A hashing trick transformer that takes a list of tokenized text and returns a vector (of predefined length) with counts. From PySpark's documentation: Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns. IDF: The method computes an Inverse Document Frequency for a list of documents. Note that the documents need to already be represented as a vector (for example, using either the HashingTF or CountVectorizer). IndexToString: A complement to the StringIndexer method. It uses the encoding from the StringIndexerModel object to reverse the string index to original values. MaxAbsScaler: Rescales the data to be within the [-1, 1] range (thus, does not shift the center of the data). MinMaxScaler: Similar to the MaxAbsScaler with the difference that it scales the data to be in the [0.0, 1.0] range. NGram: The method that takes a list of tokenized text and returns n-grams: pairs, triples, or n-mores of subsequent words. For example, if you had a ['good', 'morning', 'Robin', 'Williams'] vector you would get the following output: ['good morning', 'morning Robin', 'Robin Williams']. Normalizer: A method that scales the data to be of unit norm using the p-norm value (by default, it is L2). OneHotEncoder: A method that encodes a categorical column to a column of binary vectors. PCA: Performs the data reduction using principal component analysis. PolynomialExpansion: Performs a polynomial expansion of a vector. For example, if you had a vector symbolically written as [x, y, z], the method would produce the following expansion: [x, x*x, y, x*y, y*y, z, x*z, y*z, z*z]. QuantileDiscretizer: Similar to the Bucketizer method, but instead of passing the splits parameter you pass the numBuckets one. The method then decides, by calculating approximate quantiles over your data, what the splits should be. RegexTokenizer: String tokenizer using regular expressions. RFormula: For those of you who are avid R users - you can pass a formula such as vec ~ alpha * 3 + beta (assuming your DataFrame has the alpha and beta columns) and it will produce the vec column given the expression. SQLTransformer: Similar to the previous, but instead of R-like formulas you can use SQL syntax. The FROM statement should be selecting from __THIS__ indicating you are accessing the DataFrame. For example: SELECT alpha * 3 + beta AS vec FROM __THIS__. StandardScaler: Standardizes the column to have 0 mean and standard deviation equal to 1. StopWordsRemover: Removes stop words (such as 'the' or 'a') from a tokenized text. StringIndexer: Given a list of all the words in a column, it will produce a vector of indices. Tokenizer: Default tokenizer that converts the string to lower case and then splits on space(s). VectorAssembler: A highly useful transformer that collates multiple numeric (vectors included) columns into a single column with a vector representation. For example, if you had three columns in your DataFrame: df = spark.createDataFrame( [(12, 10, 3), (1, 4, 2)], ['a', 'b', 'c']) The output of calling: ft.VectorAssembler(inputCols=['a', 'b', 'c'], outputCol='features') .transform(df) .select('features') .collect() It would look as follows: [Row(features=DenseVector([12.0, 10.0, 3.0])), Row(features=DenseVector([1.0, 4.0, 2.0]))] VectorIndexer: A method for indexing categorical column into a vector of indices. 
It works in a column-by-column fashion, selecting distinct values from the column, sorting and returning an index of the value from the map instead of the original value. VectorSlicer: Works on a feature vector, either dense or sparse: given a list of indices it extracts the values from the feature vector. Word2Vec: The method takes a sentence (string) as an input and transforms it into a map of {string, vector} format, a representation that is useful in natural language processing. Note that there are many methods in the ML package that have an E letter next to it; this means the method is currently in beta (or Experimental) and it sometimes might fail or produce erroneous results. Beware. Estimators Estimators can be thought of as statistical models that need to be estimated to make predictions or classify your observations. If deriving from the abstract Estimator class, the new model has to implement the .fit(...) method that fits the model given the data found in a DataFrame and some default or user-specified parameters. There are a lot of estimators available in PySpark and we will now shortly describe the models available in Spark 2.0. Classification The ML package provides a data scientist with seven classification models to choose from. These range from the simplest ones (such as Logistic Regression) to more sophisticated ones. We will provide short descriptions of each of them in the following section: LogisticRegression: The benchmark model for classification. The logistic regression uses logit function to calculate the probability of an observation belonging to a particular class. At the time of writing, the PySpark ML supports only binary classification problems. DecisionTreeClassifier: A classifier that builds a decision tree to predict a class for an observation. Specifying the maxDepth parameter limits the depth the tree grows, the minInstancePerNode determines the minimum number of observations in the tree node required to further split, the maxBins parameter specifies the maximum number of bins the continuous variables will be split into, and the impurity specifies the metric to measure and calculate the information gain from the split. GBTClassifier: A Gradient Boosted Trees classification model for classification. The model belongs to the family of ensemble models: models that combine multiple weak predictive models to form a strong one. At the moment the GBTClassifier model supports binary labels, and continuous and categorical features. RandomForestClassifier: The models produce multiple decision trees (hence the name - forest) and use the mode output of those decision trees to classify observations. The RandomForestClassifier supports both binary and multinomial labels. NaiveBayes: Based on the Bayes' theorem, this model uses conditional probability theory to classify observations. The NaiveBayes model in PySpark ML supports both binary and multinomial labels. MultilayerPerceptronClassifier: A classifier that mimics the nature of a human brain. Deeply rooted in the Artificial Neural Networks theory, the model is a black-box, that is, it is not easy to interpret the internal parameters of the model. 
The model consists, at a minimum, of three, fully connected layers (a parameter that needs to be specified when creating the model object) of artificial neurons: the input layer (that needs to be equal to the number of features in your dataset), a number of hidden layers (at least one), and an output layer with the number of neurons equal to the number of categories in your label. All the neurons in the input and hidden layers have a sigmoid activation function, whereas the activation function of the neurons in the output layer is softmax. OneVsRest: A reduction of a multiclass classification to a binary one. For example, in the case of a multinomial label, the model can train multiple binary logistic regression models. For example, if label == 2 the model will build a logistic regression where it will convert the label == 2 to 1 (or else label values would be set to 0) and then train a binary model. All the models are then scored and the model with the highest probability wins. Regression There are seven models available for regression tasks in the PySpark ML package. As with classification, these range from some basic ones (such as obligatory Linear Regression) to more complex ones: AFTSurvivalRegression: Fits an Accelerated Failure Time regression model; It is a parametric model that assumes that a marginal effect of one of the features accelerates or decelerates a life expectancy (or process failure). It is highly applicable for the processes with well-defined stages. DecisionTreeRegressor: Similar to the model for classification with an obvious distinction that the label is continuous instead of binary (or multinomial). GBTRegressor: As with the DecisionTreeRegressor, the difference is the data type of the label. GeneralizedLinearRegression: A family of linear models with differing kernel functions (link functions). In contrast to the linear regression that assumes normality of error terms, the GLM allows the label to have different error term distributions: the GeneralizedLinearRegression model from the PySpark ML package supports gaussian, binomial, gamma, and poisson families of error distributions with a host of different link functions. IsotonicRegression: A type of regression that fits a free-form, non-decreasing line to your data. It is useful to fit the datasets with ordered and increasing observations. LinearRegression: The most simple of regression models, assumes linear relationship between features and a continuous label, and normality of error terms. RandomForestRegressor: Similar to either DecisionTreeRegressor or GBTRegressor, the RandomForestRegressor fits a continuous label instead of a discrete one. Clustering Clustering is a family of unsupervised models that is used to find underlying patterns in your data. The PySpark ML package provides four most popular models at the moment: BisectingKMeans: A combination of k-means clustering method and hierarchical clustering. The algorithm begins with all observations in a single cluster and iteratively splits the data into k clusters. Check out this website for more information on pseudo-algorithm: http://minethedata.blogspot.com/2012/08/bisecting-k-means.html. KMeans: It is the famous k-mean algorithm that separates data into k clusters, iteratively searching for centroids that minimize the sum of square distances between each observation and the centroid of the cluster it belongs to. GaussianMixture: This method uses k Gaussian distributions with unknown parameters to dissect the dataset. 
Using the Expectation-Maximization algorithm, the parameters for the Gaussians are found by maximizing the log-likelihood function. Beware that for datasets with many features this model might perform poorly due to the curse of dimensionality and numerical issues with Gaussian distributions. LDA: This model is used for topic modeling in natural language processing applications. There is also one recommendation model available in PySpark ML, but I will refrain from describing it here. Pipeline A Pipeline in PySpark ML is a concept of an end-to-end transformation-estimation process (with distinct stages) that ingests some raw data (in a DataFrame form), performs necessary data carpentry (transformations), and finally estimates a statistical model (estimator). A Pipeline can be purely transformative, that is, consisting of Transformers only. A Pipeline can be thought of as a chain of multiple discrete stages. When a .fit(...) method is executed on a Pipeline object, all the stages are executed in the order they were specified in the stages parameter; the stages parameter is a list of Transformer and Estimator objects. The .fit(...) method of the Pipeline object executes the .transform(...) method for the Transformers and the .fit(...) method for the Estimators. Normally, the output of a preceding stage becomes the input for the following stage: when deriving from either the Transformer or Estimator abstract classes, one needs to implement the .getOutputCol() method that returns the value of the outputCol parameter specified when creating an object. Predicting chances of infant survival with ML In this section, we will use the portion of the dataset to present the ideas of PySpark ML. If you have not yet downloaded the data, it can be accessed here: http://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz. In this section, we will, once again, attempt to predict the chances of the survival of an infant. Loading the data First, we load the data with the help of the following code: import pyspark.sql.types as typ labels = [ ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()), ('BIRTH_PLACE', typ.StringType()), ('MOTHER_AGE_YEARS', typ.IntegerType()), ('FATHER_COMBINED_AGE', typ.IntegerType()), ('CIG_BEFORE', typ.IntegerType()), ('CIG_1_TRI', typ.IntegerType()), ('CIG_2_TRI', typ.IntegerType()), ('CIG_3_TRI', typ.IntegerType()), ('MOTHER_HEIGHT_IN', typ.IntegerType()), ('MOTHER_PRE_WEIGHT', typ.IntegerType()), ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()), ('MOTHER_WEIGHT_GAIN', typ.IntegerType()), ('DIABETES_PRE', typ.IntegerType()), ('DIABETES_GEST', typ.IntegerType()), ('HYP_TENS_PRE', typ.IntegerType()), ('HYP_TENS_GEST', typ.IntegerType()), ('PREV_BIRTH_PRETERM', typ.IntegerType()) ] schema = typ.StructType([ typ.StructField(e[0], e[1], False) for e in labels ]) births = spark.read.csv('births_transformed.csv.gz', header=True, schema=schema) We specify the schema of the DataFrame; our severely limited dataset now only has 17 columns. Creating transformers Before we can use the dataset to estimate a model, we need to do some transformations. Since statistical models can only operate on numeric data, we will have to encode the BIRTH_PLACE variable. Before we do any of this, since we will use a number of different feature transformations. Let's import them all: import pyspark.ml.feature as ft To encode the BIRTH_PLACE column, we will use the OneHotEncoder method. 
However, the method cannot accept StringType columns - it can only deal with numeric types so first we will cast the column to an IntegerType: births = births .withColumn( 'BIRTH_PLACE_INT', births['BIRTH_PLACE'] .cast(typ.IntegerType())) Having done this, we can now create our first Transformer: encoder = ft.OneHotEncoder( inputCol='BIRTH_PLACE_INT', outputCol='BIRTH_PLACE_VEC') Let's now create a single column with all the features collated together. We will use the VectorAssembler method: featuresCreator = ft.VectorAssembler( inputCols=[ col[0] for col in labels[2:]] + [encoder.getOutputCol()], outputCol='features' ) The inputCols parameter passed to the VectorAssembler object is a list of all the columns to be combined together to form the outputCol - the 'features'. Note that we use the output of the encoder object (by calling the .getOutputCol() method), so we do not have to remember to change this parameter's value should we change the name of the output column in the encoder object at any point. It's now time to create our first estimator. Creating an estimator In this example, we will (once again) use the Logistic Regression model. However, we will showcase some more complex models from the .classification set of PySpark ML models, so we load the whole section: import pyspark.ml.classification as cl Once loaded, let's create the model by using the following code: logistic = cl.LogisticRegression( maxIter=10, regParam=0.01, labelCol='INFANT_ALIVE_AT_REPORT') We would not have to specify the labelCol parameter if our target column had the name 'label'. Also, if the output of our featuresCreator would not be called 'features' we would have to specify the featuresCol by (most conveniently) calling the getOutputCol() method on the featuresCreator object. Creating a pipeline All that is left now is to create a Pipeline and fit the model. First, let's load the Pipeline from the ML package: from pyspark.ml import Pipeline Creating Pipeline is really easy. Here's how our pipeline should look like conceptually: Converting this structure into a Pipeline is a walk in the park: pipeline = Pipeline(stages=[ encoder, featuresCreator, logistic ]) That's it! Our pipeline is now created so we can (finally!) estimate the model. Fitting the model Before you fit the model we need to split our dataset into training and testing datasets. Conveniently, the DataFrame API has the .randomSplit(...) method: births_train, births_test = births .randomSplit([0.7, 0.3], seed=666) The first parameter is a list of dataset proportions that should end up in, respectively, births_train and births_test subsets. The seed parameter provides a seed to the randomizer. You can also split the dataset into more than two subsets as long as the elements of the list sum up to 1, and you unpack the output into as many subsets. For example, we could split the births dataset into three subsets like this: train, test, val = births. randomSplit([0.7, 0.2, 0.1], seed=666) The preceding code would put a random 70% of the births dataset into the train object, 20% would go to the test, and the val DataFrame would hold the remaining 10%. Now it is about time to finally run our pipeline and estimate our model: model = pipeline.fit(births_train) test_model = model.transform(births_test) The .fit(...) method of the pipeline object takes our training dataset as an input. Under the hood, the births_train dataset is passed first to the encoder object. 
The DataFrame that is created at the encoder stage then gets passed to the featuresCreator that creates the 'features' column. Finally, the output from this stage is passed to the logistic object that estimates the final model. The .fit(...) method returns the PipelineModel object (the model object in the preceding snippet) that can then be used for prediction; we obtain this by calling the .transform(...) method and passing the testing dataset created earlier. Here's what the test_model looks like after running the following command: test_model.take(1) It generates the following output: As you can see, we get all the columns from the Transformers and Estimators. The logistic regression model outputs several columns: the rawPrediction is the value of the linear combination of features and the β coefficients, probability is the calculated probability for each of the classes, and finally, the prediction, which is our final class assignment. Evaluating the performance of the model Obviously, we would now like to test how well our model did. PySpark exposes a number of evaluation methods for classification and regression in the .evaluation section of the package: import pyspark.ml.evaluation as ev We will use the BinaryClassificationEvaluator to test how well our model performed: evaluator = ev.BinaryClassificationEvaluator( rawPredictionCol='probability', labelCol='INFANT_ALIVE_AT_REPORT') The rawPredictionCol can either be the rawPrediction column produced by the estimator or the probability. Let's see how well our model performed: print(evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderROC'})) print(evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderPR'})) The preceding code produces the following result: An area under the ROC of 74% and an area under the PR curve of 71% show a well-defined model, but nothing extraordinary; if we had other features, we could drive this up. Saving the model PySpark allows you to save the Pipeline definition for later use. It not only saves the pipeline structure, but also all the definitions of all the Transformers and Estimators: pipelinePath = './infant_oneHotEncoder_Logistic_Pipeline' pipeline.write().overwrite().save(pipelinePath) So, you can load it up later and use it straight away to .fit(...) and predict: loadedPipeline = Pipeline.load(pipelinePath) loadedPipeline.fit(births_train).transform(births_test).take(1) The preceding code produces the same result (as expected): Summary In this article, we studied the ML package. We explained what Transformer and Estimator are, and showed their role in another concept introduced in the ML library: the Pipeline. Subsequently, we also presented how to use some of the methods to fine-tune the hyper parameters of models. Finally, we gave some examples of how to use some of the feature extractors and models from the library. Resources for Article: Further resources on this subject: Package Management [article] Everything in a Package with concrete5 [article] Writing a Package in Python [article]
Metric Analytics with Metricbeat

Packt
11 Jan 2017
5 min read
In this article by Bahaaldine Azarmi, the author of the book Learning Kibana 5.0, we will learn about metric analytics, which is fundamentally different in terms of data structure. (For more resources related to this topic, see here.) The author would like to spend a few lines on the following question: What is a metric? A metric is an event that contains a timestamp and usually one or more numeric values. It is appended to a metric file sequentially, where all lines of metrics are ordered based on the timestamp. As an example, here are a few system metrics:
02:30:00 AM    all    2.58    0.00    0.70    1.12    0.05    95.55
02:40:00 AM    all    2.56    0.00    0.69    1.05    0.04    95.66
02:50:00 AM    all    2.64    0.00    0.65    1.15    0.05    95.50
Unlike logs, metrics are sent periodically, for example, every 10 minutes (as the preceding example illustrates), whereas logs are usually appended to the log file when something happens. Metrics are often used in the context of software or hardware health monitoring, such as resource utilization monitoring, database execution metrics monitoring, and so on. Since version 5.0, Elastic has had, at all layers of the solution, new features to enhance the user experience of metrics management and analytics. Metricbeat is one of the new features in 5.0. It allows the user to ship metrics data, whether from the machine or from applications, to Elasticsearch, and comes with out-of-the-box dashboards for Kibana. Kibana also integrates Timelion into its core, a plugin which has been made for manipulating numeric data, such as metrics. In this article, we'll start by working with Metricbeat. Metricbeat in Kibana The procedure to import the dashboard is laid out in the subsequent section. Importing the dashboard Before importing the dashboard, let's have a look at the actual metric data that Metricbeat ships. As I have Chrome opened while typing this article, I'm going to filter the data by process name, here chrome:
Discover tab filtered by process name
Here is an example of one of the documents I have:
{
  "_index": "metricbeat-2016.09.06",
  "_type": "metricsets",
  "_id": "AVcBFstEVDHwfzZYZHB8",
  "_score": 4.29527,
  "_source": {
    "@timestamp": "2016-09-06T20:00:53.545Z",
    "beat": {
      "hostname": "MacBook-Pro-de-Bahaaldine.local",
      "name": "MacBook-Pro-de-Bahaaldine.local"
    },
    "metricset": {
      "module": "system",
      "name": "process",
      "rtt": 5916
    },
    "system": {
      "process": {
        "cmdline": "/Applications/Google Chrome.app/Contents/Versions/52.0.2743.116/Google Chrome Helper.app/Contents/MacOS/Google Chrome Helper --type=ppapi --channel=55142.2188.1032368744 --ppapi-flash-args --lang=fr",
        "cpu": {
          "start_time": "09:52",
          "total": {
            "pct": 0.0035
          }
        },
        "memory": {
          "rss": {
            "bytes": 67813376,
            "pct": 0.0039
          },
          "share": 0,
          "size": 3355303936
        },
        "name": "Google Chrome H",
        "pid": 76273,
        "ppid": 55142,
        "state": "running",
        "username": "bahaaldine"
      }
    },
    "type": "metricsets"
  },
  "fields": {
    "@timestamp": [
      1473192053545
    ]
  }
}
Metricbeat document example
The preceding document breaks down the utilization of resources for the chrome process. We can see, for example, the usage of CPU and memory, as well as the state of the process as a whole. Now how about visualizing the data in an actual dashboard?
To do so, go into the Kibana folder located in the Metricbeat installation directory: MacBook-Pro-de-Bahaaldine:kibana bahaaldine$ pwd /elastic/metricbeat-5.0.0/kibana MacBook-Pro-de-Bahaaldine:kibana bahaaldine$ ls dashboard import_dashboards.ps1 import_dashboards.sh index-pattern search visualization import_dashboards.sh is the file we will use to import the dashboards in Kibana. Execute the file script like the following: ./import_dashboards.sh –h This should print out the help, which, essentially, will give you the list of arguments you can pass to the script. Here, we need to specify a username and a password as we are using the X-Pack security plugin, which secures our cluster: ./import_dashboards.sh –u elastic:changeme You should normally get a bunch of logs stating that dashboards have been imported, as shown in the following example: Import visualization Servers-overview: {"_index":".kibana","_type":"visualization","_id":"Servers-overview","_version":4,"forced_refresh":false,"_shards":{"total":2,"successful":1,"failed":0},"created":false} Now, at this point, you have metric data in Elasticsearch and dashboards created in Kibana, so you can now visualize the data. Visualizing metrics If you go back into the Kibana/dashboard section and try to open the Metricbeat System Statistics dashboard, you should get something similar to the following: Metricbeat Kibana dashboard You should see in your own dashboard the metric based on the processes that are running on your computer. In my case, I have a bunch of them for which I can visualize the CPU and memory utilization, for example: RAM and CPU utilization As an example, what can be important here is to be sure that Metricbeat has a very low footprint on the overall system in terms of CPU or RAM, as shown here: Metricbeat resource utilization As we can see in the preceding diagram, Metricbeat only uses about 0.4% of the CPU and less than 0.1% of the memory on my Macbook Pro. On the other hand, if I want to get the most resource-consuming processes, I can check in the Top processes data table, which gives the following information: Top processes Besides Google Chrome H, which uses a lot of CPU, zoom.us, a conferencing application, seems to bring a lot of stress to my laptop. Rather than using the Kibana standard visualization to manipulate our metrics, we'll use Timelion instead, and focus on this heavy CPU consuming processes use case. Summary In this article, we have seen how we can use Kibana in the context of technical metric analytics. We relied on the data that Metricbeat is able to ship from a machine and visualized the result both in Kibana dashboard and in Kibana Timelion. Resources for Article: Further resources on this subject: An Introduction to Kibana [article] Big Data Analytics [article] Static Data Management [article]
Visualization Dashboard Design

Packt
10 Jan 2017
18 min read
In this article by David Baldwin, the author of the book Mastering Tableau, we will cover how you need to create some effective dashboards. (For more resources related to this topic, see here.) Since that fateful week in Manhattan, I've read Edward Tufte, Stephen Few, and other thought leaders in the data visualization space. This knowledge has been very fruitful. For instance, quite recently a colleague told me that one of his clients thought a particular dashboard had too many bar charts and he wanted some variation. I shared the following two quotes: Show data variation, not design variation. –Edward Tufte in The Visual Display of Quantitative Information Variety might be the spice of life, but, if it is introduced on a dashboard for its own sake, the display suffers. –Stephen Few in Information Dashboard Design Those quotes proved helpful for my colleague. Hopefully the following information will prove helpful to you. Additionally I would also like to draw attention to Alberto Cairo—a relatively new voice providing new insight. Each of these authors should be considered a must-read for anyone working in data visualization. Visualization design theory Dashboard design Sheet selection Visualization design theory Any discussion on designing dashboards should begin with information about constructing well-designed content. The quality of the dashboard layout and the utilization of technical tips and tricks do not matter if the content is subpar. In other words we should consider the worksheets displayed on dashboards and ensure that those worksheets are well-designed. Therefore, our discussion will begin with a consideration of visualization design principles. Regarding these principles, it's tempting to declare a set of rules such as: To plot change over time, use a line graph To show breakdowns of the whole, use a treemap To compare discrete elements, use a bar chart To visualize correlation, use a scatter plot But of course even a cursory review of the preceding list brings to mind many variations and alternatives! Thus, we will consider various rules while always keeping in mind that rules (at least rules such as these) are meant to be broken. Formatting rules The following formatting rules encompass fonts, lines, and bands. Fonts are, of course, an obvious formatting consideration. Lines and bands, however, may not be something you typically think of when formatting, especially when considering formatting from the perspective of Microsoft Word. But if we broaden formatting considerations to think of Adobe Illustrator, InDesign, and other graphic design tools, then lines and bands are certainly considered. This illustrates that data visualization is closely related to graphic design and that formatting considers much more than just textual layout. Rule – keep the font choice simple Typically using one or two fonts on a dashboard is advisable. More fonts can create a confusing environment and interfere with readability. Fonts chosen for titles should be thick and solid while the body fonts should be easy to read. As of Tableau 10.0 choosing appropriate fonts is simple because of the new Tableau Font Family. Go to Format | Font to display the Format Font window to see and choose these new fonts: Assuming your dashboard is primarily intended for the screen, sans serif fonts are best. On the rare occasions a dashboard is primarily intended for print, you may consider serif fonts; particularly if the print resolution is high. 
Rule – Trend line > Fever line > Reference line > Drop line > Zero line > Grid line The preceding pseudo formula is intended to communicate line visibility. For example, trend line visibility should be greater than fever line visibility. Visibility is usually enhanced by increasing line thickness but may be enhanced via color saturation or by choosing a dotted or dashed line over a solid line. The trend line, if present, is usually the most visible line on the graph. Trend lines are displayed via the Analytics pane and can be adjusted via Format à Lines. The fever line (for example, the line used on a time-series chart) should not be so heavy as to obscure twists and turns in the data. Although a fever line may be displayed as dotted or dashed by utilizing the Pages shelf, this is usually not advisable because it may obscure visibility. The thickness of a fever line can be adjusted by clicking on the Size shelf in the Marks View card. Reference lines are usually less prevalent than either fever or trend lines and can be formatted by going to Format | Reference lines. Drop lines are not frequently used. To deploy drop lines, right-click in a blank portion of the view and go to Drop lines | Show drop lines. Next, click on a point in the view to display a drop line. To format droplines, go to Format | Droplines. Drop lines are relevant only if at least one axis is utilized in the visualization. Zero lines (sometimes referred to as base lines) display only if zero or negative values are included in the view or positive numerical values are relatively close to zero. Format zero lines by going to Format | Lines. Grid lines should be the most muted lines on the view and may be dispensed with altogether. Format grid lines by going to Format | Lines. Rule – band in groups of three to five Visualizations comprised of a tall table of text or horizontal bars should segment dimension members in groups of three to five. Exercise – banding Navigate to https://public.tableau.com/profile/david.baldwin#!/ to locate and download the workbook. Navigate to the worksheet titled Banding. Select the Superstore data source and place Product Name on the Rows shelf. Double-click on Discount, Profit, Quantity, and Sales. Navigate to Format | Shading and set Band Size under Row Banding so that three to five lines of text are encompassed by each band. Be sure to set an appropriate color for both Pane and Header: Note that after completing the preceding five steps, Tableau defaulted to banding every other row. This default formatting is fine for a short table but is quite busy for a tall table. The band in groups of three to five rule is influenced by Dona W. Wong, who, in her book The Wall Street Journal Guide to Information Graphics, recommends separating long tables or bar charts with thin rules to separate the bars in groups of three to five to help the readers read across. Color rules It seems slightly ironic to discuss color rules in a black-and-white publication such as Mastering Tableau. Nonetheless, even in a monochromatic setting, a discussion of color is relevant. For example, exclusive use of black text communicates differently than using variations of gray. The following survey of color rules should be helpful to ensure that you use colors effectively in a variety of settings. Rule – keep colors simple and limited Stick to the basic hues and provide only a few (perhaps three to five) hue variations. 
Alberto Cairo, in his book The Functional Art: An Introduction to Information Graphics and Visualization, provides insights into why this is important. The limited capacity of our visual working memory helps explain why it's not advisable to use more than four or five colors or pictograms to identify different phenomena on maps and charts. Rule – respect the psychological implication of colors In Western society, there is a color vocabulary so pervasive, it's second nature. Exit signs marking stairwell locations are red. Traffic cones are orange. Baby boys are traditionally dressed in blue while baby girls wear pink. Similarly, in Tableau reds and oranges should usually be associated with negative performance while blues and greens should be associated with positive performance. Using colors counterintuitively can cause confusion. Rule – be colorblind-friendly Colorblindness is usually manifested as an inability to distinguish red and green or blue and yellow. Red/green and blue/yellow are on opposite sides of the color wheel. Consequently, the challenges these color combinations present for colorblind individuals can be easily recreated with image editing software such as Photoshop. If you are not colorblind, convert an image with these color combinations to grayscale and observe. The challenge presented to the 8.0% of the males and 0.5% of the females who are color blind becomes immediately obvious! Rule – use pure colors sparingly The resulting colors from the following exercise should be a very vibrant red, green, and blue. Depending on the monitor, you may even find it difficult to stare directly at the colors. These are known as pure colors and should be used sparingly; perhaps only to highlight particularly important items. Exercise – using pure colors Open the workbook and navigate to the worksheet entitled Pure Colors. Select the Superstore data source and place Category on both the Rows shelf and the Color shelf. Set the Fit to Entire View. Click on the Color shelf and choose Edit Colors…. In the Edit Colors dialog box, double-click on the color icons to the left of each dimension member; that is, Furniture, Office Supplies, and Technology: Within the resulting dialog box, set furniture to an HTML value of #0000ff, Office Supplies to #ff0000, and Technology to #00ff00. Rule – color variations over symbol variation Deciphering different symbols takes more mental energy for the end user than distinguishing color. Therefore color variation should be used over symbol variation. This rule can actually be observed in Tableau defaults. Create a scatter plot and place a dimension with many members on the Color shelf and Shape shelf respectively. Note that by default, the view will display 20 unique colors but only 10 unique shapes. Older versions of Tableau (such as Tableau 9.0) display warnings that include text such as “…the recommended maximum for this shelf is 10”: Visualization type rules We won't spend time here to delve into a lengthy list of visualization type rules. However, it does seem appropriate to review at least a couple of rules. In the following exercise, we will consider keeping shapes simple and effectively using pie charts. Rule – keep shapes simple Too many shape details impede comprehension. This is because shape details draw the user's focus away from the data. Consider the following exercise on using two different shopping cart images. Exercise – shapes Open the workbook associated and navigate to the worksheet entitled Simple Shopping Cart. 
Note that the visualization is a scatterplot showing the top 10 selling Sub-Categories in terms of total sales and profits. On your computer, navigate to the Shapes directory located in the My Tableau Repository. On my computer, the path is C:UsersDavid BaldwinDocumentsMy Tableau RepositoryShapes. Within the Shapes directory, create a folder named My Shapes. Reference the link included in the comment section of the worksheet to download the assets. In the downloaded material, find the images titled Shopping_Cart and Shopping_Cart_3D and copy those images into the My Shapes directory created previously. Within Tableau, access the Simple Shopping Cart worksheet. Click on the Shape shelf and then select More Shapes. Within the Edit Shape dialog box, click on the Reload Shapes button. Select the My Shapes palette and set the shape to the simple shopping cart. After closing the dialog box, click on the Size shelf and adjust as desired. Also adjust other aspects of the visualization as desired. Navigate to the 3D Shopping Cart worksheet and then repeat steps 8 to 11. Instead of using the simple shopping cart, use the 3D shopping cart: Compare the two visualizations. Which version of the shopping cart is more attractive? Likely the cart with the 3D look was your choice. Why not choose the more attractive image? Making visualizations attractive is only of secondary concern. The primary goal is to display the data as clearly and efficiently as possible. A simple shape is grasped more quickly and intuitively than a complex shape. Besides, the cuteness of the 3D image will quickly wear off. Rule – use pie charts sparingly Edward Tufte makes an acrid (and somewhat humorous) comment against the use of pie charts in his book The Visual Display of Quantitative Information. A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them. Given their low density and failure to order numbers along a visual dimension, pie charts should never be used. The present sentiment in data visualization circles is largely sympathetic to Tufte's criticism. There may, however, be some exceptions; that is, some circumstances where a pie chart is optimal. Consider the following visualization: Which of the four visualizations best demonstrates that A accounts for 25% of the whole? Clearly it is the pie chart! Therefore, perhaps it is fairer to refer to pie charts as limited and to use them sparingly as opposed to considering them inherently evil. Compromises In this section, we will transition from more or less strict rules to compromises. Often, building visualizations is a balancing act. It's common to encounter contradictory directions from books, blogs, consultants, and within organizations. One person may insist on utilizing every pixel of space while another urges simplicity and whitespace. One counsels a guided approach while another recommends building wide open dashboards that allow end users to discover their own path. Avant gardes may crave esoteric visualizations while those of a more conservative bent prefer to stay with the conventional. We now explore a few of the more common competing requests and suggests compromises. Make the dashboard simple versus make the dashboard robust Recently a colleague showed me a complex dashboard he had just completed. 
Although he was pleased that he had managed to get it working well, he felt the need to apologize by saying, “I know it's dense and complex but it's what the client wanted.” Occam's Razor encourages the simplest possible solution for any problem. For my colleague's dashboard, the simplest solution was rather complex. This is OK! Complexity in Tableau dashboarding need not be shunned. But a clear understanding of some basic guidelines can help the author intelligently determine how to compromise between demands for simplicity and demands for robustness. More frequent data updates necessitate simpler design. Some Tableau dashboards may be near-real-time. Third-party technology may be utilized to force a browser displaying a dashboard via Tableau Server to refresh every few minutes to ensure the absolute latest data displays. In such cases, the design should be quite simple. The end user must be able to see at a glance all pertinent data and should not use that dashboard for extensive analysis. Conversely, a dashboard that is refreshed monthly can support high complexity and thus may be used for deep exploration. Greater end user expertise supports greater dashboard complexity. Know thy users. If they want easy, at-a-glance visualizations, keep the dashboards simple. If they like deep dives, design accordingly. Smaller audiences require more precise design. If only a few people monitor a given dashboard, it may require a highly customized approach. In such cases, specifications may be detailed, complex, and difficult to execute and maintain because the small user base has expectations that may not be natively easy to produce in Tableau. Screen resolution and visualization complexity are proportional. Users with low-resolution devices will need to interact fairly simply with a dashboard. Thus the design of such a dashboard will likely be correspondingly uncomplicated. Conversely, high-resolution devices support greater complexity. Greater distance from the screen requires larger dashboard elements. If the dashboard is designed for conference room viewing, the elements on the dashboard may need to be fairly large to meet the viewing needs of those far from the screen. Thus the dashboard will likely be relatively simple. Conversely, a dashboard to be viewed primarily on end users desktops can be more complex. Although these points are all about simple versus complex, do not equate simple with easy. A simple and elegantly designed dashboard can be more difficult to create than a complex dashboard. In the words of Steve Jobs: Simple can be harder than complex: You have to work hard to get your thinking clean to make it simple. But it's worth it in the end because once you get there, you can move mountains. Present dense information versus present sparse information Normally, a line graph should have a maximum of four to five lines. However, there are times when you may wish to display many lines. A compromise can be achieved by presenting many lines and empowering the end user to highlight as desired. The following line graph displays the percentage of Internet usage by country from 2000 to 2012. Those countries with the largest increases have been highlighted. Assuming that Highlight Selected Items has been activated within the Color legend, the end user can select items (countries in this case) from the legend to highlight as desired. 
Or, even better, a worksheet can be created listing all countries and used in conjunction with a highlight action on a dashboard to focus attention on selected items on the line graph: Tell a story versus allow a story to be discovered Albert Cairo, in his excellent book The Functional Art: An Introduction to Information Graphics and Visualization, includes a section where he interviews prominent data visualization and information graphics professionals. Two of these interviews are remarkable for their opposing views. I… feel that many visualization designers try to transform the user into an editor.  They create these amazing interactive tools with tons of bubbles, lines, bars, filters, and scrubber bars, and expect readers to figure the story out by themselves, and draw conclusions from the data. That's not an approach to information graphics I like. – Jim Grimwade The most fascinating thing about the rise of data visualization is exactly that anyone can explore all those large data sets without anyone telling us what the key insight is. – Moritz Stefaner Fortunately, the compromise position can be found in the Jim Grimwade interview: [The New York Times presents] complex sets of data, and they let you go really deep into the figures and their connections. But beforehand, they give you some context, some pointers as to what you can do with those data. If you don't do this… you will end up with a visualization that may look really beautiful and intricate, but that will leave readers wondering, What has this thing really told me? What is this useful for? – Jim Grimwade Although the case scenarios considered in the preceding quotes are likely quite different from the Tableau work you are involved in, the underlying principles remain the same. You can choose to tell a story or build a platform that allows the discovery of numerous stories. Your choice will differ depending on the given dataset and audience. If you choose to create a platform for story discovery, be sure to take the New York Times approach suggested by Grimwade. Provide hints, pointers, and good documentation to lead your end user to successfully interact with the story you wish to tell or successfully discover their own story. Document, Document, Document! But don't use any space! Immediately above we considered the suggestion Provide hints, pointers, and good documentation… but there's an issue. These things take space. Dashboard space is precious. Often Tableau authors are asked to squeeze more and more stuff on a dashboard and are hence looking for ways to conserve space. Here are some suggestions for maximizing documentation on a dashboard while minimally impacting screen real estate. Craft titles for clear communication Titles are expected. Not just a title for a dashboard and worksheets on the dashboard, but also titles for legends, filters and other objects. These titles can be used for effective and efficient documentation. For instance a filter should not just read Market. Instead it should say something like Select a Market. Notice the imperative statement. The user is being told to do something and this is a helpful hint. Adding a couple of words to a title will usually not impact dashboard space. Use subtitles to relay instructions A subtitle will take some extra space but it does not have to be much. A small, italicized font immediately underneath a title is an obvious place a user will look at for guidance. Consider an example: red represents loss. 
This short sentence could be used as a subtitle that may eliminate the need for a legend and thus actually save space. Use intuitive icons Consider a use case of navigating from one dashboard to another. Of course you could associate an action with some hyperlinked text stating Click here to navigate to another dashboard. But this seems quite unnecessary when an action can be associated with a small, innocuous arrow, such as is natively used in PowerPoint, to communicate the same thing. Store more extensive documentation in a tooltip associated with a help icon. A small question mark in the top-right corner of an application is common. This clearly communicates where to go if additional help is required. It's easy to create a similar feature on a Tableau dashboard. Summary In this article, we studied how to create effective dashboards by applying visualization design theory: formatting, color, and visualization type rules, the compromises dashboard design often requires, and ways to document a dashboard without giving up valuable space. Resources for Article: Further resources on this subject: Say Hi to Tableau [article] Tableau Data Extract Best Practices [article] Getting Started with Tableau Public [article]
Elastic Stack Overview

Packt
10 Jan 2017
9 min read
In this article by Ravi Kumar Gupta and Yuvraj Gupta, from the book Mastering Elastic Stack, we will have an overview of Elastic Stack. It is easy to read a log file of a few MBs, or even a few hundred, and just as easy to keep data of this size in databases or files and still get sense out of it. But a day comes when this data takes terabytes and petabytes, and even Notepad++ would refuse to open a data file of a few hundred MBs. Then we start to look for something for huge log management, or something that can index the data properly and make sense out of it. If you Google this, you will stumble upon ELK Stack. Elasticsearch manages your data, Logstash reads the data from different sources, and Kibana makes a fine visualization of it. Recently, ELK Stack has evolved into Elastic Stack. We will get to know about it in this article. The following are the points that will be covered in this article:
Introduction to ELK Stack
The birth of Elastic Stack
Who uses the Stack
(For more resources related to this topic, see here.) Introduction to ELK Stack It all began with Shay Banon, who started Elasticsearch as an open source project, a successor of Compass, which gained popularity to become one of the top open source database engines. Later, based on the distributed model of working, Kibana was introduced to visualize the data present in Elasticsearch. Earlier, to put data into Elasticsearch we had rivers, which provided us with a specific input via which we inserted data into Elasticsearch. However, with growing popularity, this setup required a tool via which we could insert data into Elasticsearch, with the flexibility to perform various transformations on the data to make unstructured data structured and to have full control over how it is processed. Based on this premise, Logstash was born and was then incorporated into the Stack, and together these three tools, Elasticsearch, Logstash, and Kibana, were named ELK Stack. The following diagram is a simple data pipeline using ELK Stack: As we can see from the preceding figure, data is read using Logstash and indexed to Elasticsearch. Later we can use Kibana to read the indices from Elasticsearch and visualize the data using charts and lists. Let's understand these components separately and the role they play in the making of the Stack. Logstash As we got to know, rivers were used initially to put data into Elasticsearch, before ELK Stack. For ELK Stack, Logstash is the entry point for all types of data. Logstash has many input plugins to read data from a number of sources and many output plugins to submit data to a variety of destinations; one of those is the Elasticsearch output plugin, which helps to send data to Elasticsearch. After Logstash became popular, rivers eventually got deprecated, as they made the cluster unstable and performance issues were also observed. Logstash does not simply ship data from one end to another; it helps us with collecting raw data and modifying/filtering it to convert it to something meaningful, formatted, and organized. The updated data is then sent to Elasticsearch. If there is no plugin available to support reading data from a specific source, writing the data to a location, or modifying it in your way, Logstash is flexible enough to allow you to write your own plugins. Simply put, Logstash is open source, highly flexible, rich with plugins, can read your data from your choice of location, normalizes it as per your defined configurations, and sends it to a particular destination as per your requirements.
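Logstash pipelines themselves are defined in Logstash's own configuration files rather than in code, but the end result of the read-filter-ship cycle described above is simply a structured JSON document indexed into Elasticsearch. As a rough illustration of that end result only, the following Python sketch indexes one normalized log event using the official elasticsearch client; the host, index name, and field names are assumptions made for this example, not something prescribed by Logstash:

from datetime import datetime
from elasticsearch import Elasticsearch

# Connect to a local Elasticsearch node (adjust the URL for your setup).
es = Elasticsearch(['http://localhost:9200'])

# A normalized event, similar in spirit to what a Logstash filter would emit.
event = {
    '@timestamp': datetime.utcnow().isoformat(),
    'host': 'web-01',
    'status': 404,
    'request': '/index.html',
    'message': 'GET /index.html HTTP/1.1 404',
}

# Index the event; Logstash's Elasticsearch output plugin does this job in bulk.
es.index(index='logstash-2017.01.10', doc_type='logs', body=event)

# The indexed documents can then be searched (and visualized in Kibana).
print(es.search(index='logstash-*', body={'query': {'match': {'status': 404}}}))

This is only a sketch of the pipeline's destination; in a real deployment you would let Logstash (or a Beat) do the collecting, filtering, and shipping for you.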
Elasticsearch All of the data read by Logstash is sent to Elasticsearch for indexing. There is a lot more than just indexing. Elasticsearch is not only used to index data, but it is a full-text search engine, highly scalable, distributed, and offers many more things. Elasticsearch manages and maintains your data in the form of indices, offers you to query, access, and aggregate the data using its APIs. Elasticsearch is based on Lucene, thus providing you all of the features that Lucene does. Kibana Kibana uses Elasticsearch APIs to read/query data from Elasticsearch indices to visualize and analyze in the form of charts, graphs and tables. Kibana is in the form of a web application, providing you a highly configurable user interface that lets you query the data, create a number of charts to visualize, and make actual sense out of the data stored. After a robust ELK Stack, as time passed, a few important and complex demands took place, such as authentication, security, notifications, and so on. This demand led for few other tools such as Watcher (providing alerting and notification based on changes in data), Shield (authentication and authorization for securing clusters), Marvel (monitoring statistics of the cluster), ES-Hadoop, Curator, and Graph as requirement arose. The birth of Elastic Stack All the jobs of reading data were done using Logstash, but that's resource consuming. Since Logstash runs on JVM, it consumes a good amount of memory. The community realized the need of improvement and to make the pipelining process resource friendly and lightweight. Back in 2015, Packetbeat was born, a project which was an effort to make a network packet analyzer that could read from different protocols, parse the data, and ship to Elasticsearch. Being lightweight in nature did the trick and a new concept of Beats was formed. Beats are written in Go programming language. The project evolved a lot, and now ELK stack was no more than just Elasticsearch, Logstash, and Kibana, but Beats also became a significant component. The pipeline now looked as follows: Beat A Beat reads data, parses it, and can ship to either Elasticsearch or Logstash. The difference is that they are lightweight, serve a specific purpose, and are installed as agents. There are few beats available such as Topbeat, Filebeat, Packetbeat, and so on, which are supported and provided by the Elastic.co and a good number of Beats already written by the community. If you have a specific requirement, you can write your own Beat using the libbeat library. In simple words, Beats can be treated as very light weight agents to ship data to either Logstash or Elasticsearch and offer you an infrastructure using the libbeat library to create your own Beats. Together Elasticsearch, Logstash, Kibana, and Beats become Elastic Stack, formally known as ELK Stack. Elastic Stack did not just add Beats to its team, but they will be using the same version always. The starting version of the Elastic Stack will be 5.0.0 and the same version will apply to all the components. This version and release method is not only for Elastic Stack, but for other tools of the Elastic family as well. Due to so many tools, there was a problem of unification, wherein each tool had their own version and every version was not compatible with each other, hence leading to a problem. To solve this, now all of the tools will be built, tested, and released together. All of these components play a significant role in creating a pipeline. 
While Beats and Logstash are used to collect the data, parse it, and ship it, Elasticsearch creates indices, which is finally used by Kibana to make visualizations. While Elastic Stack helps with a pipeline, other tools add security, notifications, monitoring, and other such capabilities to the setup. Who uses Elastic Stack? In the past few years, implementations of Elastic Stack have been increasing very rapidly. In this section, we will consider a few case studies to understand how Elastic Stack has helped this development. Salesforce Salesforce developed a new plugin named ELF (Event Log Files) to collect Salesforce logged data to enable auditing of user activities. The purpose was to analyze the data to understand user behavior and trends in Salesforce. The plugin is available on GitHub at https://github.com/developerforce/elf_elk_docker. This plugin simplifies the Stack configuration and allows us to download ELF to get indexed and finally sensible data that can be visualized using Kibana. This implementation utilizes Elasticsearch, Logstash, and Kibana. CERN There is not just one use case that Elastic Stack helped CERN (European Organization for Nuclear Research), but five. At CERN, Elastic Stack is used for the following: Messaging Data monitoring Cloud benchmarking Infrastructure monitoring Job monitoring Multiple Kibana dashboards are used by CERN for a number of visualizations. Green Man Gaming This is an online gaming platform where game providers publish their games. The website wanted to make a difference by proving better gameplay. They started using Elastic Stack to do log analysis, search, and analysis of gameplay data. They began with setting up Kibana dashboards to gain insights about the counts of gamers by the country and currency used by gamers. That helped them to understand and streamline the support and help in order to provide an improved response. Apart from these case studies, Elastic Stack is used by a number of other companies to gain insights of the data they own. Sometimes, not all of the components are used, that is, not all of the times a Beat would be used and Logstash would be configured. Sometimes, only an Elasticsearch and Kibana combination is used. If we look at the users within the organization, all of the titles who are expected to do big data analysis, business intelligence, data visualizations, log analysis, and so on, can utilize Elastic Stack for their technical forte. A few of these titles are data scientists, DevOps, and so on. Stack competitors Well, it would be wrong to call for Elastic Stack competitors because Elastic Stack has been emerged as a strong competitor to many other tools in the market in recent years and is growing rapidly. 
A few of these are: Open source: Graylog: Visit https://www.graylog.org/ for more information InfluxDB: Visit https://influxdata.com/ for more information Others: Logscape: Visit http://logscape.com/ for more information Logscene: Visit http://sematext.com/logsene/ for more information Splunk: Visit http://www.splunk.com/ for more information Sumo Logic: Visit https://www.sumologic.com/ for more information Kibana competitors: Grafana: Visit http://grafana.org/ for more information Graphite: Visit https://graphiteapp.org/ for more information Elasticsearch competitors: Lucene/Solr: Visit http://lucene.apache.org/solr/ or https://lucene.apache.org/ for more information Sphinx: Visit http://sphinxsearch.com/ for more information Most of these compare with respect to log management, while Elastic Stack is much more than that. It offers you the ability to analyze any type of data and not just logs. Resources for Article: Further resources on this subject: AIO setup of OpenStack – preparing the infrastructure code environment [article] Our App and Tool Stack [article] Using the OpenStack Dashboard [article]

Exploring Structure from Motion Using OpenCV

Packt
09 Jan 2017
20 min read
In this article by Roy Shilkrot, coauthor of the book Mastering OpenCV 3, we will discuss the notion of Structure from Motion (SfM), or better put, extracting geometric structures from images taken with a camera under motion, using OpenCV's API to help us. First, let's constrain the otherwise very broad approach to SfM using a single camera, usually called a monocular approach, and a discrete and sparse set of frames rather than a continuous video stream. These two constrains will greatly simplify the system we will sketch out in the coming pages and help us understand the fundamentals of any SfM method. In this article, we will cover the following: Structure from Motion concepts Estimating the camera motion from a pair of images (For more resources related to this topic, see here.) Throughout the article, we assume the use of a calibrated camera—one that was calibrated beforehand. Calibration is a ubiquitous operation in computer vision, fully supported in OpenCV using command-line tools. We, therefore, assume the existence of the camera's intrinsic parameters embodied in the K matrix and the distortion coefficients vector—the outputs from the calibration process. To make things clear in terms of language, from this point on, we will refer to a camera as a single view of the scene rather than to the optics and hardware taking the image. A camera has a position in space and a direction of view. Between two cameras, there is a translation element (movement through space) and a rotation of the direction of view. We will also unify the terms for the point in the scene, world, real, or 3D to be the same thing, a point that exists in our real world. The same goes for points in the image or 2D, which are points in the image coordinates, of some real 3D point that was projected on the camera sensor at that location and time. Structure from Motion concepts The first discrimination we should make is the difference between stereo (or indeed any multiview), 3D reconstruction using calibrated rigs, and SfM. A rig of two or more cameras assumes we already know what the "motion" between the cameras is, while in SfM, we don't know what this motion is and we wish to find it. Calibrated rigs, from a simplistic point of view, allow a much more accurate reconstruction of 3D geometry because there is no error in estimating the distance and rotation between the cameras—it is already known. The first step in implementing an SfM system is finding the motion between the cameras. OpenCV may help us in a number of ways to obtain this motion, specifically using the findFundamentalMat and findEssentialMat functions. Let's think for one moment of the goal behind choosing an SfM algorithm. In most cases, we wish to obtain the geometry of the scene, for example, where objects are in relation to the camera and what their form is. Having found the motion between the cameras picturing the same scene, from a reasonably similar point of view, we would now like to reconstruct the geometry. In computer vision jargon, this is known as triangulation, and there are plenty of ways to go about it. It may be done by way of ray intersection, where we construct two rays: one from each camera's center of projection and a point on each of the image planes. The intersection of these rays in space will, ideally, intersect at one 3D point in the real world that was imaged in each camera, as shown in the following diagram: In reality, ray intersection is highly unreliable. 
This is because the rays usually do not intersect, making us fall back to using the middle point on the shortest segment connecting the two rays. OpenCV contains a simple API for a more accurate form of triangulation, the triangulatePoints function, so this part we do not need to code on our own. After you have learned how to recover 3D geometry from two views, we will see how you can incorporate more views of the same scene to get an even richer reconstruction. At that point, most SfM methods try to optimize the bundle of estimated positions of our cameras and 3D points by means of Bundle Adjustment. OpenCV contains means for Bundle Adjustment in its new Image Stitching Toolbox. However, the beauty of working with OpenCV and C++ is the abundance of external tools that can be easily integrated into the pipeline. We will, therefore, see how to integrate an external bundle adjuster, the Ceres non-linear optimization package. Now that we have sketched an outline of our approach to SfM using OpenCV, we will see how each element can be implemented. Estimating the camera motion from a pair of images Before we set out to actually find the motion between two cameras, let's examine the inputs and the tools we have at hand to perform this operation. First, we have two images of the same scene from (hopefully not extremely) different positions in space. This is a powerful asset, and we will make sure that we use it. As for tools, we should take a look at mathematical objects that impose constraints over our images, cameras, and the scene. Two very useful mathematical objects are the fundamental matrix (denoted by F) and the essential matrix (denoted by E). They are mostly similar, except that the essential matrix is assuming usage of calibrated cameras; this is the case for us, so we will choose it. OpenCV allows us to find the fundamental matrix via the findFundamentalMat function and the essential matrix via the findEssentialMatrix function. Finding the essential matrix can be done as follows: Mat E = findEssentialMat(leftPoints, rightPoints, focal, pp); This function makes use of matching points in the "left" image, leftPoints, and "right" image, rightPoints, which we will discuss shortly, as well as two additional pieces of information from the camera's calibration: the focal length, focal, and principal point, pp. The essential matrix, E, is a 3 x 3 matrix, which imposes the following constraint on a point in one image and a point in the other image: x'K­TEKx = 0, where x is a point in the first image one, x' is the corresponding point in the second image, and K is the calibration matrix. This is extremely useful, as we are about to see. Another important fact we use is that the essential matrix is all we need in order to recover the two cameras' positions from our images, although only up to an arbitrary unit of scale. So, if we obtain the essential matrix, we know where each camera is positioned in space and where it is looking. We can easily calculate the matrix if we have enough of those constraint equations, simply because each equation can be used to solve for a small part of the matrix. In fact, OpenCV internally calculates it using just five point-pairs, but through the Random Sample Consensus algorithm (RANSAC), many more pairs can be used and make for a more robust solution. Point matching using rich feature descriptors Now we will make use of our constraint equations to calculate the essential matrix. 
To get our constraints, remember that for each point in image A, we must find a corresponding point in image B. We can achieve such a matching using OpenCV's extensive 2D feature-matching framework, which has greatly matured in the past few years. Feature extraction and descriptor matching is an essential process in computer vision and is used in many methods to perform all sorts of operations, for example, detecting the position and orientation of an object in the image or searching a big database of images for similar images through a given query. In essence, feature extraction means selecting points in the image that would make for good features and computing a descriptor for them. A descriptor is a vector of numbers that describes the surrounding environment around a feature point in an image. Different methods have different lengths and data types for their descriptor vectors. Descriptor Matching is the process of finding a corresponding feature from one set in another using its descriptor. OpenCV provides very easy and powerful methods to support feature extraction and matching. Let's examine a very simple feature extraction and matching scheme: vector<KeyPoint> keypts1, keypts2; Mat desc1, desc2; // detect keypoints and extract ORB descriptors Ptr<Feature2D> orb = ORB::create(2000); orb->detectAndCompute(img1, noArray(), keypts1, desc1); orb->detectAndCompute(img2, noArray(), keypts2, desc2); // matching descriptors Ptr<DescriptorMatcher> matcher = DescriptorMatcher::create("BruteForce-Hamming"); vector<DMatch> matches; matcher->match(desc1, desc2, matches); You may have already seen similar OpenCV code, but let's review it quickly. Our goal is to obtain three elements: feature points for two images, descriptors for them, and a matching between the two sets of features. OpenCV provides a range of feature detectors, descriptor extractors, and matchers. In this simple example, we use the ORB class to get both the 2D location of Oriented BRIEF (ORB) (where Binary Robust Independent Elementary Features (BRIEF)) feature points and their respective descriptors. We use a brute-force binary matcher to get the matching, which is the most straightforward way to match two feature sets by comparing each feature in the first set to each feature in the second set (hence the phrasing "brute-force"). In the following image, we will see a matching of feature points on two images from the Fountain-P11 sequence found at http://cvlab.epfl.ch/~strecha/multiview/denseMVS.html: Practically, raw matching like we just performed is good only up to a certain level, and many matches are probably erroneous. For that reason, most SfM methods perform some form of filtering on the matches to ensure correctness and reduce errors. One form of filtering, which is built into OpenCV's brute-force matcher, is cross-check filtering. That is, a match is considered true if a feature of the first image matches a feature of the second image, and the reverse check also matches the feature of the second image with the feature of the first image. Another common filtering mechanism, used in the provided code, is to filter based on the fact that the two images are of the same scene and have a certain stereo-view relationship between them. In practice, the filter tries to robustly calculate the fundamental or essential matrix and retain those feature pairs that correspond to this calculation with small errors. An alternative to using rich features, such as ORB, is to use optical flow. 
The following information box provides a short overview of optical flow. It is possible to use optical flow instead of descriptor matching to find the required point matching between two images, while the rest of the SfM pipeline remains the same. OpenCV recently extended its API for getting the flow field from two images, and now it is faster and more powerful. Optical flow It is the process of matching selected points from one image to another, assuming that both images are part of a sequence and relatively close to one another. Most optical flow methods compare a small region, known as the search window or patch, around each point from image A to the same area in image B. Following a very common rule in computer vision, called the brightness constancy constraint (and other names), the small patches of the image will not change drastically from one image to other, and therefore the magnitude of their subtraction should be close to zero. In addition to matching patches, newer methods of optical flow use a number of additional methods to get better results. One is using image pyramids, which are smaller and smaller resized versions of the image, which allow for working from coarse to-fine—a very well-used trick in computer vision. Another method is to define global constraints on the flow field, assuming that the points close to each other move together in the same direction. Finding camera matrices Now that we have obtained matches between keypoints, we can calculate the essential matrix. However, we must first align our matching points into two arrays, where an index in one array corresponds to the same index in the other. This is required by the findEssentialMat function as we've seen in the Estimating Camera Motion section. We would also need to convert the KeyPoint structure to a Point2f structure. We must pay special attention to the queryIdx and trainIdx member variables of DMatch, the OpenCV struct that holds a match between two keypoints, as they must align with the way we used the DescriptorMatcher::match() function. The following code section shows how to align a matching into two corresponding sets of 2D points, and how these can be used to find the essential matrix: vector<KeyPoint> leftKpts, rightKpts; // ... obtain keypoints using a feature extractor vector<DMatch> matches; // ... obtain matches using a descriptor matcher //align left and right point sets vector<Point2f> leftPts, rightPts; for (size_t i = 0; i < matches.size(); i++) { // queryIdx is the "left" image leftPts.push_back(leftKpts[matches[i].queryIdx].pt); // trainIdx is the "right" image rightPts.push_back(rightKpts[matches[i].trainIdx].pt); } //robustly find the Essential Matrix Mat status; Mat E = findEssentialMat( leftPts, //points from left image rightPts, //points from right image focal, //camera focal length factor pp, //camera principal point cv::RANSAC, //use RANSAC for a robust solution 0.999, //desired solution confidence level 1.0, //point-to-epipolar-line threshold status); //binary vector for inliers We may later use the status binary vector to prune those points that align with the recovered essential matrix. Refer to the following image for an illustration of point matching after pruning. The red arrows mark feature matches that were removed in the process of finding the matrix, and the green arrows are feature matches that were kept: Now we are ready to find the camera matrices; however, the new OpenCV 3 API makes things very easy for us by introducing the recoverPose function. 
First, we will briefly examine the structure of the camera matrix we will use, P = [R | t]. This is the model for our camera; it consists of two elements, rotation (denoted as R) and translation (denoted as t). The interesting thing about it is that it holds a very essential equation: x = PX, where x is a 2D point on the image and X is a 3D point in space. There is more to it, but this matrix gives us a very important relationship between the image points and the scene points. So, now that we have a motivation for finding the camera matrices, we will see how it can be done. The following code section shows how to decompose the essential matrix into the rotation and translation elements: Mat E; // ... find the essential matrix Mat R, t; //placeholders for rotation and translation //Find Pright camera matrix from the essential matrix //Cheirality check is performed internally. recoverPose(E, leftPts, rightPts, R, t, focal, pp, mask); Very simple. Without going too deep into mathematical interpretation, this conversion of the essential matrix to rotation and translation is possible because the essential matrix was originally composed of these two elements. Strictly for satisfying our curiosity, we can look at the following equation for the essential matrix, which appears in the literature: E = [t]×R, where [t]× is the skew-symmetric (cross-product) matrix of the translation vector t. We see that it is composed of (some form of) a translation element, t, and a rotational element, R. Note that a cheirality check is internally performed in the recoverPose function. The cheirality check makes sure that all triangulated 3D points are in front of the reconstructed camera. Camera matrix recovery from the essential matrix has in fact four possible solutions, but the only correct solution is the one that will produce triangulated points in front of the camera, hence the need for a cheirality check. Note that what we just did only gives us one camera matrix, and for triangulation, we require two camera matrices. This operation assumes that one camera matrix is fixed and canonical (no rotation and no translation): P0 = [I | 0]. The other camera that we recovered from the essential matrix has moved and rotated in relation to the fixed one. This also means that any of the 3D points that we recover from these two camera matrices will have the first camera at the world origin point (0, 0, 0). One more thing we can think of adding to our method is error checking. Many times, the calculation of an essential matrix from point matching is erroneous, and this affects the resulting camera matrices. Continuing to triangulate with faulty camera matrices is pointless. We can install a check to see whether the rotation element is a valid rotation matrix. Keeping in mind that rotation matrices must have a determinant of 1 (or -1), we can simply do the following: bool CheckCoherentRotation(const cv::Mat_<double>& R) { if (fabs(fabs(determinant(R)) - 1.0) > 1e-07) { cerr << "rotation matrix is invalid" << endl; return false; } return true; } We can now see how all these elements combine into a function that recovers the P matrices. 
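As a side note, the same two-view recovery can be sketched compactly with OpenCV's Python bindings (cv2) before we return to the C++ implementation. This is an illustrative sketch only: the image file names, the focal length, and the principal point are assumed placeholder values, and error handling is omitted.

```python
# Illustrative two-view pose recovery and triangulation using cv2 (OpenCV 3+).
import cv2
import numpy as np

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file names
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)
focal = 1000.0                                        # assumed calibration values
pp = (img1.shape[1] / 2.0, img1.shape[0] / 2.0)

# 1. Detect ORB features and match them (brute-force Hamming with cross-check).
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2. Robustly estimate E and recover R, t (the cheirality check runs inside).
E, mask = cv2.findEssentialMat(pts1, pts2, focal=focal, pp=pp,
                               method=cv2.RANSAC, prob=0.999, threshold=1.0)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, focal=focal, pp=pp, mask=mask)

# 3. Triangulate inliers with the canonical P0 = K[I|0] and the recovered P1 = K[R|t].
K = np.array([[focal, 0, pp[0]], [0, focal, pp[1]], [0, 0, 1]])
P0 = K.dot(np.hstack([np.eye(3), np.zeros((3, 1))]))
P1 = K.dot(np.hstack([R, t]))
inliers = mask.ravel() > 0
points4d = cv2.triangulatePoints(P0, P1, pts1[inliers].T, pts2[inliers].T)
points3d = (points4d[:3] / points4d[3]).T
print("Reconstructed %d 3D points" % len(points3d))
```

The structure mirrors the C++ version: one canonical camera fixed at the origin and a second camera given by the recovered rotation and translation, with triangulation performed only on the inliers that survive the robust estimation.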
First, we will introduce some convenience data structures and type short hands: typedef std::vector<cv::KeyPoint> Keypoints; typedef std::vector<cv::Point2f> Points2f; typedef std::vector<cv::Point3f> Points3f; typedef std::vector<cv::DMatch> Matching; struct Features { //2D features Keypoints keyPoints; Points2f points; cv::Mat descriptors; }; struct Intrinsics { //camera intrinsic parameters cv::Mat K; cv::Mat Kinv; cv::Mat distortion; }; Now, we can write the camera matrix finding function: void findCameraMatricesFromMatch( constIntrinsics& intrin, constMatching& matches, constFeatures& featuresLeft, constFeatures& featuresRight, cv::Matx34f& Pleft, cv::Matx34f& Pright) { { //Note: assuming fx = fy const double focal = intrin.K.at<float>(0, 0); const cv::Point2d pp(intrin.K.at<float>(0, 2), intrin.K.at<float>(1, 2)); //align left and right point sets using the matching Features left; Features right; GetAlignedPointsFromMatch( featuresLeft, featuresRight, matches, left, right); //find essential matrix Mat E, mask; E = findEssentialMat( left.points, right.points, focal, pp, RANSAC, 0.999, 1.0, mask); Mat_<double> R, t; //Find Pright camera matrix from the essential matrix recoverPose(E, left.points, right.points, R, t, focal, pp, mask); Pleft = Matx34f::eye(); Pright = Matx34f(R(0,0), R(0,1), R(0,2), t(0), R(1,0), R(1,1), R(1,2), t(1), R(2,0), R(2,1), R(2,2), t(2)); } At this point, we have the two cameras that we need in order to reconstruct the scene. The canonical first camera, in the Pleft variable, and the second camera we calculated, form the essential matrix in the Pright variable. Choosing the image pair to use first Given we have more than just two image views of the scene, we must choose which two views we will start the reconstruction from. In their paper, Snavely et al. suggest that we pick the two views that have the least number of homography inliers. A homography is a relationship between two images or sets of points that lie on a plane; the homography matrix defines the transformation from one plane to another. In case of an image or a set of 2D points, the homography matrix is of size 3 x 3. When Snavely et al. look for the lowest inlier ratio, they essentially suggest to calculate the homography matrix between all pairs of images and pick the pair whose points mostly do not correspond with the homography matrix. This means the geometry of the scene in these two views is not planar or at least not the same plane in both views, which helps when doing 3D reconstruction. For reconstruction, it is best to look at a complex scene with non-planar geometry, with things closer and farther away from the camera. 
The following code snippet shows how to use OpenCV's findHomography function to count the number of inliers between two views whose features were already extracted and matched: int findHomographyInliers( const Features& left, const Features& right, const Matching& matches) { //Get aligned feature vectors Features alignedLeft; Features alignedRight; GetAlignedPointsFromMatch(left, right, matches, alignedLeft, alignedRight); //Calculate homography with at least 4 points Mat inlierMask; Mat homography; if(matches.size() >= 4) { homography = findHomography(alignedLeft.points, alignedRight.points, cv::RANSAC, RANSAC_THRESHOLD, inlierMask); } if(matches.size() < 4 or homography.empty()) { return 0; } return countNonZero(inlierMask); } The next step is to perform this operation on all pairs of image views in our bundle and sort them based on the ratio of homography inliers to outliers: //sort pairwise matches to find the lowest Homography inliers map<float, ImagePair> pairInliersCt; const size_t numImages = mImages.size(); //scan all possible image pairs (symmetric) for (size_t i = 0; i < numImages - 1; i++) { for (size_t j = i + 1; j < numImages; j++) { if (mFeatureMatchMatrix[i][j].size() < MIN_POINT_CT) { //Not enough points in matching pairInliersCt[1.0] = {i, j}; continue; } //Find number of homography inliers const int numInliers = findHomographyInliers( mImageFeatures[i], mImageFeatures[j], mFeatureMatchMatrix[i][j]); const float inliersRatio = (float)numInliers / (float)(mFeatureMatchMatrix[i][j].size()); pairInliersCt[inliersRatio] = {i, j}; } } Note that the std::map<float, ImagePair> will internally sort the pairs based on the map's key: the inliers ratio. We then simply need to traverse this map from the beginning to find the image pair with least inlier ratio, and if that pair cannot be used, we can easily skip ahead to the next pair. Summary In this article, we saw how OpenCV v3 can help us approach Structure from Motion in a manner that is both simple to code and to understand. OpenCV v3's new API contains a number of useful functions and data structures that make our lives easier and also assist in a cleaner implementation. However, the state-of-the-art SfM methods are far more complex. There are many issues we choose to disregard in favor of simplicity, and plenty more error examinations that are usually in place. Our chosen methods for the different elements of SfM can also be revisited. Some methods even use the N-view triangulation once they understand the relationship between the features in multiple images. If we would like to extend and deepen our familiarity with SfM, we will certainly benefit from looking at other open source SfM libraries. One particularly interesting project is libMV, which implements a vast array of SfM elements that may be interchanged to get the best results. There is a great body of work from University of Washington that provides tools for many flavors of SfM (Bundler and VisualSfM). This work inspired an online product from Microsoft, called PhotoSynth, and 123D Catch from Adobe. There are many more implementations of SfM readily available online, and one must only search to find quite a lot of them. Resources for Article: Further resources on this subject: Basics of Image Histograms in OpenCV [article] OpenCV: Image Processing using Morphological Filters [article] Face Detection and Tracking Using ROS, Open-CV and Dynamixel Servos [article]

Deep learning and regression analysis

Packt
09 Jan 2017
6 min read
In this article by Richard M. Reese and Jennifer L. Reese, authors of the book, Java for Data Science, We will discuss neural networks can be used to perform regression analysis. However, other techniques may offer a more effective solution. With regression analysis, we want to predict a result based on several input variables (For more resources related to this topic, see here.) We can perform regression analysis using an output layer that consists of a single neuron that sums the weighted input plus bias of the previous hidden layer. Thus, the result is a single value representing the regression. Preparing the data We will use a car evaluation database to demonstrate how to predict the acceptability of a car based on a series of attributes. The file containing the data we will be using can be downloaded from: http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data. It consists of car data such as price, number of passengers, and safety information, and an assessment of its overall quality. It is this latter element that we will try to predict. The comma-delimited values in each attribute are shown next, along with substitutions. The substitutions are needed because the model expects numeric data: Attribute Original value Substituted value Buying price vhigh, high, med, low 3,2,1,0 Maintenance price vhigh, high, med, low 3,2,1,0 Number of doors 2, 3, 4, 5-more 2,3,4,5 Seating 2, 4, more 2,4,5 Cargo space small, med, big 0,1,2 Safety low, med, high 0,1,2 There are 1,728 instances in the file. The cars are marked with four classes: Class Number of instances Percentage of instances Original value Substituted value Unacceptable 1210 70.023% unacc 0 Acceptable 384 22.222% acc 1 Good 69 3.99% good 2 Very good 65 3.76% v-good 3 Setting up the class We start with the definition of a CarRegressionExample class, as shown next: public class CarRegressionExample { public CarRegressionExample() { try { ... } catch (IOException | InterruptedException ex) { // Handle exceptions } } public static void main(String[] args) { new CarRegressionExample(); } } Reading and preparing the data The first task is to read in the data. We will use the CSVRecordReader class to get the data: RecordReader recordReader = new CSVRecordReader(0, ","); recordReader.initialize(new FileSplit(new File("car.txt"))); DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader, 1728, 6, 4); With this dataset, we will split the data into two sets. Sixty five percent of the data is used for training and the rest for testing: DataSet dataset = iterator.next(); dataset.shuffle(); SplitTestAndTrain testAndTrain = dataset.splitTestAndTrain(0.65); DataSet trainingData = testAndTrain.getTrain(); DataSet testData = testAndTrain.getTest(); The data now needs to be normalized: DataNormalization normalizer = new NormalizerStandardize(); normalizer.fit(trainingData); normalizer.transform(trainingData); normalizer.transform(testData); We are now ready to build the model. Building the model A MultiLayerConfiguration instance is created using a series of NeuralNetConfiguration.Builder methods. The following is the dice used. We will discuss the individual methods following the code. Note that this configuration uses two layers. 
The last layer uses the softmax activation function, which is used for regression analysis: MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder() .iterations(1000) .activation("relu") .weightInit(WeightInit.XAVIER) .learningRate(0.4) .list() .layer(0, new DenseLayer.Builder() .nIn(6).nOut(3) .build()) .layer(1, new OutputLayer .Builder(LossFunctions.LossFunction .NEGATIVELOGLIKELIHOOD) .activation("softmax") .nIn(3).nOut(4).build()) .backprop(true).pretrain(false) .build(); Two layers are created. The first is the input layer. The DenseLayer.Builder class is used to create this layer. The DenseLayer class is a feed-forward and fully connected layer. The created layer uses the six car attributes as input. The output consists of three neurons that are fed into the output layer and is duplicated here for your convenience: .layer(0, new DenseLayer.Builder() .nIn(6).nOut(3) .build()) The second layer is the output layer created with the OutputLayer.Builder class. It uses a loss function as the argument of its constructor. The softmax activation function is used since we are performing regression as shown here: .layer(1, new OutputLayer .Builder(LossFunctions.LossFunction .NEGATIVELOGLIKELIHOOD) .activation("softmax") .nIn(3).nOut(4).build()) Next, a MultiLayerNetwork instance is created using the configuration. The model is initialized, its listeners are set, and then the fit method is invoked to perform the actual training. The ScoreIterationListener instance will display information as the model trains which we will see shortly in the output of this example. Its constructor argument specifies the frequency that information is displayed: MultiLayerNetwork model = new MultiLayerNetwork(conf); model.init(); model.setListeners(new ScoreIterationListener(100)); model.fit(trainingData); We are now ready to evaluate the model. Evaluating the model In the next sequence of code, we evaluate the model against the training dataset. An Evaluation instance is created using an argument specifying that there are four classes. The test data is fed into the model using the output method. The eval method takes the output of the model and compares it against the test data classes to generate statistics. The getLabels method returns the expected values: Evaluation evaluation = new Evaluation(4); INDArray output = model.output(testData.getFeatureMatrix()); evaluation.eval(testData.getLabels(), output); out.println(evaluation.stats()); The output of the training follows, which is produced by the ScoreIterationListener class. However, the values you get may differ due to how the data is selected and analyzed. 
Notice that the score improves with the iterations but levels out after about 500 iterations: 12:43:35.685 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 0 is 1.443480901811554 12:43:36.094 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 100 is 0.3259061845624861 12:43:36.390 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 200 is 0.2630572026049783 12:43:36.676 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 300 is 0.24061281470878784 12:43:36.977 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 400 is 0.22955121170274934 12:43:37.292 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 500 is 0.22249920540161677 12:43:37.575 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 600 is 0.2169898450109222 12:43:37.872 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 700 is 0.21271599814600958 12:43:38.161 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 800 is 0.2075677126088741 12:43:38.451 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 900 is 0.20047317735870715 This is followed by the results of the stats method as shown next. The first part reports on how examples are classified and the second part displays various statistics: Examples labeled as 0 classified by model as 0: 397 times Examples labeled as 0 classified by model as 1: 10 times Examples labeled as 0 classified by model as 2: 1 times Examples labeled as 1 classified by model as 0: 8 times Examples labeled as 1 classified by model as 1: 113 times Examples labeled as 1 classified by model as 2: 1 times Examples labeled as 1 classified by model as 3: 1 times Examples labeled as 2 classified by model as 1: 7 times Examples labeled as 2 classified by model as 2: 21 times Examples labeled as 2 classified by model as 3: 14 times Examples labeled as 3 classified by model as 1: 2 times Examples labeled as 3 classified by model as 3: 30 times ==========================Scores======================================== Accuracy: 0.9273 Precision: 0.854 Recall: 0.8323 F1 Score: 0.843 ======================================================================== The regression model does a reasonable job with this dataset. Summary In this article, we examined deep learning and regression analysis. We showed how to prepare the data and class, build the model, and evaluate the model. We used sample data and displayed output statistics to demonstrate the relative effectiveness of our model. Resources for Article: Further resources on this subject: KnockoutJS Templates [article] The Heart of It All [article] Bringing DevOps to Network Operations [article]

Microsoft Cognitive Services

Packt
04 Jan 2017
16 min read
In this article by Leif Henning Larsen, author of the book Learning Microsoft Cognitive Services, we will look into what Microsoft Cognitive Services offer. You will then learn how to utilize one of the APIs by recognizing faces in images. Microsoft Cognitive Services give developers the possibilities of adding AI-like capabilities to their applications. Using a few lines of code, we can take advantage of powerful algorithms that would usually take a lot of time, effort, and hardware to do yourself. (For more resources related to this topic, see here.) Overview of Microsoft Cognitive Services Using Cognitive Services means you have 21 different APIs at your hand. These are in turn separated into 5 top-level domains according to what they do. They are vision, speech, language, knowledge, and search. Let's see more about them in the following sections. Vision APIs under the vision flags allows your apps to understand images and video content. It allows you to retrieve information about faces, feelings, and other visual content. You can stabilize videos and recognize celebrities. You can read text in images and generate thumbnails from videos and images. There are four APIs contained in the vision area, which we will see now. Computer Vision Using the Computer Vision API, you can retrieve actionable information from images. This means you can identify content (such as image format, image size, colors, faces, and more). You can detect whether an image is adult/racy. This API can recognize text in images and extract it to machine-readable words. It can detect celebrities from a variety of areas. Lastly, it can generate storage-efficient thumbnails with smart cropping functionality. Emotion The Emotion API allows you to recognize emotions, both in images and videos. This can allow for more personalized experiences in applications. The emotions that are detected are cross-cultural emotions: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise. Face We have already seen the very basic example of what the Face API can do. The rest of the API revolves around the same—to detect, identify, organize, and tag faces in photos. Apart from face detection, you can see how likely it is that two faces belong to the same person. You can identify faces and also find similar-looking faces. Video The Video API is about analyzing, editing, and processing videos in your app. If you have a video that is shaky, the API allows you to stabilize it. You can detect and track faces in videos. If a video contains a stationary background, you can detect motion. The API lets you to generate thumbnail summaries for videos, which allows users to see previews or snapshots quickly. Speech Adding one of the Speech APIs allows your application to hear and speak to your users. The APIs can filter noise and identify speakers. They can drive further actions in your application based on the recognized intent. Speech contains three APIs, which we will discuss now. Bing Speech Adding the Bing Speech API to your application allows you to convert speech to text and vice versa. You can convert spoken audio to text either by utilizing a microphone or other sources in real time or by converting audio from files. The API also offer speech intent recognition, which is trained by Language Understanding Intelligent Service to understand the intent. Speaker Recognition The Speaker Recognition API gives your application the ability to know who is talking. 
Using this API, you can use verify that someone speaking is who they claim to be. You can also determine who an unknown speaker is, based on a group of selected speakers. Custom Recognition To improve speech recognition, you can use the Custom Recognition API. This allows you to fine-tune speech recognition operations for anyone, anywhere. Using this API, the speech recognition model can be tailored to the vocabulary and speaking style of the user. In addition to this, the model can be customized to match the expected environment of the application. Language APIs related to language allow your application to process natural language and learn how to recognize what users want. You can add textual and linguistic analysis to your application as well as natural language understanding. The following five APIs can be found in the Language area. Bing Spell Check Bing Spell Check API allows you to add advanced spell checking to your application. Language Understanding Intelligent Service (LUIS) Language Understanding Intelligent Service, or LUIS, is an API that can help your application understand commands from your users. Using this API, you can create language models that understand intents. Using models from Bing and Cortana, you can make these models recognize common requests and entities (such as places, time, and numbers). You can add conversational intelligence to your applications. Linguistic Analysis Linguistic Analysis API lets you parse complex text to explore the structure of text. Using this API, you can find nouns, verbs, and more from text, which allows your application to understand who is doing what to whom. Text Analysis Text Analysis API will help you in extracting information from text. You can find the sentiment of a text (whether the text is positive or negative). You will be able to detect language, topic, and key phrases used throughout the text. Web Language Model Using the Web Language Model (WebLM) API, you are able to leverage the power of language models trained on web-scale data. You can use this API to predict which words or sequences follow a given sequence or word. Knowledge When talking about Knowledge APIs, we are talking about APIs that allow you to tap into rich knowledge. This may be knowledge from the Web, it may be academia, or it may be your own data. Using these APIs, you will be able to explore different nuances of knowledge. The following four APIs are contained in the Knowledge area. Academic Using the Academic API, you can explore relationships among academic papers, journals, and authors. This API allows you to interpret natural language user query strings, which allows your application to anticipate what the user is typing. It will evaluate the said expression and return academic knowledge entities. Entity Linking Entity Linking is the API you would use to extend knowledge of people, places, and events based on the context. As you may know, a single word may be used differently based on the context. Using this API allows you to recognize and identify each separate entity within a paragraph based on the context. Knowledge Exploration The Knowledge Exploration API will let you add the ability to use interactive search for structured data in your projects. It interprets natural language queries and offers auto-completions to minimize user effort. Based on the query expression received, it will retrieve detailed information about matching objects. 
Recommendations The Recommendations API allows you to provide personalized product recommendations for your customers. You can use this API to add frequently bought together functionality to your application. Another feature you can add is item-to-item recommendations, which allow customers to see what other customers who likes this also like. This API will also allow you to add recommendations based on the prior activity of the customer. Search Search APIs give you the ability to make your applications more intelligent with the power of Bing. Using these APIs, you can use a single call to access data from billions of web pages, images videos, and news. The following five APIs are in the search domain. Bing Web Search With Bing Web Search, you can search for details in billions of web documents indexed by Bing. All the results can be arranged and ordered according to the layout you specify, and the results are customized to the location of the end user. Bing Image Search Using Bing Image Search API, you can add advanced image and metadata search to your application. Results include URLs to images, thumbnails, and metadata. You will also be able to get machine-generated captions and similar images and more. This API allows you to filter the results based on image type, layout, freshness (how new is the image), and license. Bing Video Search Bing Video Search will allow you to search for videos and returns rich results. The results contain metadata from the videos, static- or motion- based thumbnails, and the video itself. You can add filters to the result based on freshness, video length, resolution, and price. Bing News Search If you add Bing News Search to your application, you can search for news articles. Results can include authoritative image, related news and categories, information on the provider, URL, and more. To be more specific, you can filter news based on topics. Bing Autosuggest Bing Autosuggest API is a small, but powerful one. It will allow your users to search faster with search suggestions, allowing you to connect powerful search to your apps. Detecting faces with the Face API We have seen what the different APIs can do. Now we will test the Face API. We will not be doing a whole lot, but we will see how simple it is to detect faces in images. The steps we need to cover to do this are as follows: Register for a free Face API preview subscription. Add necessary NuGet packages to our project. Add some UI to the test application. Detect faces on command. Head over to https://www.microsoft.com/cognitive-services/en-us/face-api to start the process of registering for a free subscription to the Face API. By clicking on the yellow button, stating Get started for free,you will be taken to a login page. Log in with your Microsoft account, or if you do not have one, register for one. Once logged in, you will need to verify that the Face API Preview has been selected in the list and accept the terms and conditions. With that out of the way, you will be presented with the following: You will need one of the two keys later, when we are accessing the API. In Visual Studio, create a new WPF application. Following the instructions at https://www.codeproject.com/articles/100175/model-view-viewmodel-mvvm-explained, create a base class that implements the INotifyPropertyChanged interface and a class implementing the ICommand interface. 
The first should be inherited by the ViewModel, the MainViewModel.cs file, while the latter should be used when creating properties to handle button commands. The Face API has a NuGet package, so we need to add that to our project. Head over to NuGet Package Manager for the project we created earlier. In the Browse tab, search for the Microsoft.ProjectOxford.Face package and install the it from Microsoft: As you will notice, another package will also be installed. This is the Newtonsoft.Json package, which is required by the Face API. The next step is to add some UI to our application. We will be adding this in the MainView.xaml file. First, we add a grid and define some rows for the grid: <Grid> <Grid.RowDefinitions> <RowDefinition Height="*" /> <RowDefinition Height="20" /> <RowDefinition Height="30" /> </Grid.RowDefinitions> Three rows are defined. The first is a row where we will have an image. The second is a line for status message, and the last is where we will place some buttons: Next, we add our image element: <Image x_Name="FaceImage" Stretch="Uniform" Source="{Binding ImageSource}" Grid.Row="0" /> We have given it a unique name. By setting the Stretch parameter to Uniform, we ensure that the image keeps its aspect ratio. Further on, we place this element in the first row. Last, we bind the image source to a BitmapImage interface in the ViewModel, which we will look at in a bit. The next row will contain a text block with some status text. The text property will be bound to a string property in the ViewModel: <TextBlock x_Name="StatusTextBlock" Text="{Binding StatusText}" Grid.Row="1" /> The last row will contain one button to browse for an image and one button to be able to detect faces. The command properties of both buttons will be bound to the DelegateCommand properties in the ViewModel: <Button x_Name="BrowseButton" Content="Browse" Height="20" Width="140" HorizontalAlignment="Left" Command="{Binding BrowseButtonCommand}" Margin="5, 0, 0, 5" Grid.Row="2" /> <Button x_Name="DetectFaceButton" Content="Detect face" Height="20" Width="140" HorizontalAlignment="Right" Command="{Binding DetectFaceCommand}" Margin="0, 0, 5, 5" Grid.Row="2"/> With the View in place, make sure that the code compiles and run it. This should present you with the following UI: The last part is to create the binding properties in our ViewModel and make the buttons execute something. Open the MainViewModel.cs file. First, we define two variables: private string _filePath; private IFaceServiceClient _faceServiceClient; The string variable will hold the path to our image, while the IFaceServiceClient variable is to interface the Face API. Next we define two properties: private BitmapImage _imageSource; public BitmapImage ImageSource { get { return _imageSource; } set { _imageSource = value; RaisePropertyChangedEvent("ImageSource"); } } private string _statusText; public string StatusText { get { return _statusText; } set { _statusText = value; RaisePropertyChangedEvent("StatusText"); } } What we have here is a property for the BitmapImage mapped to the Image element in the view. We also have a string property for the status text, mapped to the text block element in the view. As you also may notice, when either of the properties is set, we call the RaisePropertyChangedEvent method. This will ensure that the UI is updated when either of the properties has new values. 
Next, we define our two DelegateCommand objects and do some initialization through the constructor: public ICommand BrowseButtonCommand { get; private set; } public ICommand DetectFaceCommand { get; private set; } public MainViewModel() { StatusText = "Status: Waiting for image..."; _faceServiceClient = new FaceServiceClient("YOUR_API_KEY_HERE"); BrowseButtonCommand = new DelegateCommand(Browse); DetectFaceCommand = new DelegateCommand(DetectFace, CanDetectFace); } In our constructor, we start off by setting the status text. Next, we create an object of the Face API, which needs to be created with the API key we got earlier. At last, we create the DelegateCommand object for our command properties. Note how the browse command does not specify a predicate. This means it will always be possible to click on the corresponding button. To make this compile, we need to create the functions specified in the DelegateCommand constructors—the Browse, DetectFace, and CanDetectFace functions: private void Browse(object obj) { var openDialog = new Microsoft.Win32.OpenFileDialog(); openDialog.Filter = "JPEG Image(*.jpg)|*.jpg"; bool? result = openDialog.ShowDialog(); if (!(bool)result) return; We start the Browse function by creating an OpenFileDialog object. This dialog is assigned a filter for JPEG images, and in turn it is opened. When the dialog is closed, we check the result. If the dialog was cancelled, we simply stop further execution: _filePath = openDialog.FileName; Uri fileUri = new Uri(_filePath); With the dialog closed, we grab the filename of the file selected and create a new URI from it: BitmapImage image = new BitmapImage(fileUri); image.CacheOption = BitmapCacheOption.None; image.UriSource = fileUri; With the newly created URI, we want to create a new BitmapImage interface. We specify it to use no cache, and we set the URI source the URI we created: ImageSource = image; StatusText = "Status: Image loaded..."; } The last step we take is to assign the bitmap image to our BitmapImage property, so the image is shown in the UI. We also update the status text to let the user know the image has been loaded. Before we move on, it is time to make sure that the code compiles and that you are able to load an image into the View: private bool CanDetectFace(object obj) { return !string.IsNullOrEmpty(ImageSource?.UriSource.ToString()); } The CanDetectFace function checks whether or not the detect faces button should be enabled. In this case, it checks whether our image property actually has a URI. If it does, by extension that means we have an image, and we should be able to detect faces: private async void DetectFace(object obj) { FaceRectangle[] faceRects = await UploadAndDetectFacesAsync(); string textToSpeak = "No faces detected"; if (faceRects.Length == 1) textToSpeak = "1 face detected"; else if (faceRects.Length > 1) textToSpeak = $"{faceRects.Length} faces detected"; Debug.WriteLine(textToSpeak); } Our DetectFace method calls an async method to upload and detect faces. The return value contains an array of FaceRectangles. This array contains the rectangle area for all face positions in the given image. We will look into the function we call in a bit. 
After the call has finished executing, we print a line with the number of faces to the debug console window: private async Task<FaceRectangle[]> UploadAndDetectFacesAsync() { StatusText = "Status: Detecting faces..."; try { using (Stream imageFileStream = File.OpenRead(_filePath)) { In the UploadAndDetectFacesAsync function, we create a Stream object from the image. This stream will be used as input for the actual call to the Face API service: Face[] faces = await _faceServiceClient.DetectAsync(imageFileStream, true, true, new List<FaceAttributeType>() { FaceAttributeType.Age }); This line is the actual call to the detection endpoint for the Face API. The first parameter is the file stream we created in the previous step. The rest of the parameters are all optional. The second parameter should be true if you want to get a face ID. The next specifies if you want to receive face landmarks or not. The last parameter takes a list of facial attributes you may want to receive. In our case, we want the age parameter to be returned, so we need to specify that. The return type of this function call is an array of faces with all the parameters you have specified: List<double> ages = faces.Select(face => face.FaceAttributes.Age).ToList(); FaceRectangle[] faceRects = faces.Select(face => face.FaceRectangle).ToArray(); StatusText = "Status: Finished detecting faces..."; foreach(var age in ages) { Console.WriteLine(age); } return faceRects; } } The first line in the previous code iterates over all faces and retrieves the approximate age of all faces. This is later printed to the debug console window, in the following foreach loop. The second line iterates over all faces and retrieves the face rectangle with the rectangular location of all faces. This is the data we return to the calling function. Add a catch clause to finish the method. In case an exception is thrown, in our API call, we catch that. You want to show the error message and return an empty FaceRectangle array. With that code in place, you should now be able to run the full example. The end result will look like the following image: The resulting debug console window will print the following text: 1 face detected 23,7 Summary In this article, we looked at what Microsoft Cognitive Services offer. We got a brief description of all the APIs available. From there, we looked into the Face API, where we saw how to detect faces in images. Resources for Article: Further resources on this subject: Auditing and E-discovery [article] The Sales and Purchase Process [article] Manage Security in Excel [article]

TensorFlow

Packt
04 Jan 2017
17 min read
In this article by Nicholas McClure, the author of the book TensorFlow Machine Learning Cookbook, we will cover basic recipes in order to understand how TensorFlow works and how to access data for this book and additional resources: How TensorFlow works Declaring tensors Using placeholders and variables Working with matrices Declaring operations (For more resources related to this topic, see here.) Introduction Google's TensorFlow engine has a unique way of solving problems. This unique way allows us to solve machine learning problems very efficiently. We will cover the basic steps to understand how TensorFlow operates. This understanding is essential in understanding recipes for the rest of this book. How TensorFlow works At first, computation in TensorFlow may seem needlessly complicated. But there is a reason for it: because of how TensorFlow treats computation, developing more complicated algorithms is relatively easy. This recipe will talk you through the pseudo code of how a TensorFlow algorithm usually works. Getting ready Currently, TensorFlow is only supported on Mac and Linux distributions. Using TensorFlow on Windows requires the usage of a virtual machine. Throughout this book we will only concern ourselves with the Python library wrapper of TensorFlow. This book will use Python 3.4+ (https://www.python.org) and TensorFlow 0.7 (https://www.tensorflow.org). While TensorFlow can run on the CPU, it runs faster if it runs on the GPU, and it is supported on graphics cards with NVidia Compute Capability 3.0+. To run on a GPU, you will also need to download and install the NVidia Cuda Toolkit (https://developer.nvidia.com/cuda-downloads). Some of the recipes will rely on a current installation of the Python packages Scipy, Numpy, and Scikit-Learn as well. How to do it… Here we will introduce the general flow of TensorFlow algorithms. Most recipes will follow this outline: Import or generate data: All of our machine-learning algorithms will depend on data. In this book we will either generate data or use an outside source of data. Sometimes it is better to rely on generated data because we will want to know the expected outcome. Transform and normalize data: The data is usually not in the correct dimension or type that our TensorFlow algorithms expect. We will have to transform our data before we can use it. Most algorithms also expect normalized data and we will do this here as well. TensorFlow has built in functions that can normalize the data for you as follows: data = tf.nn.batch_norm_with_global_normalization(...) Set algorithm parameters: Our algorithms usually have a set of parameters that we hold constant throughout the procedure. For example, this can be the number of iterations, the learning rate, or other fixed parameters of our choosing. It is considered good form to initialize these together so the reader or user can easily find them, as follows: learning_rate = 0.01 iterations = 1000 Initialize variables and placeholders: TensorFlow depends on us telling it what it can and cannot modify. TensorFlow will modify the variables during optimization to minimize a loss function. To accomplish this, we feed in data through placeholders. We need to initialize both of these, variables and placeholders with size and type, so that TensorFlow knows what to expect. 
See the following code:
a_var = tf.constant(42)
x_input = tf.placeholder(tf.float32, [None, input_size])
y_input = tf.placeholder(tf.float32, [None, num_classes])

Define the model structure: After we have the data, and have initialized our variables and placeholders, we have to define the model. This is done by building a computational graph. We tell TensorFlow what operations must be done on the variables and placeholders to arrive at our model predictions:
y_pred = tf.add(tf.mul(x_input, weight_matrix), b_matrix)

Declare the loss functions: After defining the model, we must be able to evaluate the output. This is where we declare the loss function. The loss function is very important as it tells us how far off our predictions are from the actual values:
loss = tf.reduce_mean(tf.square(y_actual - y_pred))

Initialize and train the model: Now that we have everything in place, we need to create an instance of our graph, feed in the data through the placeholders, and let TensorFlow change the variables to better predict our training data. Here is one way to initialize the computational graph:
with tf.Session(graph=graph) as session:
    ...
    session.run(...)
    ...
Note that we can also initiate our graph with:
session = tf.Session(graph=graph)
session.run(...)

(Optional) Evaluate the model: Once we have built and trained the model, we should evaluate the model by looking at how well it does with new data through some specified criteria.

(Optional) Predict new outcomes: It is also important to know how to make predictions on new, unseen data. We can do this with all of our models, once we have them trained.

How it works…
In TensorFlow, we have to set up the data, variables, placeholders, and model before we tell the program to train and change the variables to improve the predictions. TensorFlow accomplishes this through the computational graph. We tell it to minimize a loss function and TensorFlow does this by modifying the variables in the model. TensorFlow knows how to modify the variables because it keeps track of the computations in the model and automatically computes the gradients for every variable. Because of this, we can see how easy it can be to make changes and try different data sources.

See also
A great place to start is the official Python API TensorFlow documentation: https://www.tensorflow.org/versions/r0.7/api_docs/python/index.html
There are also tutorials available: https://www.tensorflow.org/versions/r0.7/tutorials/index.html

Declaring tensors
Getting ready
Tensors are the data structure that TensorFlow operates on in the computational graph. We can declare these tensors as variables or feed them in as placeholders. First we must know how to create tensors. When we create a tensor and declare it to be a variable, TensorFlow creates several graph structures in our computation graph. It is also important to point out that just by creating a tensor, TensorFlow is not adding anything to the computational graph. TensorFlow does this only after creating a variable out of the tensor. See the next section on variables and placeholders for more information.

How to do it…
Here we will cover the main ways to create tensors in TensorFlow.

Fixed tensors:
Creating a zero-filled tensor. Use the following:
zero_tsr = tf.zeros([row_dim, col_dim])
Creating a one-filled tensor. Use the following:
ones_tsr = tf.ones([row_dim, col_dim])
Creating a constant-filled tensor.
Use the following:
filled_tsr = tf.fill([row_dim, col_dim], 42)
Creating a tensor out of an existing constant. Use the following:
constant_tsr = tf.constant([1,2,3])
Note that the tf.constant() function can be used to broadcast a value into an array, mimicking the behavior of tf.fill(), by writing tf.constant(42, [row_dim, col_dim]).

Tensors of similar shape:
We can also initialize variables based on the shape of other tensors, as follows:
zeros_similar = tf.zeros_like(constant_tsr)
ones_similar = tf.ones_like(constant_tsr)
Note that since these tensors depend on prior tensors, we must initialize them in order. Attempting to initialize all the tensors at once will result in an error.

Sequence tensors:
TensorFlow allows us to specify tensors that contain defined intervals. The following functions behave very similarly to the range() outputs and numpy's linspace() outputs. See the following function:
linear_tsr = tf.linspace(start=0.0, stop=1.0, num=3)
The resulting tensor is the sequence [0.0, 0.5, 1.0]. Note that this function includes the specified stop value. See the following function:
integer_seq_tsr = tf.range(start=6, limit=15, delta=3)
The result is the sequence [6, 9, 12]. Note that this function does not include the limit value.

Random tensors:
The following generated random numbers are from a uniform distribution:
randunif_tsr = tf.random_uniform([row_dim, col_dim], minval=0, maxval=1)
Know that this random uniform distribution draws from the interval that includes the minval but not the maxval (minval <= x < maxval). To get a tensor with random draws from a normal distribution, use the following:
randnorm_tsr = tf.random_normal([row_dim, col_dim], mean=0.0, stddev=1.0)
There are also times when we wish to generate normal random values that are assured within certain bounds. The truncated_normal() function always picks normal values within two standard deviations of the specified mean. See the following:
truncnorm_tsr = tf.truncated_normal([row_dim, col_dim], mean=0.0, stddev=1.0)
We might also be interested in randomizing entries of arrays. To accomplish this, there are two functions that help us: random_shuffle() and random_crop(). See the following:
shuffled_output = tf.random_shuffle(input_tensor)
cropped_output = tf.random_crop(input_tensor, crop_size)
Later on in this book, we will be interested in randomly cropping an image of size (height, width, 3), where there are three color channels. To fix a dimension in the cropped_output, you must give it the maximum size in that dimension:
cropped_image = tf.random_crop(my_image, [height/2, width/2, 3])

How it works…
Once we have decided on how to create the tensors, we may also create the corresponding variables by wrapping the tensor in the Variable() function, as follows (more on this in the next section):
my_var = tf.Variable(tf.zeros([row_dim, col_dim]))

There's more…
We are not limited to the built-in functions; we can convert any numpy array, Python list, or constant to a tensor using the function convert_to_tensor(). Know that this function also accepts tensors as an input, in case we wish to generalize a computation inside a function.

Using placeholders and variables
Getting ready
One of the most important distinctions to make with data is whether it is a placeholder or a variable. Variables are the parameters of the algorithm and TensorFlow keeps track of how to change these to optimize the algorithm.
Placeholders are objects that allow you to feed in data of a specific type and shape, or that depend on the results of the computational graph, like the expected outcome of a computation.

How to do it…
The main way to create a variable is by using the Variable() function, which takes a tensor as an input and outputs a variable. This is only the declaration; we still need to initialize the variable. Initializing is what puts the variable, with the corresponding methods, on the computational graph. Here is an example of creating and initializing a variable:
my_var = tf.Variable(tf.zeros([2,3]))
sess = tf.Session()
initialize_op = tf.initialize_all_variables()
sess.run(initialize_op)
To see what the computational graph looks like after creating and initializing a variable, see the next part in this section, How it works…, Figure 1.
Placeholders simply hold the position for data to be fed into the graph. Placeholders get data from a feed_dict argument in the session. To put a placeholder in the graph, we must perform at least one operation on the placeholder. We initialize the graph, declare x to be a placeholder, and define y as the identity operation on x, which just returns x. We then create data to feed into the x placeholder and run the identity operation. It is worth noting that TensorFlow will not return a self-referenced placeholder in the feed dictionary. The code is shown below and the resulting graph is in the next section, How it works…:
import numpy as np
sess = tf.Session()
x = tf.placeholder(tf.float32, shape=[2,2])
y = tf.identity(x)
x_vals = np.random.rand(2,2)
sess.run(y, feed_dict={x: x_vals})
# Note that sess.run(x, feed_dict={x: x_vals}) will result in a self-referencing error.

How it works…
The computational graph of initializing a variable as a tensor of zeros is shown in Figure 1:
Figure 1: Variable. Here we can see what the computational graph looks like in detail, with just one variable, initialized to all zeros. The grey shaded region is a very detailed view of the operations and constants involved. The main computational graph with less detail is the smaller graph outside of the grey region in the upper right.
Similarly, the computational graph of feeding a numpy array into a placeholder can be seen in Figure 2:
Figure 2: Computational graph of an initialized placeholder. The grey shaded region is a very detailed view of the operations and constants involved. The main computational graph with less detail is the smaller graph outside of the grey region in the upper right.

There's more…
During the run of the computational graph, we have to tell TensorFlow when to initialize the variables we have created. While each variable has an initializer method, the most common way to do this is with the helper function initialize_all_variables(). This function creates an operation in the graph that initializes all the variables we have created, as follows:
initializer_op = tf.initialize_all_variables()
But if we want to initialize a variable based on the results of initializing another variable, we have to initialize variables in the order we want, as follows:
sess = tf.Session()
first_var = tf.Variable(tf.zeros([2,3]))
sess.run(first_var.initializer)
second_var = tf.Variable(tf.zeros_like(first_var)) # Depends on first_var
sess.run(second_var.initializer)

Working with matrices
Getting ready
Many algorithms depend on matrix operations.
TensorFlow gives us easy-to-use operations to perform such matrix calculations. For all of the following examples, we can create a graph session by running the following code:
import numpy as np
import tensorflow as tf
sess = tf.Session()

How to do it…
Creating matrices: We can create two-dimensional matrices from numpy arrays or nested lists, as we described in the earlier section on tensors. We can also use the tensor creation functions and specify a two-dimensional shape for functions like zeros(), ones(), truncated_normal(), and so on. TensorFlow also allows us to create a diagonal matrix from a one-dimensional array or list with the function diag(), as follows:
identity_matrix = tf.diag([1.0, 1.0, 1.0]) # Identity matrix
A = tf.truncated_normal([2, 3]) # 2x3 random normal matrix
B = tf.fill([2,3], 5.0) # 2x3 constant matrix of 5's
C = tf.random_uniform([3,2]) # 3x2 random uniform matrix
D = tf.convert_to_tensor(np.array([[1., 2., 3.],[-3., -7., -1.],[0., 5., -2.]]))
print(sess.run(identity_matrix))
[[ 1. 0. 0.]
 [ 0. 1. 0.]
 [ 0. 0. 1.]]
print(sess.run(A))
[[ 0.96751703 0.11397751 -0.3438891 ]
 [-0.10132604 -0.8432678 0.29810596]]
print(sess.run(B))
[[ 5. 5. 5.]
 [ 5. 5. 5.]]
print(sess.run(C))
[[ 0.33184157 0.08907614]
 [ 0.53189191 0.67605299]
 [ 0.95889051 0.67061249]]
print(sess.run(D))
[[ 1. 2. 3.]
 [-3. -7. -1.]
 [ 0. 5. -2.]]
Note that if we were to run sess.run(C) again, we would reinitialize the random variables and end up with different random values.
Addition and subtraction, use the following:
print(sess.run(A+B))
[[ 4.61596632 5.39771316 4.4325695 ]
 [ 3.26702736 5.14477345 4.98265553]]
print(sess.run(B-B))
[[ 0. 0. 0.]
 [ 0. 0. 0.]]
Multiplication, use the following:
print(sess.run(tf.matmul(B, identity_matrix)))
[[ 5. 5. 5.]
 [ 5. 5. 5.]]
Also, the function matmul() has arguments that specify whether or not to transpose the arguments before multiplication or whether each matrix is sparse.
Transpose the arguments as follows:
print(sess.run(tf.transpose(C)))
[[ 0.67124544 0.26766731 0.99068872]
 [ 0.25006068 0.86560275 0.58411312]]
Again, it is worth mentioning that the reinitializing gives us different values than before.
Determinant, use the following:
print(sess.run(tf.matrix_determinant(D)))
-38.0
Inverse, use the following:
print(sess.run(tf.matrix_inverse(D)))
[[-0.5 -0.5 -0.5 ]
 [ 0.15789474 0.05263158 0.21052632]
 [ 0.39473684 0.13157895 0.02631579]]
Note that the inverse method is based on the Cholesky decomposition if the matrix is symmetric positive definite, or the LU decomposition otherwise.
Decompositions:
Cholesky decomposition, use the following:
print(sess.run(tf.cholesky(identity_matrix)))
[[ 1. 0. 0.]
 [ 0. 1. 0.]
 [ 0. 0. 1.]]
Eigenvalues and eigenvectors, use the following code:
print(sess.run(tf.self_adjoint_eig(D)))
[[-10.65907521 -0.22750691 2.88658212]
 [ 0.21749542 0.63250104 -0.74339638]
 [ 0.84526515 0.2587998 0.46749277]
 [ -0.4880805 0.73004459 0.47834331]]
Note that the function self_adjoint_eig() outputs the eigenvalues in the first row and the subsequent eigenvectors in the remaining rows. In mathematics, this is called the eigendecomposition of a matrix.

How it works…
TensorFlow provides all the tools for us to get started with numerical computations and add such computations to our graphs. This notation might seem quite heavy for simple matrix operations. Remember that we are adding these operations to the graph and telling TensorFlow what tensors to run through those operations.
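As a quick illustration of how these operations compose, here is a small sketch, not taken from the original recipe, that solves an ordinary least-squares problem using only the functions covered above; the matrix values and variable names are made up for demonstration and assume the same session and imports as before:
A_mat = tf.convert_to_tensor(np.array([[1., 1.], [1., 2.], [1., 3.]])) # Hypothetical 3x2 design matrix
b_vec = tf.convert_to_tensor(np.array([[1.], [2.], [4.]]))             # Hypothetical 3x1 target vector
# Normal equations: x = (A^T A)^-1 A^T b, built from matmul(), transpose(), and matrix_inverse()
AtA = tf.matmul(tf.transpose(A_mat), A_mat)
AtA_inv = tf.matrix_inverse(AtA)
x_sol = tf.matmul(tf.matmul(AtA_inv, tf.transpose(A_mat)), b_vec)
print(sess.run(x_sol)) # Approximately [[-0.667], [1.5]]: the intercept and slope of the fitted line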
Declaring operations
Getting ready
Besides the standard arithmetic operations, TensorFlow provides additional operations that we should be aware of, and we should know how to use them, before proceeding. Again, we can create a graph session by running the following code:
import tensorflow as tf
sess = tf.Session()

How to do it…
TensorFlow has the standard operations on tensors: add(), sub(), mul(), and div(). Note that all of the operations in this section will evaluate the inputs element-wise unless specified otherwise.
TensorFlow provides some variations of div() and relevant functions. It is worth mentioning that div() returns the same type as the inputs. This means it really returns the floor of the division (akin to Python 2) if the inputs are integers. To return the Python 3 version, which casts integers into floats before dividing and always returns a float, TensorFlow provides the function truediv(), shown as follows:
print(sess.run(tf.div(3,4)))
0
print(sess.run(tf.truediv(3,4)))
0.75
If we have floats and want integer division, we can use the function floordiv(). Note that this will still return a float, but rounded down to the nearest integer. The function is shown as follows:
print(sess.run(tf.floordiv(3.0,4.0)))
0.0
Another important function is mod(). This function returns the remainder after division. It is shown as follows:
print(sess.run(tf.mod(22.0, 5.0)))
2.0
The cross product between two tensors is achieved by the cross() function. Remember that the cross product is only defined for two 3-dimensional vectors, so it only accepts two 3-dimensional tensors. The function is shown as follows:
print(sess.run(tf.cross([1., 0., 0.], [0., 1., 0.])))
[ 0. 0. 1.0]
Here is a compact list of the more common math functions. All of these functions operate element-wise:
abs(): Absolute value of one input tensor
ceil(): Ceiling function of one input tensor
cos(): Cosine function of one input tensor
exp(): Base e exponential of one input tensor
floor(): Floor function of one input tensor
inv(): Multiplicative inverse (1/x) of one input tensor
log(): Natural logarithm of one input tensor
maximum(): Element-wise max of two tensors
minimum(): Element-wise min of two tensors
neg(): Negative of one input tensor
pow(): The first tensor raised to the second tensor element-wise
round(): Rounds one input tensor
rsqrt(): One over the square root of one tensor
sign(): Returns -1, 0, or 1, depending on the sign of the tensor
sin(): Sine function of one input tensor
sqrt(): Square root of one input tensor
square(): Square of one input tensor
Specialty mathematical functions: There are some special math functions used in machine learning that are worth mentioning, and TensorFlow has built-in functions for them. Again, these functions operate element-wise, unless specified otherwise:
digamma(): Psi function, the derivative of the lgamma() function
erf(): Gaussian error function, element-wise, of one tensor
erfc(): Complementary error function of one tensor
igamma(): Lower regularized incomplete gamma function
igammac(): Upper regularized incomplete gamma function
lbeta(): Natural logarithm of the absolute value of the beta function
lgamma(): Natural logarithm of the absolute value of the gamma function
squared_difference(): Computes the square of the differences between two tensors

How it works…
It is important to know what functions are available to us to add to our computational graphs. Mostly we will be concerned with the preceding functions.
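To make the preceding lists concrete, here is a brief, illustrative sketch, not part of the original recipe, that exercises a few of these element-wise functions in the same session; the expected results are noted as comments:
print(sess.run(tf.pow(2., 3.)))                 # 8.0: 2 raised to the power of 3
print(sess.run(tf.maximum(3., 4.)))             # 4.0: element-wise maximum of the two inputs
print(sess.run(tf.squared_difference(3., 5.)))  # 4.0: the square of (3 - 5)
print(sess.run(tf.abs(-7.)))                    # 7.0: absolute value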
We can also generate many different custom functions as compositions of the preceding, as follows:
# Tangent function (tan(pi/4)=1)
print(sess.run(tf.div(tf.sin(3.1416/4.), tf.cos(3.1416/4.))))
1.0

There's more…
If we wish to add other operations to our graphs that are not listed here, we must create our own from the preceding functions. Here is an example of an operation not listed above that we can add to our graph:
# Define a custom polynomial function
def custom_polynomial(value):
    # Return 3 * x^2 - x + 10
    return(tf.sub(3 * tf.square(value), value) + 10)
print(sess.run(custom_polynomial(11)))
362

Summary
Thus in this article we have implemented some introductory recipes that will help us to learn the basics of TensorFlow.
Resources for Article:
Further resources on this subject:
Data Clustering [article]
The TensorFlow Toolbox [article]
Implementing Artificial Neural Networks with TensorFlow [article]