
How-To Tutorials - Data


Configuring Brokers

Packt
21 Oct 2015
18 min read
In this article by Saurabh Minni, author of Apache Kafka Cookbook, we will cover the following topics:

Configuring basic settings
Configuring threads and performance
Configuring log settings
Configuring replica settings
Configuring the ZooKeeper settings
Configuring other miscellaneous parameters

(For more resources related to this topic, see here.)

This article explains the configuration of a Kafka broker. Before we get started with Kafka, it is critical to configure it to suit us best. The best part about Kafka is that it is highly configurable. Most of the time you will be good to go with the default settings, but when dealing with scale and performance you might want to get your hands on a configuration that suits your application best.

Configuring basic settings

Let's configure the basic settings for your Apache Kafka broker.

Getting ready

I assume you already have Kafka installed. Make a copy of the server.properties file from the config folder. Now, let's get cracking with your favorite editor.

How to do it...

Open your server.properties file:

1. The first configuration that you need to change is broker.id:
    broker.id=0
2. Next, give a host name to your machine:
    host.name=localhost
3. You also need to set the port number to listen on:
    port=9092
4. Lastly, set the directory for data persistence:
    log.dirs=/disk1/kafka-logs

How it works…

With these basic configuration parameters in place, your Kafka broker is ready to be set up. All you need to do is pass this new configuration file as a parameter when you start the broker. Some of the important configurations used in the configuration file are explained here:

broker.id: This should be a nonnegative integer ID. It must be unique within a cluster, as it is used as the broker's name for all intents and purposes. This also allows the broker to be moved to a different host and/or port without additional changes on the consumer's side. Its default value is 0.
host.name: The default value for this is null. If it is not specified, Kafka will bind to all the interfaces on the system. If it is specified, it will bind only to that particular address. If you want clients to connect only to a particular interface, it is a good idea to specify the host name.
port: This defines the port number that the Kafka broker will listen on to accept client connections.
log.dirs: This tells the broker the directory where it should store files for the persistence of messages. You can specify multiple directories here as comma-separated locations. The default value for this is /tmp/kafka-logs.

There's more…

Kafka also lets you specify two more parameters, which are very interesting:

advertised.host.name: This is the hostname that is given out to producers, consumers, and other brokers to connect to. Usually, this is the same as host.name and you need not specify it.
advertised.port: This specifies the port that other producers, consumers, and brokers need to connect to. If not specified, it uses the one mentioned in the port configuration parameter.

The real use case of the preceding parameters is when you make use of bridged connections, where your internal host.name and port number might be different from the ones that external parties need to connect to.
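To make the "pass this new configuration file as a parameter" step concrete, here is a minimal sketch of how the broker is typically started with an edited properties file in a standard Kafka distribution (the paths are assumptions and depend on where Kafka is installed):

    # start a broker using the copied and edited configuration file
    bin/kafka-server-start.sh /path/to/my-server.properties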
Configuring threads and performance

Most of the time, these are settings you need not modify. However, when you want to extract every last bit of performance from your machines, they come in handy.

Getting ready

You have your broker properties file ready and are set to edit it in your favorite editor.

How to do it...

Open your server.properties file and set the following values:

1. Change message.max.bytes:
    message.max.bytes=1000000
2. Set the number of network threads:
    num.network.threads=3
3. Set the number of IO threads:
    num.io.threads=8
4. Set the number of threads that perform background processing:
    background.threads=10
5. Set the maximum number of requests to be queued up:
    queued.max.requests=500
6. Set the send socket buffer size:
    socket.send.buffer.bytes=102400
7. Set the receive socket buffer size:
    socket.receive.buffer.bytes=102400
8. Set the maximum request size:
    socket.request.max.bytes=104857600
9. Set the number of partitions:
    num.partitions=1

How it works…

These network and performance configurations should be set to a level that is optimal for your application, and you might need to experiment a little to arrive at an optimal configuration. Here are some explanations of these configurations:

message.max.bytes: This sets the maximum size of the message that the server can receive. It should be set so as to prevent any producer from inadvertently sending extra-large messages and swamping the consumers. The default value is 1000000.
num.network.threads: This sets the number of threads that handle network requests. If you have a very high volume of incoming requests, you may need to change this value; otherwise, the default of 3 is fine for most use cases.
num.io.threads: This sets the number of threads that are spawned for IO operations. It should be set to at least the number of disks present. The default value is 8.
background.threads: This sets the number of threads that run various background jobs, such as deleting old log files. The default value is 10 and you might not need to change it.
queued.max.requests: This sets the size of the queue that holds pending requests while they wait to be processed by the IO threads. If the queue is full, the network threads stop accepting any more requests. If you have erratic loads in your application, set this to a value at which it does not throttle you.
socket.send.buffer.bytes: This sets the SO_SNDBUF buffer size used for socket connections.
socket.receive.buffer.bytes: This sets the SO_RCVBUF buffer size used for socket connections.
socket.request.max.bytes: This sets the maximum request size the server can receive. It should be smaller than the Java heap size you have set.
num.partitions: This sets the default number of partitions for any topic created without an explicit partition count.

There's more

You might also need to configure your Java installation for maximum performance. This includes settings for heap, socket size, and so on.
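As a hedged example of the Java-side tuning mentioned above: with the stock start scripts, the broker's heap can usually be controlled through the KAFKA_HEAP_OPTS environment variable before launching the broker. The sizes below are placeholders for illustration, not recommendations:

    # give the broker a 4 GB heap (values are illustrative only)
    export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
    bin/kafka-server-start.sh /path/to/my-server.properties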
Configuring log settings

Log settings are perhaps the most important configurations you need to change based on your system requirements.

Getting ready

Just open the server.properties file in your favorite editor.

How to do it...

Open your server.properties file. Here are the default values:

1. Change the log.segment.bytes value:
    log.segment.bytes=1073741824
2. Set the log.roll.{ms,hours} value:
    log.roll.{ms,hours}=168 hours
3. Set the log.cleanup.policy value:
    log.cleanup.policy=delete
4. Set the log.retention.{ms,minutes,hours} value:
    log.retention.{ms,minutes,hours}=168 hours
5. Set the log.retention.bytes value:
    log.retention.bytes=-1
6. Set the log.retention.check.interval.ms value:
    log.retention.check.interval.ms=30000
7. Set the log.cleaner.enable value:
    log.cleaner.enable=false
8. Set the log.cleaner.threads value:
    log.cleaner.threads=1
9. Set the log.cleaner.backoff.ms value:
    log.cleaner.backoff.ms=15000
10. Set the log.index.size.max.bytes value:
    log.index.size.max.bytes=10485760
11. Set the log.index.interval.bytes value:
    log.index.interval.bytes=4096
12. Set the log.flush.interval.messages value:
    log.flush.interval.messages=Long.MaxValue
13. Set the log.flush.interval.ms value:
    log.flush.interval.ms=Long.MaxValue

How it works…

Here is the explanation of the log settings:

log.segment.bytes: This defines the maximum segment size in bytes. Once a segment reaches this size, a new segment file is created. A topic is stored as a bunch of segment files in a directory. This can also be set on a per-topic basis. Its default value is 1 GB.
log.roll.{ms,hours}: This sets the time period after which a new segment file is created even if the current one has not reached the size limit. This can also be set on a per-topic basis. Its default value is 7 days.
log.cleanup.policy: The value for this can be either delete or compact. With the delete option, log segments are deleted periodically when they reach the time threshold or size limit. With the compact option, log compaction is used to clean up obsolete records. This setting can be set on a per-topic basis.
log.retention.{ms,minutes,hours}: This sets the amount of time for which log segments are retained. It can be set on a per-topic basis, and the default is 7 days.
log.retention.bytes: This sets the maximum number of bytes of log per partition that are retained before segments are deleted. It can be set on a per-topic basis. When either the time or the size limit is reached, segments are deleted.
log.retention.check.interval.ms: This sets the time interval at which logs are checked for deletion to meet retention policies. The default value is 5 minutes.
log.cleaner.enable: For log compaction to be enabled, this has to be set to true.
log.cleaner.threads: This sets the number of threads that clean logs for compaction.
log.cleaner.backoff.ms: This defines the interval at which the cleaner checks whether any logs need cleaning.
log.index.size.max.bytes: This setting sets the maximum size allowed for the offset index of each log segment. It can be set on a per-topic basis as well.
log.index.interval.bytes: This defines the byte interval at which a new entry is added to the offset index. For each fetch request, the broker performs a linear scan over a certain number of bytes to find the correct position in the log to begin and end the fetch. Setting this to a larger value means larger index files (and a bit more memory usage) but less scanning.
log.flush.interval.messages: This is the number of messages kept in memory before they are flushed to disk. Though this does not guarantee durability, it gives finer control.
log.flush.interval.ms: This sets the time interval at which messages are flushed to disk.
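Several of these settings are described above as configurable on a per-topic basis. As a hedged illustration, in the Kafka versions contemporary with this article, per-topic overrides could be supplied when creating a topic with the kafka-topics.sh tool; the topic name and values here are hypothetical, and the exact flags vary between Kafka versions:

    # create a topic with its own retention and cleanup policy overrides
    bin/kafka-topics.sh --zookeeper localhost:2181 --create \
      --topic example-topic --partitions 1 --replication-factor 1 \
      --config retention.ms=86400000 --config cleanup.policy=compact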
There's more

Some other settings are listed at http://kafka.apache.org/documentation.html#brokerconfigs.

See also

More on log compaction is available at http://kafka.apache.org/documentation.html#compaction.

Configuring replica settings

You will also want to set up replication for reliability purposes. Let's see some of the important settings you need to handle for replication to work best for you.

Getting ready

Open the server.properties file in your favorite editor.

How to do it...

Open your server.properties file. Here are the default values for the settings:

1. Set the default.replication.factor value:
    default.replication.factor=1
2. Set the replica.lag.time.max.ms value:
    replica.lag.time.max.ms=10000
3. Set the replica.lag.max.messages value:
    replica.lag.max.messages=4000
4. Set the replica.fetch.max.bytes value:
    replica.fetch.max.bytes=1048576
5. Set the replica.fetch.wait.max.ms value:
    replica.fetch.wait.max.ms=500
6. Set the num.replica.fetchers value:
    num.replica.fetchers=1
7. Set the replica.high.watermark.checkpoint.interval.ms value:
    replica.high.watermark.checkpoint.interval.ms=5000
8. Set the fetch.purgatory.purge.interval.requests value:
    fetch.purgatory.purge.interval.requests=1000
9. Set the producer.purgatory.purge.interval.requests value:
    producer.purgatory.purge.interval.requests=1000
10. Set the replica.socket.timeout.ms value:
    replica.socket.timeout.ms=30000
11. Set the replica.socket.receive.buffer.bytes value:
    replica.socket.receive.buffer.bytes=65536

How it works…

Here is the explanation of the preceding settings:

default.replication.factor: This sets the default replication factor for automatically created topics.
replica.lag.time.max.ms: If the leader does not receive any fetch requests from a follower within this time period, the follower is moved out of the in-sync replicas and treated as dead.
replica.lag.max.messages: This is the maximum number of messages by which a follower can lag behind the leader before it is considered dead and not in sync.
replica.fetch.max.bytes: This sets the maximum number of bytes of data that a follower will fetch in a request from its leader.
replica.fetch.wait.max.ms: This sets the maximum amount of time for the leader to respond to a replica's fetch request.
num.replica.fetchers: This specifies the number of threads used to replicate messages from the leader. Increasing the number of threads increases the IO rate to a degree.
replica.high.watermark.checkpoint.interval.ms: This specifies the frequency with which each replica saves its high watermark to disk for recovery.
fetch.purgatory.purge.interval.requests: This sets the purge interval of the fetch request purgatory, which is where fetch requests are kept on hold until they can be serviced.
producer.purgatory.purge.interval.requests: This sets the purge interval of the producer request purgatory, which is where producer requests are kept on hold until they can be serviced.

There's more

Some other settings are listed at http://kafka.apache.org/documentation.html#brokerconfigs.
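Note that default.replication.factor only applies to automatically created topics; for explicitly created topics, the replication factor is given at creation time. A hedged sketch (the topic name is hypothetical, and the flags are those of the Kafka tooling of this era):

    # explicitly create a topic replicated across three brokers
    bin/kafka-topics.sh --zookeeper localhost:2181 --create \
      --topic orders --partitions 3 --replication-factor 3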
Configuring the ZooKeeper settings

ZooKeeper is used in Kafka for cluster management and to maintain the details of topics.

Getting ready

Just open the server.properties file in your favorite editor.

How to do it…

Open your server.properties file. Here are the default values for the settings:

1. Set the zookeeper.connect property:
    zookeeper.connect=127.0.0.1:2181,192.168.0.32:2181
2. Set the zookeeper.session.timeout.ms property:
    zookeeper.session.timeout.ms=6000
3. Set the zookeeper.connection.timeout.ms property:
    zookeeper.connection.timeout.ms=6000
4. Set the zookeeper.sync.time.ms property:
    zookeeper.sync.time.ms=2000

How it works…

Here is the explanation of these settings:

zookeeper.connect: This is where you specify the ZooKeeper connection string in the form of hostname:port. You can use comma-separated values to specify multiple ZooKeeper nodes. This ensures the reliability and continuity of the Kafka cluster even if a ZooKeeper node is down. ZooKeeper allows you to use a chroot path to make the data of a particular Kafka cluster available only under that path. This enables you to have the same ZooKeeper cluster support multiple Kafka clusters. Here is how to specify the connection string in this case:

    host1:port1,host2:port2,host3:port3/chroot/path

The preceding statement puts all the cluster data under the /chroot/path path. This path must be created prior to starting the Kafka cluster, and consumers must use the same string.

zookeeper.session.timeout.ms: This specifies the time within which, if a heartbeat is not received from the broker, it is considered dead. This value must be chosen carefully: if the interval is too long, a dead broker will not be detected in time, which leads to issues; if it is too short, a live broker might be considered dead.
zookeeper.connection.timeout.ms: This specifies the maximum time that the client waits to establish a connection with ZooKeeper.
zookeeper.sync.time.ms: This specifies the time period by which a ZooKeeper follower can be behind its leader.

See also

The ZooKeeper management details from the Kafka perspective are highlighted at http://kafka.apache.org/documentation.html#zk. You can find ZooKeeper at https://zookeeper.apache.org/.
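Putting the chroot idea into the broker configuration itself, the relevant line in server.properties might look like the following sketch (the host names and the chroot path are placeholders):

    zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka/cluster1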
Configuring other miscellaneous parameters

Besides the configurations mentioned previously, there are some other configurations that also need to be set.

Getting ready

Open the server.properties file in your favorite editor. We will look at the default values of the properties in the following section.

How to do it...

1. Set the auto.create.topics.enable property:
    auto.create.topics.enable=true
2. Set the controlled.shutdown.enable property:
    controlled.shutdown.enable=true
3. Set the controlled.shutdown.max.retries property:
    controlled.shutdown.max.retries=3
4. Set the controlled.shutdown.retry.backoff.ms property:
    controlled.shutdown.retry.backoff.ms=5000
5. Set the auto.leader.rebalance.enable property:
    auto.leader.rebalance.enable=true
6. Set the leader.imbalance.per.broker.percentage property:
    leader.imbalance.per.broker.percentage=10
7. Set the leader.imbalance.check.interval.seconds property:
    leader.imbalance.check.interval.seconds=300
8. Set the offset.metadata.max.bytes property:
    offset.metadata.max.bytes=4096
9. Set the max.connections.per.ip property:
    max.connections.per.ip=Int.MaxValue
10. Set the connections.max.idle.ms property:
    connections.max.idle.ms=600000
11. Set the unclean.leader.election.enable property:
    unclean.leader.election.enable=true
12. Set the offsets.topic.num.partitions property:
    offsets.topic.num.partitions=50
13. Set the offsets.topic.retention.minutes property:
    offsets.topic.retention.minutes=1440
14. Set the offsets.retention.check.interval.ms property:
    offsets.retention.check.interval.ms=600000
15. Set the offsets.topic.replication.factor property:
    offsets.topic.replication.factor=3
16. Set the offsets.topic.segment.bytes property:
    offsets.topic.segment.bytes=104857600
17. Set the offsets.load.buffer.size property:
    offsets.load.buffer.size=5242880
18. Set the offsets.commit.required.acks property:
    offsets.commit.required.acks=-1
19. Set the offsets.commit.timeout.ms property:
    offsets.commit.timeout.ms=5000

How it works…

An explanation of the settings is as follows:

auto.create.topics.enable: Setting this to true makes sure that if you fetch metadata for, or produce messages to, a nonexistent topic, it will be created automatically. Ideally, in a production environment, you should set this value to false.
controlled.shutdown.enable: Setting this to true makes sure that when shutdown is called on the broker, and it is the leader of any topic partitions, it gracefully moves leadership to another broker before shutting down. This increases the overall availability of the system.
controlled.shutdown.max.retries: This sets the maximum number of retries the broker makes to perform a controlled shutdown before performing an unclean one.
controlled.shutdown.retry.backoff.ms: This sets the backoff time between controlled shutdown retries.
auto.leader.rebalance.enable: If this is set to true, the broker will automatically try to balance partition leadership among the brokers by periodically giving leadership back to the preferred replica of each partition, if it is available.
leader.imbalance.per.broker.percentage: This sets the percentage of leader imbalance allowed per broker. The cluster will rebalance leadership if this ratio goes above the set value.
leader.imbalance.check.interval.seconds: This defines the time period for checking leader imbalance.
offset.metadata.max.bytes: This defines the maximum amount of metadata a client is allowed to store along with its offset commit.
max.connections.per.ip: This sets the maximum number of connections that the broker accepts from a given IP address.
connections.max.idle.ms: This sets the maximum time a socket connection can remain idle before the broker closes it.
unclean.leader.election.enable: Setting this to true allows replicas that are not in the in-sync replica set (ISR) to become the leader. This can lead to data loss.
It is, however, a last resort for keeping the cluster available.
offsets.topic.num.partitions: This sets the number of partitions for the offsets commit topic. It cannot be changed after deployment, so it is suggested that it be set to a higher value. The default is 50.
offsets.topic.retention.minutes: Offsets older than this retention period are marked for deletion. Actual deletion occurs when the log cleaner runs compaction on the offsets topic.
offsets.retention.check.interval.ms: This sets the time interval at which stale offsets are checked for.
offsets.topic.replication.factor: This sets the replication factor for the offsets commit topic. The higher the value, the higher the availability. If, at the time the offsets topic is created, the number of brokers is lower than the replication factor, the number of replicas created will be equal to the number of brokers.
offsets.topic.segment.bytes: This sets the segment size for the offsets topic. Keeping it relatively low leads to faster log compaction and loading.
offsets.load.buffer.size: This sets the buffer size used for reading offsets segments into the offset manager's cache.
offsets.commit.required.acks: This sets the number of acknowledgements required before an offset commit can be accepted.
offsets.commit.timeout.ms: This sets how long an offset commit is held before it fails, in case the required number of replicas have not received it.

See also

There are more broker configurations available. Read more about them at http://kafka.apache.org/documentation.html#brokerconfigs.

Summary

In this article, we discussed setting basic configurations for the Kafka broker, and configuring and managing threads, performance, logs, and replicas. We also discussed the ZooKeeper settings that are used for cluster management, and some miscellaneous parameters.

Resources for Article:

Further resources on this subject:
Writing Consumers [article]
Introducing Kafka [article]
Testing With Groovy [article]


Understanding Text Search and Hierarchies in SAP HANA

Packt
20 Oct 2015
9 min read
In this article by Vinay Singh, author of the book Real Time Analytics with SAP HANA, we cover Full Text Search and hierarchies in SAP HANA, and how to create and use them in our data models. After completing this article, you should be able to:

Create and use Full Text Search
Create hierarchies: level and parent-child hierarchies

(For more resources related to this topic, see here.)

Creating and using Full Text Search

Before we proceed with the creation and use of Full Text Search, let's quickly go through the basic terms associated with it. They are as follows:

Text Analysis: This is the process of analyzing unstructured text, extracting relevant information, and then transforming this information into structured information that can be leveraged in different ways. SAP HANA provides analysis rules for many industries in many languages, which offer additional possibilities to analyze strings or large text columns.
Full Text Search: This capability of HANA significantly speeds up searches within large amounts of text data. The primary function of Full Text Search is to optimize linguistic searches.
Fuzzy Search: This functionality enables you to find strings that match a pattern approximately (rather than exactly). It is a fault-tolerant search, meaning that a query returns records even if the search term contains additional or missing characters, or even spelling mistakes. It is an alternative to a non-fault-tolerant SQL statement.
The score() function: When using contains() in the where clause of a select statement, the score() function can be used to retrieve the score. This is a numeric value between 0.0 and 1.0. The score defines the similarity between the user input and the records returned by the search. A score of 0.0 means that there is no similarity. The higher the score, the more similar a record is to the search input.

Some of the applied uses of fuzzy search are:

A fault-tolerant check for duplicate records. It helps to prevent duplicate entries in systems by searching for similar existing entries.
Fault-tolerant search in text columns: for example, search documents for diode and find all documents that contain the term triode.
Fault-tolerant search in structured database content: search for rhyming words, for example coffee krispy biscuit, and find toffee crisp biscuits (the standard example given by SAP).

Let's see the use cases for text search:

Combining structured and unstructured data
Medicine and healthcare
Patents
Brand monitoring and the buying patterns of consumers
Real-time analytics on a large volume of data
Data from social media
Finance data
Sales optimization
Monitoring and production planning

The results of text analysis are stored in a table and can therefore be leveraged in all the HANA-supported scenarios:

Standard analytics: Create analytical views and calculation views on top. For example, companies mentioned in news articles over time.
Data mining and predictive analysis: Using R and Predictive Analysis Library (PAL) functions. For example, clustering, time series analysis, and so on.
Search-based applications: Create a search model and build a search UI with the HANA Info Access (InA) toolkit for HTML5. Text analysis results can be used to navigate and filter search results. For example, a people finder, or a search UI for internal documents.

The capabilities of HANA Full Text Search and text analysis are as follows:

Native full text search
Database text analysis
Graphical modeling of search models
The Info Access toolkit for HTML5 UIs
The benefits of full text search are:

Extract unstructured content with no additional cost
Combine structured and unstructured information for unified information access
Less data duplication and transfer
Harness the benefits of the InA (Info Access) toolkit for HTML5 applications

The following data types are supported by fuzzy search:

Short text
Text
VARCHAR
NVARCHAR
Date
Data with a full text index

Enabling the search option

Before we can use the search option in any attribute or analytical view, we need to enable this functionality in the SAP HANA Studio Preferences, as shown in the following screenshot:

We are now well prepared to move ahead with the creation and use of Full Text Search. Let's do this step by step:

1. Create the schema and set it as the current schema:
    CREATE SCHEMA DEMO; // it may already be present from our previous exercises
    SET SCHEMA DEMO;    // set the schema name
2. Create a column table including FUZZY SEARCH indexed columns:
    DROP TABLE DEMO.searchtbl_FUZZY;
    CREATE COLUMN TABLE DEMO.searchtbl_FUZZY (
      CUST_NAME   TEXT FUZZY SEARCH INDEX ON,
      CUST_COUNTY TEXT FUZZY SEARCH INDEX ON,
      CUST_DEPT   TEXT FUZZY SEARCH INDEX ON
    );
3. Prepare the fuzzy search logic (SQL):

Search for customers in counties that contain the word MAIN:

    SELECT score() AS score, *
    FROM searchtbl_FUZZY
    WHERE CONTAINS(cust_county, 'MAIN');

Search for customers in counties that contain the word West, with the fuzzy parameter set to 0.3:

    SELECT score() AS score, *
    FROM searchtbl_FUZZY
    WHERE CONTAINS(cust_county, 'West', FUZZY(0.3));

Perform a fuzzy search for customers working in a department whose name includes the word Department:

    SELECT highlighted(cust_dept), score() AS score, *
    FROM searchtbl_FUZZY
    WHERE CONTAINS(cust_dept, 'Department', FUZZY(0.5));

Perform a fuzzy search across all columns, looking for the word Customer:

    SELECT score() AS score, *
    FROM searchtbl_FUZZY
    WHERE CONTAINS(*, 'Customer', FUZZY(0.5));
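The recipe above creates the table but does not show any sample rows, so the queries would come back empty as written. Here is a hedged sketch of some test data (the values are invented purely for illustration) that would give the fuzzy queries something to match:

    INSERT INTO DEMO.searchtbl_FUZZY VALUES ('John Smith', 'West Mainland', 'Sales Department');
    INSERT INTO DEMO.searchtbl_FUZZY VALUES ('Jane Doe',   'Mainland',      'Customer Service');
    INSERT INTO DEMO.searchtbl_FUZZY VALUES ('Max Weber',  'Westland',      'IT Departmnt');

With rows like these, the FUZZY(0.3) search on West should also pick up near matches such as Westland, and the misspelled IT Departmnt should still be found by the Department query.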
Creating hierarchies

Hierarchies are created to maintain data in a structured format, such as maintaining customer or employee data based on their roles, or splitting data based on geographies. Hierarchical data is very useful for organizational purposes during decision making. Two types of hierarchies can be created in SAP HANA:

The level hierarchy
The parent-child hierarchy

Hierarchies are initially created in an attribute view and can later be combined in an analytic view or calculation view for consumption in a report, as per business requirements. Let's create both types of hierarchies in attribute views.

Creating a level hierarchy

Each level represents a position in the hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Each level above the base level contains aggregate values for the levels below it. Proceed as follows:

1. Create a new attribute view (for your own practice, I would suggest you create a new one). You can also use an existing one. Use the SNWD_PD EPM sample table.
2. In the output view, mark the following as output:
3. In the semantic node of the view, create a new hierarchy as shown in the following screenshot and fill in the details:
4. Save and activate the view. Now the hierarchy is ready to be used in an analytical view.
5. Add the client and node key again as output of the attribute view you just created, that is, AT_LEVEL_HIERARCY_DEMO, as we will use these two fields when we create an analytical view. It should look like the following screenshot.
6. Add the attribute view created in the preceding step and the SNWD_SO_I table to the data foundation:
7. Join client to client and product GUID to node key:
8. Save and activate.
9. Go to MS Excel: All Programs | Microsoft Office | Microsoft Excel 2010, then go to the Data tab | From Other Sources | From Data Connection Wizard. You will get a new popup for the Data Connection Wizard; choose Other/Advanced | SAP HANA MDX Provider:
10. You will be asked to provide the connection details; fill in the details and test the connection (these are the same details that you used while adding the system to SAP HANA Studio).
11. The Data Connection Wizard will now ask you to choose the analytical view (choose the one that you just created in the preceding step):
12. The preceding steps will take you to an Excel sheet, and you will see data according to the choices you made in the PivotTable field list:

Creating a parent-child hierarchy

The parent-child hierarchy is a simple, two-level hierarchy where the child element has an attribute containing the parent element. These two columns define the hierarchical relationships among the members of the dimension. The first column, called the member key column, identifies each dimension member. The other column, called the parent column, identifies the parent of each dimension member. The parent attribute determines the name of each level in the parent-child hierarchy and determines whether the data for parent members should be displayed.

Let's create a parent-child hierarchy using the following steps:

1. Create an attribute view.
2. Create a table that holds the parent-child information. The following is the sample code and the insert statements:

    CREATE COLUMN TABLE "DEMO"."CCTR_HIE"(
      "CC_CHILD"  NVARCHAR(4),
      "CC_PARENT" NVARCHAR(4));

    insert into "DEMO"."CCTR_HIE" values('','');
    insert into "DEMO"."CCTR_HIE" values('C11','c1');
    insert into "DEMO"."CCTR_HIE" values('C12','c1');
    insert into "DEMO"."CCTR_HIE" values('C13','c1');
    insert into "DEMO"."CCTR_HIE" values('C14','c2');
    insert into "DEMO"."CCTR_HIE" values('C21','c2');
    insert into "DEMO"."CCTR_HIE" values('C22','c2');
    insert into "DEMO"."CCTR_HIE" values('C31','c3');
    insert into "DEMO"."CCTR_HIE" values('C1','c');
    insert into "DEMO"."CCTR_HIE" values('C2','c');
    insert into "DEMO"."CCTR_HIE" values('C3','c');

3. Add the preceding table to the data foundation of the attribute view.
4. Make CC_CHILD the key attribute.
5. Now create a new hierarchy as shown in the following screenshot:
6. Save and activate the hierarchy.
7. Create a new analytical view, and add the HIE_PARENT_CHILD_DEMO view and the CCTR_COST table to the data foundation.
8. Join CCTR to CC_CHILD with a many-to-one relationship.
9. Make sure that, in the semantic node, COST is set as a measure.
10. Save and activate the analytical view, then preview the data.

As per the business need, we can use either of the two hierarchies along with an attribute view or analytical view.

Summary

In this article, we took a deep dive into Full Text Search, fuzzy logic, and hierarchy concepts. We learned how to create and use text search and fuzzy logic. The parent-child and level hierarchies were discussed in detail, with a hands-on approach to both.

Resources for Article:

Further resources on this subject:
Sabermetrics with Apache Spark [article]
Meeting SAP Lumira [article]
Achieving High-Availability on AWS Cloud [article]


QlikView Tips and Tricks

Packt
20 Oct 2015
6 min read
In this article by Andrew Dove and Roger Stone, authors of the book QlikView Unlocked, we will cover the following key topics:

A few coding tips
The surprising data sources
Include files
Change logs

(For more resources related to this topic, see here.)

A few coding tips

There are many ways to improve things in QlikView. Some are techniques and others are simply useful things to know or do. Here are a few of our favourite ones.

Keep the coding style constant

There's actually more to this than just being a tidy developer. So, always code your function names in the same way; it doesn't matter which style you use (unless you have installation standards that require a particular style). For example, you could use MonthStart(), monthstart(), or MONTHSTART(). They're all equally valid, but for consistency, choose one and stick to it.

Use MUST_INCLUDE rather than INCLUDE

This feature wasn't documented at all until quite a late service release of v11.2; however, it's very useful. If you use INCLUDE and the file you're trying to include can't be found, QlikView will silently ignore it. The consequences of this are unpredictable, ranging from strange behaviour to outright script failure. If you use MUST_INCLUDE, QlikView will complain that the included file is missing, and you can fix the problem before it causes other issues. Actually, it seems strange that INCLUDE doesn't do this, but Qlik must have its reasons. Nevertheless, always use MUST_INCLUDE to save yourself some time and effort.

Put version numbers in your code

QlikView doesn't have a versioning system as such, and we have yet to see one that works effectively with QlikView. So, this requires some discipline on the part of the developer. Devise a versioning system and always place the version number in a variable that is displayed somewhere in the application. It doesn't matter whether you update this number for every single change, but ensure that it's updated for every release to the user and that it ties in with your own release logs.

Do stringing in the script and not in screen objects

We would have included this anyway, but its place in the article was assured by a recent experience on a user site. They wanted four lines of address and a postcode strung together into a single field, with each part separated by a comma and a space. However, any field could contain nulls; so, to avoid addresses such as ',,,,' or ', Somewhere ,,,', there had to be a check for null on every field as the fields were strung together. The table only contained about 350 rows, but it took 56 seconds to refresh on screen when the work was done in an expression in a straight table. Moving the expression to the script and presenting just the resulting single field on screen took only 0.14 seconds. (That's right; it's about a seventh of a second.) Plus, it didn't adversely affect script performance. We can't think of a better example of improving screen performance.
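As a hedged sketch of what "stringing in the script" might look like for the address example above (the table, field names, and source file are invented for illustration, not taken from the original site):

    // Build the full address once, at load time, instead of in a chart expression
    Addresses:
    LOAD
        AddressId,
        // Prefix each non-empty part with ', ', then strip the leading separator with Mid()
        Mid(
            If(Len(Trim(Address1)) > 0, ', ' & Address1, '') &
            If(Len(Trim(Address2)) > 0, ', ' & Address2, '') &
            If(Len(Trim(Address3)) > 0, ', ' & Address3, '') &
            If(Len(Trim(Address4)) > 0, ', ' & Address4, '') &
            If(Len(Trim(Postcode)) > 0, ', ' & Postcode, ''),
            3) AS FullAddress
    FROM Addresses.qvd (qvd);

The null/empty check happens once per row here, so the chart only has to display FullAddress rather than evaluate five conditions per cell.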
The surprising data sources

QlikView will read database tables, spreadsheets, XML files, and text files, but did you know that it can also take data from a web page? If you need some standard data from the Internet, there's no need to create your own version. Just grab it from a web page! How about ISO country codes? Here's an example:

1. Open the script and click on Web files… below Data from Files, to the right of the bottom section of the screen. This will open the File Wizard: Source dialogue, as in the following screenshot.
2. Enter the URL where the table of data resides.
3. Then click on Next and, in this case, select @2 under Tables, as shown in the following screenshot.
4. Click on Finish and your script will look something similar to this:

    LOAD F1,
         Country,
         A2,
         A3,
         Number
    FROM [http://www.airlineupdate.com/content_public/codes/misc_codes/icao_nat.htm]
    (html, codepage is 1252, embedded labels, table is @2);

Now you've got a great lookup table in about 30 seconds; it will take another few seconds to clean it up for your own purposes. One small caveat, though: web pages can change address, content, and structure, so it's worth putting in some validation around this if you think there could be any volatility.

Include files

We have already said that you should use MUST_INCLUDE rather than INCLUDE, but we're always surprised that many developers never use include files at all. If the same code needs to be used in more than one place, it really should be in an include file. Suppose that you have several documents that use C:\QlikFiles\Finance\Budgets.xlsx and that the folder name is hard-coded in all of them. As soon as the file is moved to another location, you will have several modifications to make, and it's easy to miss changing a document because you may not even realise it uses the file. The solution is simple, very effective, and guaranteed to save you many reload failures. Instead of coding the full folder name, create something similar to this:

    LET vBudgetFolder='C:\QlikFiles\Finance\';

Put the line into an include file, for instance, FolderNames.inc. Then, code this into each script as follows:

    $(MUST_INCLUDE=FolderNames.inc)

Finally, when you want to refer to your Budgets.xlsx spreadsheet, code this:

    $(vBudgetFolder)Budgets.xlsx

Now, if the folder path has to change, you only need to change one line of code in the include file, and everything will work fine as long as you implement include files in all your documents. Note that this works just as well for folders containing QVD files and so on. You can also use this technique to include LOADs from QVDs or spreadsheets, because you should always aim to have just one version of the truth.

Change logs

Unfortunately, one of the things QlikView is not great at is version control. It can be really hard to see what has been done between versions of a document, and using the -prj folder feature can be extremely tedious and not necessarily helpful. So, this means that you, as the developer, need to maintain some discipline over version control. To do this, ensure that you have an area of comments that looks something similar to this right at the top of your script:

    // Demo.qvw
    //
    // Roger Stone - One QV Ltd - 04-Jul-2015
    //
    // PURPOSE
    // Sample code for QlikView Unlocked - Chapter 6
    //
    // CHANGE LOG
    // Initial version 0.1
    // - Pull in ISO table from Internet and local Excel data
    //
    // Version 0.2
    // Remove unused fields and rename incoming ISO table fields to
    // match local spreadsheet
    //

Ensure that you update this every time you make a change. You could make it even more helpful by explaining why the change was made, and not just what change was made. You should also comment the expressions in charts when they are changed.

Summary

In this article, we covered a few coding tips, the surprising data sources, include files, and change logs.

Resources for Article:

Further resources on this subject:
Qlik Sense's Vision [Article]
Securing QlikView Documents [Article]
Common QlikView script errors [Article]


SQL Server with PowerShell

Packt
19 Oct 2015
8 min read
In this article by Donabel Santos, author of the book SQL Server 2014 with PowerShell v5 Cookbook, we look at scripts and snippets of code that accomplish basic SQL Server tasks using PowerShell. She discusses simple tasks such as listing SQL Server instances and discovering SQL Server services to make you comfortable working with SQL Server programmatically. However, even as you explore how to create some common database objects using PowerShell, keep in mind that PowerShell will not always be the best tool for the task. There will be tasks that are best completed using T-SQL. It is still good to know what is possible in PowerShell and how to do it, so that you know you have alternatives depending on your requirements or situation.

For the recipes, we are going to use PowerShell ISE quite a lot. If you prefer running scripts from the PowerShell console rather than running the commands from the ISE, you can save the scripts in a .ps1 file and run them from the PowerShell console.

(For more resources related to this topic, see here.)

Listing SQL Server instances

In this recipe, we will list all SQL Server instances in the local network.

Getting ready

Log in to the server that has your SQL Server development instance as an administrator.

How to do it...

Let's look at the steps to list your SQL Server instances:

1. Open PowerShell ISE as administrator.
2. Let's use the Start-Service cmdlet to start the SQL Browser service:

    Import-Module SQLPS -DisableNameChecking

    # out of the box, the SQLBrowser is disabled. To enable:
    Set-Service SQLBrowser -StartupType Automatic

    # the SQL Browser must be installed and running for us
    # to discover SQL Server instances
    Start-Service "SQLBrowser"

3. Next, you need to create a ManagedComputer object to get access to instances. Type the following script and run it:

    $instanceName = "localhost"
    $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName

    # list server instances
    $managedComputer.ServerInstances

Your result should look similar to the one shown in the following screenshot. Notice that $managedComputer.ServerInstances gives you not only instance names, but also additional properties such as ServerProtocols, Urn, State, and so on.

4. Confirm that these are the same instances you see from SQL Server Management Studio.
5. Open SQL Server Management Studio. Go to Connect | Database Engine.
6. In the Server Name dropdown, click on Browse for More.
7. Select the Network Servers tab and check the instances listed. Your screen should look similar to this:

How it works...

All services in a Windows operating system are exposed and accessible using Windows Management Instrumentation (WMI). WMI is Microsoft's framework for listing, setting, and configuring any Microsoft-related resource. This framework follows Web-Based Enterprise Management (WBEM). The Distributed Management Task Force, Inc. (http://www.dmtf.org/standards/wbem) defines WBEM as follows:

A set of management and Internet standard technologies developed to unify the management of distributed computing environments. WBEM provides the ability for the industry to deliver a well-integrated set of standard-based management tools, facilitating the exchange of data across otherwise disparate technologies and platforms.
In order to access SQL Server WMI-related objects, you can create a WMI ManagedComputer instance:

    $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName

The ManagedComputer object has access to a ServerInstances property, which in turn lists all available instances in the local network. These instances, however, are only discoverable if the SQL Server Browser service is running. The SQL Server Browser is a Windows service that can provide information on installed instances in a box. You need to start this service if you want to list the SQL Server-related services.

There's more...

The Services property of the ManagedComputer object can also provide similar information, but you will have to filter for the server type SqlServer:

    # list server instances
    $managedComputer.Services |
        Where-Object Type -eq "SqlServer" |
        Select-Object Name, State, Type, StartMode, ProcessId

Your result should look like this:

Instead of creating a WMI instance by using the New-Object method, you can also use the Get-WmiObject cmdlet when creating your variable. Get-WmiObject, however, will not expose exactly the same properties exposed by the Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer object. To list instances using Get-WmiObject, you will need to discover what namespace is available in your environment:

    $hostName = "localhost"
    $namespace = Get-WMIObject -ComputerName $hostName -Namespace root\Microsoft\SQLServer -Class "__NAMESPACE" |
        Where-Object Name -like "ComputerManagement*"

    # see matching namespace objects
    $namespace

    # see namespace names
    $namespace | Select-Object -ExpandProperty "__NAMESPACE"
    $namespace | Select-Object -ExpandProperty "Name"

If you are using PowerShell v2, you will have to change the Where-Object cmdlet usage to use the curly braces {} and the $_ variable:

    Where-Object {$_.Name -like "ComputerManagement*"}

For SQL Server 2014, the namespace value is:

    ROOT\Microsoft\SQLServer\ComputerManagement12

This value can be derived from $namespace.__NAMESPACE and $namespace.Name. Once you have the namespace, you can use it with Get-WmiObject to retrieve the instances. We can use the SqlServiceType property to filter. According to MSDN (http://msdn.microsoft.com/en-us/library/ms179591.aspx), these are the values of SqlServiceType:

SqlServiceType - Description
1 - SQL Server Service
2 - SQL Server Agent Service
3 - Full-Text Search Engine Service
4 - Integration Services Service
5 - Analysis Services Service
6 - Reporting Services Service
7 - SQL Browser Service

Thus, to retrieve the SQL Server instances, we need to provide the full namespace ROOT\Microsoft\SQLServer\ComputerManagement12. We also need to filter for the SQL Server Service type, or SQLServiceType = 1. The code is as follows:

    Get-WmiObject -ComputerName $hostName -Namespace "$($namespace.__NAMESPACE)\$($namespace.Name)" -Class SqlService |
        Where-Object SQLServiceType -eq 1 |
        Select-Object ServiceName, DisplayName, SQLServiceType |
        Format-Table -AutoSize

Your result should look similar to the following screenshot:

Yet another way to list all the SQL Server instances in the local network is by using the System.Data.Sql.SqlDataSourceEnumerator class, instead of ManagedComputer.
This class has a static method called Instance.GetDataSources that will list all SQL Server instances:

    [System.Data.Sql.SqlDataSourceEnumerator]::Instance.GetDataSources() |
        Format-Table -AutoSize

When you execute it, your result should look similar to the following:

If you have multiple SQL Server versions, you can use the following code to display your instances:

    # list services using WMI
    foreach ($path in $namespace)
    {
        Write-Verbose "SQL Services in: $($path.__NAMESPACE)\$($path.Name)"

        Get-WmiObject -ComputerName $hostName `
            -Namespace "$($path.__NAMESPACE)\$($path.Name)" `
            -Class SqlService |
        Where-Object SQLServiceType -eq 1 |
        Select-Object ServiceName, DisplayName, SQLServiceType |
        Format-Table -AutoSize
    }

Discovering SQL Server services

In this recipe, we will enumerate all SQL Server services and list their statuses.

Getting ready

Check which SQL Server services are installed in your instance. Go to Start | Run and type services.msc. You should see a screen similar to this:

How to do it...

Let's assume you are running this script on the server box:

1. Open PowerShell ISE as administrator.
2. Add the following code and execute it:

    Import-Module SQLPS -DisableNameChecking

    # you can replace localhost with your instance name
    $instanceName = "localhost"
    $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName

    # list services
    $managedComputer.Services |
        Select-Object Name, Type, ServiceState, DisplayName |
        Format-Table -AutoSize

Your result will look similar to the one shown in the following screenshot. The items listed on your screen will vary depending on the features installed and running in your instance.

3. Confirm that these are the services that exist on your server; check your services window.

How it works...

Services that are installed on a system can be queried using WMI. Specific services for SQL Server are exposed through SMO's WMI ManagedComputer object. Some of the exposed properties are as follows:

ClientProtocols
ConnectionSettings
ServerAliases
ServerInstances
Services

There's more...

An alternative way to get SQL Server-related services is by using Get-WMIObject. We will need to pass in the host name as well as the SQL Server WMI provider for the ComputerManagement namespace. For SQL Server 2014, this value is ROOT\Microsoft\SQLServer\ComputerManagement12. The script to retrieve the services is provided here. Note that we are dynamically composing the WMI namespace. The code is as follows:

    $hostName = "localhost"
    $namespace = Get-WMIObject -ComputerName $hostName -Namespace root\Microsoft\SQLServer -Class "__NAMESPACE" |
        Where-Object Name -like "ComputerManagement*"

    Get-WmiObject -ComputerName $hostName -Namespace "$($namespace.__NAMESPACE)\$($namespace.Name)" -Class SqlService |
        Select-Object ServiceName

If you have multiple SQL Server versions installed and want to see just the most recent version's services, you can limit the query to the latest namespace by adding Select-Object -Last 1:

    $namespace = Get-WMIObject -ComputerName $hostName -Namespace root\Microsoft\SQLServer -Class "__NAMESPACE" |
        Where-Object Name -like "ComputerManagement*" |
        Select-Object -Last 1

Yet another alternative, but less accurate, way of listing possible SQL Server-related services is the following snippet of code:

    # alternative - but less accurate
    Get-Service *SQL*

This uses the Get-Service cmdlet and filters based on the service name. It is less accurate because it grabs all services that have SQL in the name, which may not necessarily be related to SQL Server.
For example, if you have MySQL installed, it will get picked up. Conversely, this will not pick up SQL Server-related services that do not have SQL in the name, such as ReportServer.

Summary

You will find that many of these tasks can be accomplished using PowerShell and SQL Server Management Objects (SMO). SMO is a library that exposes SQL Server classes that allow programmatic manipulation and automation of many database tasks. For some recipes, we will also explore alternative ways of accomplishing the same tasks using different native PowerShell cmdlets. Now that we have a gist of SQL Server 2014 with PowerShell, let's build a full-fledged e-commerce project with SQL Server 2014 with PowerShell v5 Cookbook.

Resources for Article:

Further resources on this subject:
Exploring Windows PowerShell 5.0 [article]
Working with PowerShell [article]
Installing/upgrading PowerShell [article]


An Overview of Oozie

Packt
19 Oct 2015
5 min read
In this article by Jagat Singh, the author of the book Apache Oozie Essentials, we will see a basic overview of Oozie and its concepts in brief.

(For more resources related to this topic, see here.)

Concepts

Oozie is a workflow scheduler system used to run Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graph (DAG) (https://en.wikipedia.org/wiki/Directed_acyclic_graph) representations of actions. Actions tell the job what to do. Oozie supports running jobs of various types, such as Java, MapReduce, Pig, Hive, Sqoop, Spark, and DistCp. The output of one action can be consumed by the next action to create a chained sequence. Oozie has a client-server architecture: we install the server for storing the jobs, and we submit our jobs to the server using the client. Let's get an idea of a few basic Oozie concepts.

Workflow

A workflow tells Oozie what to do. It is a collection of actions arranged in the required dependency graph. So, as part of a workflow definition, we write some actions and call them in a certain order. These actions come in various types for the tasks we can do as part of a workflow: for example, a Hadoop filesystem action, Pig action, Hive action, MapReduce action, Spark action, and so on.

Coordinator

A coordinator tells Oozie when to do something. Coordinators let us run interdependent workflows as data pipelines based on some starting criteria. Most Oozie jobs are triggered at a given scheduled time interval, or when an input dataset is present to trigger the job. The following definitions are important for coordinators:

Nominal time: The scheduled time at which the job should execute. For example, we process press releases every day at 8:00 PM.
Actual time: The real time when the job ran. In some cases, if the input data does not arrive, the job might start late. This type of data-dependent job triggering is indicated by a done-flag (more on this later). The done-flag gives the signal to start the job execution.

The general skeleton template of a coordinator is shown in the following figure:

Bundles

Bundles tell Oozie which things to do together as a group. For example, a set of coordinators that can be run together to satisfy a given business requirement can be combined as a bundle.

Book case study

One of the main use cases of Hadoop is ETL data processing. Suppose we work for a large consulting company and have won a project to set up a Big Data cluster inside a customer's data center. At a high level, the requirement is to set up an environment that will satisfy the following flow:

We get data from various sources into Hadoop (file-based loads and Sqoop-based loads).
We preprocess it with various scripts (Pig, Hive, MapReduce).
We insert that data into Hive tables for use by analysts and data scientists.
Data scientists write machine learning models (Spark).

We will be using Oozie as our processing scheduling system to do all of the above. In our architecture, we have one landing server, which sits outside the cluster as its front door. All source systems send files to us via scp, and we regularly (nightly, to keep it simple) push them to HDFS using the hadoop fs -copyFromLocal command. This script is cron-driven. It has very simple business logic: it runs every night at 8:00 PM and moves all the files it sees on the landing server into HDFS.

From there, Oozie takes over:

Oozie picks up the files and cleans them with a Pig script, replacing all the comma (,) delimiters with pipes (|). We will write the same code using both Pig and MapReduce.
We then push the processed files into a Hive table.
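To make the "what" part concrete, here is a rough sketch of what a minimal Oozie workflow definition for the Pig cleaning step just described might look like. All names, paths, and the exact schema version are assumptions for illustration; the node types used here (start, action, kill, end) are covered in the Node types section below.

    <workflow-app name="clean-landing-files-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="clean-with-pig"/>

        <!-- Run the Pig script that swaps comma delimiters for pipes -->
        <action name="clean-with-pig">
            <pig>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <script>clean_delimiters.pig</script>
                <param>INPUT=${inputDir}</param>
                <param>OUTPUT=${outputDir}</param>
            </pig>
            <ok to="end"/>
            <error to="fail"/>
        </action>

        <kill name="fail">
            <message>Pig cleaning failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
        </kill>

        <end name="end"/>
    </workflow-app>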
For a different source system, a database-backed MySQL table, we do a nightly Sqoop extract when the load on the database is light. We extract all the records generated on the previous business day, and the output of that is also inserted into Hive tables. Analysts and data scientists then write their magical Hive scripts and Spark machine learning models on top of those Hive tables. We will use Oozie to schedule all of these regular tasks.

Node types

A workflow is composed of nodes; the logical DAG of nodes represents the 'what' part of the work done by Oozie. Each node does its specified work and, on success, moves to one node or, on failure, moves to another node. For example, on success it goes to the OK node, and on failure it goes to the Kill node. Nodes in an Oozie workflow are of the following types.

Control flow nodes

These nodes are responsible for defining the start, the end, and the control flow of what to do inside the workflow. They can be one of the following:

Start node
End node
Kill node
Decision node
Fork and Join nodes

Action nodes

Action nodes represent the actual processing tasks, which are executed when called. These come in various types: for example, the Pig action, Hive action, and MapReduce action.

Summary

In this article, we looked at the concepts of Oozie in brief. We also learned about the types of nodes in Oozie.

Resources for Article:

Further resources on this subject:
Introduction to Hadoop [article]
Hadoop and HDInsight in a Heartbeat [article]
Cloudera Hadoop and HP Vertica [article]


Introducing Test-driven Machine Learning

Packt
14 Oct 2015
19 min read
In this article by Justin Bozonier, the author of the book Test Driven Machine Learning, we will see how to develop complex software (sometimes rooted in randomness) in small, controlled steps. It will also guide you on how to begin developing solutions to machine learning problems using test-driven development (from here on, written as TDD). Mastering TDD is not something the book will achieve. Instead, the book will help you begin your journey and expose you to guiding principles, which you can use to creatively solve challenges as you encounter them. We will answer the following three questions in this article:

What are TDD and behavior-driven development (BDD)?
How do we apply these concepts to machine learning, and to making inferences and predictions?
How does this work in practice?

(For more resources related to this topic, see here.)

After having answers to these questions, we will be ready to move on to tackling real problems. The book is about applying these concepts to solve machine learning problems, and this article is the largest theoretical explanation that we will have, with the remainder of the theory being described by example. Because of the focus on application, you will learn much more than just the theory of TDD and BDD. To read more about the theory and the ideals, search the Internet for articles written by the following people:

Kent Beck: The father of TDD
Dan North: The father of BDD
Martin Fowler: The father of refactoring; he has also created a large knowledge base on these topics
James Shore: One of the authors of The Art of Agile Development, he has a deep theoretical understanding of TDD and explains its practical value quite well

These concepts are incredibly simple and yet can take a lifetime to master. When applied to machine learning, we must find new ways to control and/or measure the random processes inherent in the algorithms. This will come up in this article as well as in others. In the next section, we will develop a foundation for TDD and begin to explore its application.

Test-driven development

Kent Beck wrote in his seminal book on the topic that TDD consists of only two specific rules:

Don't write a line of new code unless you first have a failing automated test.
Eliminate duplication.

This, as he noted, fairly quickly leads us to a mantra, really the mantra of TDD: Red, Green, Refactor. If this is a bit abstract, let me restate it: TDD is a software development process that enables a programmer to write code that specifies the intended behavior before writing any software to actually implement that behavior. The key value of TDD is that at each step of the way, you have working software as well as an itemized set of specifications. TDD is a software development process that requires the following:

Writing code to detect the intended behavioral change.
A rapid iteration cycle that produces working software after each iteration.
A clear definition of what a bug is: if a test is not failing but a bug is found, it is not a bug; it is a new feature.

Another point that Kent makes is that, ultimately, this technique is meant to reduce fear in the development process. Each test is a checkpoint along the way to your goal. If you stray too far from the path and wind up in trouble, you can simply delete any tests that shouldn't apply, and then work your code back to a state where the rest of your tests pass. There's a lot of trial and error inherent in TDD, but the same applies to machine learning.
The software that you design using TDD will also be modular enough to be able to have different components swapped in and out of your pipeline. You might be thinking that just thinking through test cases is equivalent to TDD. If you are like the most people, what you write is different from what you might verbally say, and very different from what you think. By writing the intent of our code before we write our code, it applies a pressure to our software design that prevents you from writing "just in case" code. By this I mean the code that we write just because we aren't sure if there will be a problem. Using TDD, we think of a test case, prove that it isn't supported currently, and then fix it. If we can't think of a test case, we don't add code. TDD can and does operate at many different levels of the software under development. Tests can be written against functions and methods, entire classes, programs, web services, neural networks, random forests, and whole machine learning pipelines. At each level, the tests are written from the perspective of the prospective client. How does this relate to machine learning? Lets take a step back and reframe what I just said. In the context of machine learning, tests can be written against functions, methods, classes, mathematical implementations, and the entire machine learning algorithms. TDD can even be used to explore technique and methods in a very directed and focused manner, much like you might use a REPL (an interactive shell where you can try out snippets of code) or the interactive (I)Python session. The TDD cycle The TDD cycle consists of writing a small function in the code that attempts to do something that we haven't programmed yet. These small test methods will have three main sections; the first section is where we set up our objects or test data; another section is where we invoke the code that we're testing; and the last section is where we validate that what happened is what we thought would happen. You will write all sorts of lazy code to get your tests to pass. If you are doing it right, then someone who is watching you should be appalled at your laziness and tiny steps. After the test goes green, you have an opportunity to refactor your code to your heart's content. In this context, refactor refers to changing how your code is written, but not changing how it behaves. Lets examine more deeply the three steps of TDD: Red, Green, and Refactor. Red First, create a failing test. Of course, this implies that you know what failure looks like in order to write the test. At the highest level in machine learning, this might be a baseline test where baseline is a better than random test. It might even be predicts random things, or even simpler always predicts the same thing. Is this terrible? Perhaps, it is to some who are enamored with the elegance and artistic beauty of his/her code. Is it a good place to start, though? Absolutely. A common issue that I have seen in machine learning is spending so much time up front, implementing the one true algorithm that hardly anything ever gets done. Getting to outperform pure randomness, though, is a useful change that can start making your business money as soon as it's deployed. Green After you have established a failing test, you can start working to get it green. If you start with a very high-level test, you may find that it helps to conceptually break that test up into multiple failing tests that are the lower-level concerns. 
I'll dive deeper into this later on in this article but for now, just know that you want to get your test passing as soon as possible; lie, cheat, and steal to get there. I promise that cheating actually makes your software's test suite that much stronger. Resist the urge to write the software in an ideal fashion. Just slap something together. You will be able to fix the issues in the next step. Refactor You got your test to pass through all the manners of hackery. Now, you get to refactor your code. Note that it is not to be interpreted loosely. Refactor specifically means to change your software without affecting its behavior. If you add the if clauses, or any other special handling, you are no longer refactoring. Then you write the software without tests. One way where you will know for sure that you are no longer refactoring is that you've broken previously passing tests. If this happens, we back up our changes until our tests pass again. It may not be obvious but this isn't all that it takes for you to know that you haven't changed behavior. Read Refactoring: Improving the Design of Existing Code, Martin Fowler for you to understand how much you should really care for refactoring. By the way of his illustration in this book, refactoring code becomes a set of forms and movements not unlike karate katas. This is a lot of general theory, but what does a test actually look like? How does this process flow in a real problem? Behavior-driven development BDD is the addition of business concerns to the technical concerns more typical of TDD. This came about as people became more experienced with TDD. They started noticing some patterns in the challenges that they were facing. One especially influential person, Dan North, proposed some specific language and structure to ease some of these issues. Some issues he noticed were the following: People had a hard time understanding what they should test next. Deciding what to name a test could be difficult. How much to test in a single test always seemed arbitrary. Now that we have some context, we can define what exactly BDD is. Simply put, it's about writing our tests in such a way that they will tell us the kind of behavior change they affect. A good litmus test might be asking oneself if the test you are writing would be worth explaining to a business stakeholder. How this solves the previous may not be completely obvious, but it may help to illustrate what this looks like in practice. It follows a structure of given, when, then. Committing to this style completely can require specific frameworks or a lot of testing ceremony. As a result, I loosely follow this in my tests as you will see soon. Here's a concrete example of a test description written in this style Given an empty dataset when the classifier is trained, it should throw an invalid operation exception. This sentence probably seems like a small enough unit of work to tackle, but notice that it's also a piece of work that any business user, who is familiar with the domain that you're working in, would understand and have an opinion on. You can read more about Dan North's point of view in this article on his website at dannorth.net/introducing-bdd/. The BDD adherents tend to use specialized tools to make the language and test result reports be as accessible to business stakeholders as possible. In my experience and from my discussions with others, this extra elegance is typically used so little that it doesn't seem worthwhile. 
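To make that given, when, then sentence concrete, here is a minimal sketch of what such a test could look like in Python. The Classifier class and the InvalidOperation exception are stand-ins invented purely for illustration (in real TDD they would not exist until the test demanded them); they are included only so the snippet is self-contained:

class InvalidOperation(Exception):
    """Stand-in exception used only for this illustration."""

class Classifier:
    """A hypothetical classifier, with just enough behavior for the test to exercise."""
    def train(self, dataset):
        if not dataset:
            raise InvalidOperation("Cannot train on an empty dataset.")

def given_an_empty_dataset_when_the_classifier_is_trained_test():
    # given
    classifier = Classifier()
    empty_dataset = []
    # when / then
    try:
        classifier.train(empty_dataset)
        assert False, "Then it should raise an invalid operation exception."
    except InvalidOperation:
        pass  # this is exactly the behavior the sentence specifies

Notice that the test name alone reads almost like the business-level sentence, which is the point of the BDD style.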
The approach you will learn in the book will take a simplicity first approach to make it as easy as possible for someone with zero background to get up to speed. With this in mind, lets work through an example. Our first test Let's start with an example of what a test looks like in Python. The main reason for using this is that while it is a bit of a pain to install a library, this library, in particular, will make everything that we do much simpler. The default unit test solution in Python requires a heavier set up. On top of this, by using nose, we can always mix in tests that use the built-in solution where we find that we need the extra features. First, install it like this: pip install nose If you have never used pip before, then it is time for you to know that it is a very simple way to install new Python libraries. Now, as a hello world style example, lets pretend that we're building a class that will guess a number using the previous guesses to inform it. This is the first simplest example to get us writing some code. We will use the TDD cycle that we discussed previously, and write our first test in painstaking detail. After we get through our first test and have something concrete to discuss, we will talk about the anatomy of the test that we wrote. First, we must write a failing test. The simplest failing test that I can think of is the following: def given_no_information_when_asked_to_guess_test(): number_guesser = NumberGuesser() result = number_guesser.guess() assert result is None, "Then it should provide no result." The context for assert is in the test name. Reading the test name and then the assert name should do a pretty good job of describing what is being tested. Notice that in my test, I instantiate a NumberGuesser object. You're not missing any steps; this class doesn't exist yet. This seems roughly like how I'd want to use it. So, it's a great place to start with. Since it doesn't exist, wouldn't you expect this test to fail? Lets test this hypothesis. To run the test, first make sure your test file is saved so that it ends in _tests.py. From the directory with the previous code, just run the following: nosetests When I do this, I get the following result: Here's a lot going on here, but the most informative part is near the end. The message is saying that NumberGuesser does not exist yet, which is exactly what I expected since we haven't actually written the code yet. Throughout the book, we'll reduce the detail of the stack traces that we show. For now, we'll keep things detailed to make sure that we're on the same page. At this point, we're in a red state in the TDD cycle. Use the following steps to create our first successful test: Now, create the following class in a file named NumberGuesser.py: class NumberGuesser: """Guesses numbers based on the history of your input"" Import the new class at the top of my test file with a simple import NumberGuesser statement. I rerun nosetests, and get the following: TypeError: 'module' object is not callable Oh whoops! I guess that's not the right way to import the class. This is another very tiny step, but what is important is that we are making forward progress through constant communication with our tests. We are going through extreme detail because I can't stress this point enough; bear with me for the time being. 
Change the import statement to the following: from NumberGuesser import NumberGuesser Rerun nosetests and you will see the following: AttributeError: NumberGuesser instance has no attribute 'guess' The error message has changed, and is leading to the next thing that needs to be changed. From here, just implement what we think we need for the test to pass: class NumberGuesser: """Guesses numbers based on the history of your input""" def guess(self): return None On rerunning the nosetests, we'll get the following result: That's it! Our first successful test! Some of these steps seem so tiny so as to not being worthwhile. Indeed, overtime, you may decide that you prefer to work on a different level of detail. For the sake of argument, we'll be keeping our steps pretty small if only to illustrate just how much TDD keeps us on track and guides us on what to do next. We all know how to write the code in very large, uncontrolled steps. Learning to code surgically requires intentional practice, and is worth doing explicitly. Lets take a step back and look at what this first round of testing took. Anatomy of a test Starting from a higher level, notice how I had a dialog with Python. I just wrote the test and Python complained that the class that I was testing didn't exist. Next, I created the class, but then Python complained that I didn't import it correctly. So then, I imported it correctly, and Python complained that my guess method didn't exist. In response, I implemented the way that my test expected, and Python stopped complaining. This is the spirit of TDD. You have a conversation between you and your system. You can work in steps as little or as large as you're comfortable with. What I did previously could've been entirely skipped over, and the Python class could have been written and imported correctly the first time. The longer you go without talking to the system, the more likely you are to stray from the path to getting things working as simply as possible. Lets zoom in a little deeper and dissect this simple test to see what makes it tick. Here is the same test, but I've commented it, and broken it into sections that you will see recurring in every test that you write: def given_no_information_when_asked_to_guess_test(): # given number_guesser = NumberGuesser() # when guessed_number = number_guesser.guess() # then assert guessed_number is None, 'there should be no guess.' Given This section sets up the context for the test. In the previous test, you acquired that I didn't provide any prior information to the object. In many of our machine learning tests, this will be the most complex portion of our test. We will be importing certain sets of data, sometimes making a few specific issues in the data and testing our software to handle the details that we would expect. When you think about this section of your tests, try to frame it as Given this scenario… In our test, we might say Given no prior information for NumberGuesser… When This should be one of the simplest aspects of our test. Once you've set up the context, there should be a simple action that triggers the behavior that you want to test. When you think about this section of your tests, try to frame it as When this happens… In our test we might say When NumberGuesser guesses a number… Then This section of our test will check on the state of our variables and any return result if applicable. 
Again, this section should also be fairly straight-forward, as there should be only a single action that causes a change into your object under the test. The reason for this is that if it takes two actions to form a test, then it is very likely that we will just want to combine the two into a single action that we can describe in terms that are meaningful in our domain. A key example maybe loading the training data from a file and training a classifier. If we find ourselves doing this a lot, then why not just create a method that loads data from a file for us? In the book, you will find examples where we'll have the helper functions help us determine whether our results have changed in certain ways. Typically, we should view these helper functions as code smells. Remember that our tests are the first applications of our software. Anything that we have to build in addition to our code, to understand the results, is something that we should probably (there are exceptions to every rule) just include in the code we are testing. Given, When, Then is not a strong requirement of TDD, because our previous definition of TDD only consisted of two things (all that the code requires is a failing test first and an eliminate duplication). It's a small thing to be passionate about and if it doesn't speak to you, just translate this back into Arrange, act, assert in your head. At the very least, consider it as well as why these specific, very deliberate words are used. Applied to machine learning At this point, you maybe wondering how TDD will be used in machine learning, and whether we use it on regression or classification problems. In every machine learning algorithm, there exists a way to quantify the quality of what you're doing. In the linear regression; it's your adjusted R2 value; in classification problems, it's an ROC curve (and the area beneath it) or a confusion matrix, and more. All of these are testable quantities. Of course, none of these quantities have a built-in way of saying that the algorithm is good enough. We can get around this by starting our work on every problem by first building up a completely naïve and ignorant algorithm. The scores that we get for this will basically represent a plain, old, and random chance. Once we have built an algorithm that can beat our random chance scores, we just start iterating, attempting to beat the next highest score that we achieve. Benchmarking algorithms are an entire field onto their own right that can be delved in more deeply. In the book, we will implement a naïve algorithm to get a random chance score, and we will build up a small test suite that we can then use to pit this model against another. This will allow us to have a conversation with our machine learning models in the same manner as we had with Python earlier. For a professional machine learning developer, it's quite likely that an ideal metric to test is a profitability model that compares risk (monetary exposure) to expected value (profit). This can help us keep a balanced view of how much error and what kind of error we can tolerate. In machine learning, we will never have a perfect model, and we can search for the rest of our lives for the best model. By finding a way to work your financial assumptions into the model, we will have an improved ability to decide between the competing models. Summary In this article, you were introduced to TDD as well as BDD. With these concepts introduced, you have a basic foundation with which to approach machine learning. 
We saw that specifying behavior in the form of sentences makes for an easier-to-read set of specifications for your software. Building off that foundation, we started to delve into testing at a higher level. We did this by establishing concepts that we can use to quantify classifiers: the ROC curve and the AUC metric. Now that we've seen that different models can be quantified, it follows that they can be compared. Putting all of this together, we have everything we need to explore machine learning with a test-driven methodology. Resources for Article: Further resources on this subject: Optimization in Python [article] How to do Machine Learning with Python [article] Modeling complex functions with artificial neural networks [article]
Transactions and Operators

Packt
13 Oct 2015
14 min read
In this article by Emilien Kenler and Federico Razzoli, authors of the book MariaDB Essentials, we will look briefly at transactions and operators. (For more resources related to this topic, see here.) Understanding transactions A transaction is a sequence of SQL statements that are grouped into a single logical operation. Its purpose is to guarantee the integrity of data. If a transaction fails, no change will be applied to the databases. If a transaction succeeds, all the statements will succeed. Take a look at the following example:

START TRANSACTION;
SELECT quantity FROM product WHERE id = 42;
UPDATE product SET quantity = quantity - 10 WHERE id = 42;
UPDATE customer SET money = money - (SELECT price FROM product WHERE id = 42) WHERE id = 512;
INSERT INTO product_order (product_id, quantity, customer_id) VALUES (42, 10, 512);
COMMIT;

We haven't yet discussed some of the statements used in this example. However, they are not important for understanding transactions. This sequence of statements occurs when a customer (whose id is 512) orders a product (whose id is 42). As a consequence, we need to execute the following suboperations in our database: check whether the desired quantity of products is available (if not, we should not proceed); decrease the available quantity of items for the product that is being bought; decrease the amount of money in the online account of our customer; and register the order so that the product is delivered to our customer. These suboperations form a more complex operation. When a session is executing this operation, we do not want other connections to interfere. Consider the following scenario: Connection A checks how many products with the ID 42 are available. Only one is available, but it is enough. Immediately after, connection B checks the availability of the same product. It finds that one is available. Connection A decreases the quantity of the product. Now, it is 0. Connection B decreases the same number. Now, it is -1. Both connections create an order. Two people will pay for the same product; however, only one is available. This is something we definitely want to avoid. However, there is another situation that we want to avoid. Imagine that the server crashes immediately after the customer's money is deducted. The order will not be written to the database, so the customer will end up paying for something he will not receive. Fortunately, transactions prevent both these situations. They protect our database writes in two ways: During a transaction, relevant data is locked or copied. In both these cases, two connections will not be able to modify the same rows at the same time. The writes will not be made effective until the COMMIT command is issued. This means that if the server crashes during the transaction, all the suboperations will be rolled back. We will not have inconsistent data (such as a payment for a product that will not be delivered). In this example, the transaction starts when we issue the START TRANSACTION command. Then, any number of operations can be performed. The COMMIT command makes the changes effective. This does not mean that if a statement fails with an error, the transaction is always aborted. In many cases, the application will receive an error and will be free to decide whether the transaction should be aborted or not. To abort the current transaction, an application can execute the ROLLBACK command. A transaction can consist of only one statement.
This perfectly makes sense because the server could crash in the middle of the statement's execution. The autocommit mode In many cases, we don't want to group multiple statements in a transaction. When a transaction consists of only one statement, sending the START TRANSACTION and COMMIT statements can be annoying. For this reason, MariaDB has the autocommit mode. By default, the autocommit mode is ON. Unless a START TRANSACTION command is explicitly used, the autocommit mode causes an implicit commit after each statement. Thus, every statement is executed in a separated transaction by default. When the autocommit mode is OFF, a new transaction implicitly starts after each commit, and the COMMIT command needs be issued explicitly. To turn the autocommit ON or OFF, we can use the @@autocommit server variable as follows: follows: MariaDB [mwa]> SET @@autocommit = OFF; Query OK, 0 rows affected (0.00 sec) MariaDB [mwa]> SELECT @@autocommit; +--------------+ | @@autocommit | +--------------+ | 0 | +--------------+ 1 row in set (0.00 sec) Transaction's limitations in MariaDB Transaction handling is not implemented in the core of MariaDB; instead, it is left to the storage engines. Many storage engines, such as MyISAM or MEMORY, do not implement it at all. Some of the transactional storage engines are: InnoDB; XtraDB; TokuDB. In a sense, Aria tables are partially transactional. Although Aria ignores commands such as START TRANSACTION, COMMIT, and ROLLBACK, each statement is somewhat a transaction. In fact, if it writes, modifies, or deletes multiple rows, the operation completely succeeds or fails, which is similar to a transaction. Only statements that modify data can be used in a transaction. Statements that modify a table structure (such as ALTER TABLE) implicitly commit the current transaction. Sometimes, we may not be sure if a transaction is active or not. Usually, this happens because we are not sure if autocommit is set to ON or not or because we are not sure if the latest statement implicitly committed a transaction. In these cases, the @in_transaction variable can help us. Its value is 1 if a transaction is active and 0 if it is not. Here is an example: MariaDB [mwa]> START TRANSACTION; Query OK, 0 rows affected (0.00 sec) MariaDB [mwa]> SELECT @@in_transaction; +------------------+ | @@in_transaction | +------------------+ | 1 | +------------------+ 1 row in set (0.00 sec) MariaDB [mwa]> DROP TABLE IF EXISTS t; Query OK, 0 rows affected, 1 warning (0.00 sec) MariaDB [mwa]> SELECT @@in_transaction; +------------------+ | @@in_transaction | +------------------+ | 0 | +------------------+ 1 row in set (0.00 sec) InnoDB is optimized to execute a huge number of short transactions. If our databases are busy and performance is important to us, we should try to avoid big transactions in terms of the number of statements and execution time. This is particularly true if we have several concurrent connections that read the same tables. Working with operators In our examples, we have used several operators, such as equals (=), less-than and greater-than (<, >), and so on. Now, it is time to discuss operators in general and list the most important ones. In general, an operator is a sign that takes one or more operands and returns a result. Several groups of operators exist in MariaDB. In this article, we will discuss the main types: Comparison operators; String operators; Logical operators; Arithmetic operators. 
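Before going through each group in detail, here is a quick, hedged sketch showing how several of these operator types can appear together in a single query; the name column is assumed for illustration and is not part of the product table used earlier in this article:

SELECT id, quantity, price
FROM product
WHERE price BETWEEN 100 AND 200      -- comparison operators
  AND name LIKE 'red%'               -- a string operator
  AND NOT quantity = 0               -- a logical operator wrapping a comparison
  AND price * quantity >= 500;       -- arithmetic operators

Each of these families is described in the sections that follow.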
Comparison operators A comparison operator checks whether there is a certain relation between its operands. If the relationship exists, the operator returns 1; otherwise, it returns 0. For example, let's take the equality operator, which is probably the most used one:

1 = 1   -- returns 1: the equality relationship exists
1 = 0   -- returns 0: no equality relationship here

In MariaDB, 1 and 0 are used in many contexts to indicate whether something is true or false. In fact, MariaDB does not have a Boolean data type, so TRUE and FALSE are merely used as aliases for 1 and 0:

TRUE = 1      -- returns 1
FALSE = 0     -- returns 1
TRUE = FALSE  -- returns 0

In a WHERE clause, a result of 0 or NULL prevents a row from being shown. All numeric results other than 0, including negative numbers, are regarded as true in this context. Non-numeric values other than NULL need to be converted to numbers in order to be evaluated by the WHERE clause. Non-numeric strings are converted to 0, whereas numeric strings are treated as numbers. Dates are converted to nonzero numbers. Consider the following example:

WHERE 1   -- is redundant; it shows all the rows
WHERE 0   -- prevents all the rows from being shown

Now, let's take a look at the following MariaDB comparison operators:

| Operator | Description | Example |
| --- | --- | --- |
| = | Equality | A = B |
| != | Inequality | A != B |
| <> | Synonym for != | A <> B |
| < | Less than | A < B |
| > | Greater than | A > B |
| <= | Less than or equal to | A <= B |
| >= | Greater than or equal to | A >= B |
| IS NULL | The operand is NULL | A IS NULL |
| IS NOT NULL | The operand is not NULL | A IS NOT NULL |
| <=> | The operands are equal, or both are NULL | A <=> B |
| BETWEEN ... AND | The left operand is within a range of values | A BETWEEN B AND C |
| NOT BETWEEN ... AND | The left operand is outside the specified range | A NOT BETWEEN B AND C |
| IN | The left operand is one of the items in a given list | A IN (B, C, D) |
| NOT IN | The left operand is not in the given list | A NOT IN (B, C, D) |

Here are a couple of examples:

SELECT id FROM product WHERE price BETWEEN 100 AND 200;
DELETE FROM product WHERE id IN (100, 101, 102);

Special attention should be paid to NULL values. Almost all the preceding operators return NULL if any of their operands is NULL. The reason is quite clear: as NULL represents an unknown value, any operation involving a NULL operand returns an unknown result. However, there are some operators specifically designed to work with NULL values. IS NULL and IS NOT NULL check whether the operand is NULL. The <=> operator is a shortcut for the following code:

a = b OR (a IS NULL AND b IS NULL)

String operators MariaDB supports certain comparison operators that are specifically designed to work with string values. This does not mean that other operators do not work well with strings. For example, A = B works perfectly if A and B are strings. However, some particular comparisons only make sense with text values. Let's take a look at them. The LIKE operator and its variants This operator is often used to check whether a string starts with a given sequence of characters, if it ends with that sequence, or if it contains the sequence. More generally, LIKE checks whether a string follows a given pattern.
Its syntax is: <string_value> LIKE <pattern> The pattern is a string that can contain the following wildcard characters: _ (underscore) means: This specifies any character %: This denotes any sequence of 0 or more characters There is also a way to include these characters without their special meaning: the _ and % sequences represent the a_ and a% characters respectively. For example, take a look at the following expressions: my_text LIKE 'h_' my_text LIKE 'h%' The first expression returns 1 for 'hi', 'ha', or 'ho', but not for 'hey'. The second expression returns 1 for all these strings, including 'hey'. By default, LIKE is case insensitive, meaning that 'abc' LIKE 'ABC' returns 1. Thus, it can be used to perform a case insensitive equality check. To make LIKE case sensitive, the following BINARY keyword can be used: my_text LIKE BINARY your_text The complement of LIKE is NOT LIKE, as shown in the following code: <string_value> NOT LIKE <pattern> Here are the most common uses for LIKE: my_text LIKE 'my%' -- does my_text start with 'my'? my_text LIKE '%my' -- does my_text end with 'my'? my_text LIKE '%my%' -- does my_text contain 'my'? More complex uses are possible for LIKE. For example, the following expression can be used to check whether mail is a valid e-mail address: mail LIKE '_%@_%.__%' The preceding code snippet checks whether mail contains at least one character, a '@' character, at least one character, a dot, at least two characters in this order. In most cases, an invalid e-mail address will not pass this test. Using regular expressions with the REGEXP operator and its variants Regular expressions are string patterns that contain a meta character with special meanings in order to perform match operations and determine whether a given string matches the given pattern or not. The REGEXP operator is somewhat similar to LIKE. It checks whether a string matches a given pattern. However, REGEXP uses regular expressions with the syntax defined by the POSIX standard. Basically, this means that: Many developers, but not all, already know their syntax REGEXP uses a very expressive syntax, so the patterns can be much more complex and detailed REGEXP is much slower than LIKE; this should be preferred when possible The regular expressions syntax is a complex topic, and it cannot be covered in this article. Developers can learn about regular expressions at www.regular-expressions.info. The complement of REGEXP is NOT REGEXP. Logical operators Logical operators can be used to combine truth expressions that form a compound expression that can be true, false, or NULL. Depending on the truth values of its operands, a logical operator can return 1 or 0. MariaDB supports the following logical operators: NOT; AND; OR; XOR The NOT operator NOT is the only logical operator that takes one operand. It inverts its truth value. If the operand is true, NOT returns 0, and if the operand is false, NOT returns 1. If the operand is NULL, NOT returns NULL. Here is an example: NOT 1 -- returns 0 NOT 0 -- returns 1 NOT 1 = 1 -- returns 0 NOT 1 = NULL -- returns NULL NOT 1 <=> NULL -- returns 0 The AND operator AND returns 1 if both its operands are true and 0 in all other cases. Here is an example: 1 AND 1 -- returns 1 0 AND 1 -- returns 0 0 AND 0 -- returns 0 The OR operator OR returns 1 if at least one of its operators is true or 0 if both the operators are false. Here is an example: 1 OR 1 -- returns 1 0 OR 1 -- returns 1 0 OR 0 -- returns 0 The XOR operator XOR stands for eXclusive OR. 
It is the least used logical operator. It returns 1 if exactly one of its operands is true, and 0 if both the operands are true or both are false. Take a look at the following example:

1 XOR 1  -- returns 0
1 XOR 0  -- returns 1
0 XOR 1  -- returns 1
0 XOR 0  -- returns 0

A XOR B is the equivalent of the following expression: (A OR B) AND NOT (A AND B) Or: (NOT A AND B) OR (A AND NOT B) Arithmetic operators MariaDB supports the operators that are necessary to execute all the basic arithmetic operations. The supported arithmetic operators are + for addition, - for subtraction, * for multiplication, and / for division. Depending on the MariaDB configuration, remember that a division by 0 raises an error or returns NULL. In addition, two more operators are useful for divisions: DIV returns the integer part of a division, without any decimal part or remainder; MOD (or %) returns the remainder of a division. Here is an example:

MariaDB [(none)]> SELECT 20 DIV 3 AS int_part, 20 MOD 3 AS modulus;
+----------+---------+
| int_part | modulus |
+----------+---------+
|        6 |       2 |
+----------+---------+
1 row in set (0.00 sec)

Operators precedence MariaDB does not blindly evaluate an expression from left to right. Every operator has a given precedence. An operator that is evaluated before another one is said to have a higher precedence. In general, arithmetic and string operators have a higher priority than logical operators. The precedence of arithmetic operators reflects their precedence in common mathematical expressions. It is very important to remember the precedence of logical operators (from the highest to the lowest): NOT, AND, XOR, OR. MariaDB supports many operators, and we did not discuss all of them. Also, the exact precedence can vary slightly depending on the MariaDB configuration. The complete precedence can be found in the MariaDB KnowledgeBase, at https://mariadb.com/kb/en/mariadb/documentation/functions-and-operators/operator-precedence/. Parentheses can be used to force MariaDB to follow a certain order. They are also useful when we do not remember the exact precedence of the operators that we will use, as shown in the following code:

(NOT (a AND b)) OR c OR d

Summary In this article, you learned the basics of transactions and operators. Resources for Article: Further resources on this subject: Set Up MariaDB [Article] Installing MariaDB on Windows and Mac OS X [Article] Building a Web Application with PHP and MariaDB – Introduction to caching [Article]
Securing Your Data

Packt
12 Oct 2015
6 min read
In this article by Tyson Cadenhead, author of Socket.IO Cookbook, we will explore several topics related to security in Socket.IO applications. These topics will cover the gambit, from authentication and validation to how to use the wss:// protocol for secure WebSockets. As the WebSocket protocol opens innumerable opportunities to communicate more directly between the client and the server, people often wonder if Socket.IO is actually as secure as something such as the HTTP protocol. The answer to this question is that it depends entirely on how you implement it. WebSockets can easily be locked down to prevent malicious or accidental security holes, but as with any API interface, your security is only as tight as your weakest link. In this article, we will cover the following topics: Locking down the HTTP referrer Using secure WebSockets (For more resources related to this topic, see here.) Locking down the HTTP referrer Socket.IO is really good at getting around cross-domain issues. You can easily include the Socket.IO script from a different domain on your page, and it will just work as you may expect it to. There are some instances where you may not want your Socket.IO events to be available on every other domain. Not to worry! We can easily whitelist only the http referrers that we want so that some domains will be allowed to connect and other domains won't. How To Do It… To lock down the HTTP referrer and only allow events to whitelisted domains, follow these steps: Create two different servers that can connect to our Socket.IO instance. We will let one server listen on port 5000 and the second server listen on port 5001: var express = require('express'), app = express(), http = require('http'), socketIO = require('socket.io'), server, server2, io; app.get('/', function (req, res) { res.sendFile(__dirname + '/index.html'); }); server = http.Server(app); server.listen(5000); server2 = http.Server(app); server2.listen(5001); io = socketIO(server); When the connection is established, check the referrer in the headers. If it is a referrer that we want to give access to, we can let our connection perform its tasks and build up events as normal. If a blacklisted referrer, such as the one on port 5001 that we created, attempts a connection, we can politely decline and perhaps throw an error message back to the client, as shown in the following code: io.on('connection', function (socket) { switch (socket.request.headers.referer) { case 'http://localhost:5000/': socket.emit('permission.message', 'Okay, you're cool.'); break; default: returnsocket.emit('permission.message', 'Who invited you to this party?'); break; } }); On the client side, we can listen to the response from the server and react as appropriate using the following code: socket.on('permission.message', function (data) { document.querySelector('h1').innerHTML = data; }); How It Works… The referrer is always available in the socket.request.headers object of every socket, so we will be able to inspect it there to check whether it was a trusted source. In our case, we will use a switch statement to whitelist our domain on port 5000, but we could really use any mechanism at our disposal to perform the task. For example, if we need to dynamically whitelist domains, we can store a list of them in our database and search for it when the connection is established. Using secure WebSockets WebSocket communications can either take place over the ws:// protocol or the wss:// protocol. 
In similar terms, they can be thought of as the HTTP and HTTPS protocols in the sense that one is secure and one isn't. Secure WebSockets are encrypted by the transport layer, so they are safer to use when you handle sensitive data. In this recipe, you will learn how to force our Socket.IO communications to happen over the wss:// protocol for an extra layer of encryption. Getting Ready… In this recipe, we will need to create a self-signing certificate so that we can serve our app locally over the HTTPS protocol. For this, we will need an npm package called Pem. This allows you to create a self-signed certificate that you can provide to your server. Of course, in a real production environment, we would want a true SSL certificate instead of a self-signed one. To install Pem, simply call npm install pem –save. As our certificate is self-signed, you will probably see something similar to the following screenshot when you navigate to your secure server: Just take a chance by clicking on the Proceed to localhost link. You'll see your application load using the HTTPS protocol. How To Do It… To use the secure wss:// protocol, follow these steps: First, create a secure server using the built-in node HTTPS package. We can create a self-signed certificate with the pem package so that we can serve our application over HTTPS instead of HTTP, as shown in the following code: var https = require('https'), pem = require('pem'), express = require('express'), app = express(), socketIO = require('socket.io'); // Create a self-signed certificate with pem pem.createCertificate({ days: 1, selfSigned: true }, function (err, keys) { app.get('/', function(req, res){ res.sendFile(__dirname + '/index.html'); }); // Create an https server with the certificate and key from pem var server = https.createServer({ key: keys.serviceKey, cert: keys.certificate }, app).listen(5000); vario = socketIO(server); io.on('connection', function (socket) { var protocol = 'ws://'; // Check the handshake to determine if it was secure or not if (socket.handshake.secure) { protocol = 'wss://'; } socket.emit('hello.client', { message: 'This is a message from the server. It was sent using the ' + protocol + ' protocol' }); }); }); In your client-side JavaScript, specify secure: true when you initialize your WebSocket as follows: var socket = io('//localhost:5000', { secure: true }); socket.on('hello.client', function (data) { console.log(data); }); Now, start your server and navigate to https://localhost:5000. Proceed to this page. You should see a message in your browser developer tools that shows, This is a message from the server. It was sent using the wss:// protocol. How It Works… The protocol of our WebSocket is actually set automatically based on the protocol of the page that it sits on. This means that a page that is served over the HTTP protocol will send the WebSocket communications over ws:// by default, and a page that is served by HTTPS will default to using the wss:// protocol. However, by setting the secure option to true, we told the WebSocket to always serve through wss:// no matter what. Summary In this article, we gave you an overview of the topics related to security in Socket.IO applications. Resources for Article: Further resources on this subject: Using Socket.IO and Express together[article] Adding Real-time Functionality Using Socket.io[article] Welcome to JavaScript in the full stack [article]
Basics of Jupyter Notebook and Python

Packt Editorial Staff
11 Oct 2015
28 min read
In this article by Cyrille Rossant, coming from his book, Learning IPython for Interactive Computing and Data Visualization - Second Edition, we will see how to use IPython console, Jupyter Notebook, and we will go through the basics of Python. Originally, IPython provided an enhanced command-line console to run Python code interactively. The Jupyter Notebook is a more recent and more sophisticated alternative to the console. Today, both tools are available, and we recommend that you learn to use both. [box type="note" align="alignleft" class="" width=""]The first chapter of the book, Chapter 1, Getting Started with IPython, contains all installation instructions. The main step is to download and install the free Anaconda distribution at https://www.continuum.io/downloads (the version of Python 3 64-bit for your operating system).[/box] Launching the IPython console To run the IPython console, type ipython in an OS terminal. There, you can write Python commands and see the results instantly. Here is a screenshot: IPython console The IPython console is most convenient when you have a command-line-based workflow and you want to execute some quick Python commands. You can exit the IPython console by typing exit. [box type="note" align="alignleft" class="" width=""]Let's mention the Qt console, which is similar to the IPython console but offers additional features such as multiline editing, enhanced tab completion, image support, and so on. The Qt console can also be integrated within a graphical application written with Python and Qt. See http://jupyter.org/qtconsole/stable/ for more information.[/box] Launching the Jupyter Notebook To run the Jupyter Notebook, open an OS terminal, go to ~/minibook/ (or into the directory where you've downloaded the book's notebooks), and type jupyter notebook. This will start the Jupyter server and open a new window in your browser (if that's not the case, go to the following URL: http://localhost:8888). Here is a screenshot of Jupyter's entry point, the Notebook dashboard: The Notebook dashboard [box type="note" align="alignleft" class="" width=""]At the time of writing, the following browsers are officially supported: Chrome 13 and greater; Safari 5 and greater; and Firefox 6 or greater. Other browsers may work also. Your mileage may vary.[/box] The Notebook is most convenient when you start a complex analysis project that will involve a substantial amount of interactive experimentation with your code. Other common use-cases include keeping track of your interactive session (like a lab notebook), or writing technical documents that involve code, equations, and figures. In the rest of this section, we will focus on the Notebook interface. [box type="note" align="alignleft" class="" width=""]Closing the Notebook server To close the Notebook server, go to the OS terminal where you launched the server from, and press Ctrl + C. You may need to confirm with y.[/box] The Notebook dashboard The dashboard contains several tabs which are as follows: Files: shows all files and notebooks in the current directory Running: shows all kernels currently running on your computer Clusters: lets you launch kernels for parallel computing A notebook is an interactive document containing code, text, and other elements. A notebook is saved in a file with the .ipynb extension. This file is a plain text file storing a JSON data structure. A kernel is a process running an interactive session. When using IPython, this kernel is a Python process. 
There are kernels in many languages other than Python. [box type="note" align="alignleft" class="" width=""]We follow the convention to use the term notebook for a file, and Notebook for the application and the web interface.[/box] In Jupyter, notebooks and kernels are strongly separated. A notebook is a file, whereas a kernel is a process. The kernel receives snippets of code from the Notebook interface, executes them, and sends the outputs and possible errors back to the Notebook interface. Thus, in general, the kernel has no notion of the Notebook. A notebook is persistent (it's a file), whereas a kernel may be closed at the end of an interactive session and it is therefore not persistent. When a notebook is re-opened, it needs to be re-executed. In general, no more than one Notebook interface can be connected to a given kernel. However, several IPython consoles can be connected to a given kernel. The Notebook user interface To create a new notebook, click on the New button, and select Notebook (Python 3). A new browser tab opens and shows the Notebook interface as follows: A new notebook Here are the main components of the interface, from top to bottom: The notebook name, which you can change by clicking on it. This is also the name of the .ipynb file. The Menu bar gives you access to several actions pertaining to either the notebook or the kernel. To the right of the menu bar is the Kernel name. You can change the kernel language of your notebook from the Kernel menu. The Toolbar contains icons for common actions. In particular, the dropdown menu showing Code lets you change the type of a cell. Following is the main component of the UI: the actual Notebook. It consists of a linear list of cells. We will detail the structure of a cell in the following sections. Structure of a notebook cell There are two main types of cells: Markdown cells and code cells, and they are described as follows: A Markdown cell contains rich text. In addition to classic formatting options like bold or italics, we can add links, images, HTML elements, LaTeX mathematical equations, and more. A code cell contains code to be executed by the kernel. The programming language corresponds to the kernel's language. We will only use Python in this book, but you can use many other languages. You can change the type of a cell by first clicking on a cell to select it, and then choosing the cell's type in the toolbar's dropdown menu showing Markdown or Code. Markdown cells Here is a screenshot of a Markdown cell: A Markdown cell The top panel shows the cell in edit mode, while the bottom one shows it in render mode. The edit mode lets you edit the text, while the render mode lets you display the rendered cell. We will explain the differences between these modes in greater detail in the following section. Code cells Here is a screenshot of a complex code cell: Structure of a code cell This code cell contains several parts, as follows: The Prompt number shows the cell's number. This number increases every time you run the cell. Since you can run cells of a notebook out of order, nothing guarantees that code numbers are linearly increasing in a given notebook. The Input area contains a multiline text editor that lets you write one or several lines of code with syntax highlighting. The Widget area may contain graphical controls; here, it displays a slider. 
The Output area can contain multiple outputs, here: Standard output (text in black) Error output (text with a red background) Rich output (an HTML table and an image here) The Notebook modal interface The Notebook implements a modal interface similar to some text editors such as vim. Mastering this interface may represent a small learning curve for some users. Use the edit mode to write code (the selected cell has a green border, and a pen icon appears at the top right of the interface). Click inside a cell to enable the edit mode for this cell (you need to double-click with Markdown cells). Use the command mode to operate on cells (the selected cell has a gray border, and there is no pen icon). Click outside the text area of a cell to enable the command mode (you can also press the Esc key). Keyboard shortcuts are available in the Notebook interface. Type h to show them. We review here the most common ones (for Windows and Linux; shortcuts for Mac OS X may be slightly different). Keyboard shortcuts available in both modes Here are a few keyboard shortcuts that are always available when a cell is selected: Ctrl + Enter: run the cell Shift + Enter: run the cell and select the cell below Alt + Enter: run the cell and insert a new cell below Ctrl + S: save the notebook Keyboard shortcuts available in the edit mode In the edit mode, you can type code as usual, and you have access to the following keyboard shortcuts: Esc: switch to command mode Ctrl + Shift + -: split the cell Keyboard shortcuts available in the command mode In the command mode, keystrokes are bound to cell operations. Don't write code in command mode or unexpected things will happen! For example, typing dd in command mode will delete the selected cell! Here are some keyboard shortcuts available in command mode: Enter: switch to edit mode Up or k: select the previous cell Down or j: select the next cell y / m: change the cell type to code cell/Markdown cell a / b: insert a new cell above/below the current cell x / c / v: cut/copy/paste the current cell dd: delete the current cell z: undo the last delete operation Shift + =: merge the cell below h: display the help menu with the list of keyboard shortcuts Spending some time learning these shortcuts is highly recommended. References Here are a few references: Main documentation of Jupyter at http://jupyter.readthedocs.org/en/latest/ Jupyter Notebook interface explained at http://jupyter-notebook.readthedocs.org/en/latest/notebook.html A crash course on Python If you don't know Python, read this section to learn the fundamentals. Python is a very accessible language and is even taught to school children. If you have ever programmed, it will only take you a few minutes to learn the basics. Hello world Open a new notebook and type the following in the first cell: In [1]: print("Hello world!") Out[1]: Hello world! Here is a screenshot: "Hello world" in the Notebook [box type="note" align="alignleft" class="" width=""]Prompt string Note that the convention chosen in this article is to show Python code (also called the input) prefixed with In [x]: (which shouldn't be typed). This is the standard IPython prompt. Here, you should just type print("Hello world!") and then press Shift + Enter.[/box] Congratulations! You are now a Python programmer. Variables Let's use Python as a calculator. In [2]: 2 * 2 Out[2]: 4 Here, 2 * 2 is an expression statement. This operation is performed, the result is returned, and IPython displays it in the notebook cell's output. 
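One handy IPython detail worth a quick aside here: the underscore variable _ holds the last displayed output, so you can keep computing from the previous result without naming it:

In [ ]: 2 * 2
Out[ ]: 4

In [ ]: _ + 1   # _ refers to the previous output, 4
Out[ ]: 5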
[box type="note" align="alignleft" class="" width=""]Division In Python 3, 3 / 2 returns 1.5 (floating-point division), whereas it returns 1 in Python 2 (integer division). This can be source of errors when porting Python 2 code to Python 3. It is recommended to always use the explicit 3.0 / 2.0 for floating-point division (by using floating-point numbers) and 3 // 2 for integer division. Both syntaxes work in Python 2 and Python 3. See http://python3porting.com/differences.html#integer-division for more details.[/box] Other built-in mathematical operators include +, -, ** for the exponentiation, and others. You will find more details at https://docs.python.org/3/reference/expressions.html#the-power-operator. Variables form a fundamental concept of any programming language. A variable has a name and a value. Here is how to create a new variable in Python: In [3]: a = 2 And here is how to use an existing variable: In [4]: a * 3 Out[4]: 6 Several variables can be defined at once (this is called unpacking): In [5]: a, b = 2, 6 There are different types of variables. Here, we have used a number (more precisely, an integer). Other important types include floating-point numbers to represent real numbers, strings to represent text, and booleans to represent True/False values. Here are a few examples: In [6]: somefloat = 3.1415 sometext = 'pi is about' # You can also use double quotes. print(sometext, somefloat) # Display several variables. Out[6]: pi is about 3.1415 Note how we used the # character to write comments. Whereas Python discards the comments completely, adding comments in the code is important when the code is to be read by other humans (including yourself in the future). String escaping String escaping refers to the ability to insert special characters in a string. For example, how can you insert ' and ", given that these characters are used to delimit a string in Python code? The backslash is the go-to escape character in Python (and in many other languages too). Here are a few examples: In [7]: print("Hello "world"") print("A list:n* item 1n* item 2") print("C:pathonwindows") print(r"C:pathonwindows") Out[7]: Hello "world" A list: * item 1 * item 2 C:pathonwindows C:pathonwindows The special character n is the new line (or line feed) character. To insert a backslash, you need to escape it, which explains why it needs to be doubled as . You can also disable escaping by using raw literals with a r prefix before the string, like in the last example above. In this case, backslashes are considered as normal characters. This is convenient when writing Windows paths, since Windows uses backslash separators instead of forward slashes like on Unix systems. A very common error on Windows is forgetting to escape backslashes in paths: writing "C:path" may lead to subtle errors. You will find the list of special characters in Python at https://docs.python.org/3.4/reference/lexical_analysis.html#string-and-bytes-literals. Lists A list contains a sequence of items. You can concisely instruct Python to perform repeated actions on the elements of a list. Let's first create a list of numbers as follows: In [8]: items = [1, 3, 0, 4, 1] Note the syntax we used to create the list: square brackets [], and commas , to separate the items. 
The built-in function len() returns the number of elements in a list: In [9]: len(items) Out[9]: 5 [box type="note" align="alignleft" class="" width=""]Python comes with a set of built-in functions, including print(), len(), max(), functional routines like filter() and map(), and container-related routines like all(), any(), range(), and sorted(). You will find the full list of built-in functions at https://docs.python.org/3.4/library/functions.html.[/box] Now, let's compute the sum of all elements in the list. Python provides a built-in function for this: In [10]: sum(items) Out[10]: 9 We can also access individual elements in the list, using the following syntax: In [11]: items[0] Out[11]: 1 In [12]: items[-1] Out[12]: 1 Note that indexing starts at 0 in Python: the first element of the list is indexed by 0, the second by 1, and so on. Also, -1 refers to the last element, -2, to the penultimate element, and so on. The same syntax can be used to alter elements in the list: In [13]: items[1] = 9 items Out[13]: [1, 9, 0, 4, 1] We can access sublists with the following syntax: In [14]: items[1:3] Out[14]: [9, 0] Here, 1:3 represents a slice going from element 1 included (this is the second element of the list) to element 3 excluded. Thus, we get a sublist with the second and third element of the original list. The first-included/last-excluded asymmetry leads to an intuitive treatment of overlaps between consecutive slices. Also, note that a sublist refers to a dynamic view of the original list, not a copy; changing elements in the sublist automatically changes them in the original list. Python provides several other types of containers: Tuples are immutable and contain a fixed number of elements: In [15]: my_tuple = (1, 2, 3) my_tuple[1] Out[15]: 2 Dictionaries contain key-value pairs. They are extremely useful and common: In [16]: my_dict = {'a': 1, 'b': 2, 'c': 3} print('a:', my_dict['a']) Out[16]: a: 1 In [17]: print(my_dict.keys()) Out[17]: dict_keys(['c', 'a', 'b']) There is no notion of order in a dictionary. However, the native collections module provides an OrderedDict structure that keeps the insertion order (see https://docs.python.org/3.4/library/collections.html). Sets, like mathematical sets, contain distinct elements: In [18]: my_set = set([1, 2, 3, 2, 1]) my_set Out[18]: {1, 2, 3} A Python object is mutable if its value can change after it has been created. Otherwise, it is immutable. For example, a string is immutable; to change it, a new string needs to be created. A list, a dictionary, or a set is mutable; elements can be added or removed. By contrast, a tuple is immutable, and it is not possible to change the elements it contains without recreating the tuple. See https://docs.python.org/3.4/reference/datamodel.html for more details. Loops We can run through all elements of a list using a for loop: In [19]: for item in items: print(item) Out[19]: 1 9 0 4 1 There are several things to note here: The for item in items syntax means that a temporary variable named item is created at every iteration. This variable contains the value of every item in the list, one at a time. Note the colon : at the end of the for statement. Forgetting it will lead to a syntax error! The statement print(item) will be executed for all items in the list. Note the four spaces before print: this is called the indentation. You will find more details about indentation in the next subsection. 
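As an extra sketch (illustrative, not from the original text), the same for loop pattern can accumulate a result instead of just printing each element; this is the hand-written equivalent of the built-in sum() function we used earlier:

In [ ]: total = 0
        for item in items:
            total = total + item
        total
Out[ ]: 15

The result is 15 because items is currently [1, 9, 0, 4, 1] after the modifications made above.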
Python supports a concise syntax to perform a given operation on all elements of a list, as follows:

In [20]: squares = [item * item for item in items]
squares
Out[20]: [1, 81, 0, 16, 1]

This is called a list comprehension. A new list is created here; it contains the squares of all numbers in the list. This concise syntax leads to highly readable and Pythonic code.

Indentation
Indentation refers to the spaces that may appear at the beginning of some lines of code. This is a particular aspect of Python's syntax. In most programming languages, indentation is optional and is generally used to make the code visually clearer. But in Python, indentation also has a syntactic meaning. Particular indentation rules need to be followed for Python code to be correct. In general, there are two ways to indent some text: by inserting a tab character (also referred to as \t), or by inserting a number of spaces (typically, four). It is recommended to use spaces instead of tab characters. Your text editor should be configured such that the Tab key on the keyboard inserts four spaces instead of a tab character. In the Notebook, indentation is automatically configured properly; so you shouldn't worry about this issue. The question only arises if you use another text editor for your Python code. Finally, what is the meaning of indentation? In Python, indentation delimits coherent blocks of code, for example, the contents of a loop, a conditional branch, a function, and other objects. Where other languages such as C or JavaScript use curly braces to delimit such blocks, Python uses indentation.

Conditional branches
Sometimes, you need to perform different operations on your data depending on some condition. For example, let's display all even numbers in our list:

In [21]: for item in items:
    if item % 2 == 0:
        print(item)
Out[21]: 0
4

Again, here are several things to note:

An if statement is followed by a boolean expression.
If a and b are two integers, the modulo operator a % b returns the remainder from the division of a by b. Here, item % 2 is 0 for even numbers, and 1 for odd numbers.
The equality is represented by a double equal sign == to avoid confusion with the assignment operator = that we use when we create variables.
Like with the for loop, the if statement ends with a colon :.
The part of the code that is executed when the condition is satisfied follows the if statement. It is indented. Indentation is cumulative: since this if is inside a for loop, there are eight spaces before the print(item) statement.

Python supports a concise syntax to select all elements in a list that satisfy certain properties. Here is how to create a sublist with only even numbers:

In [22]: even = [item for item in items if item % 2 == 0]
even
Out[22]: [0, 4]

This is also a form of list comprehension.

Functions
Code is typically organized into functions. A function encapsulates part of your code. Functions allow you to reuse bits of functionality without copy-pasting the code. Here is a function that tells whether an integer number is even or not:

In [23]: def is_even(number):
    """Return whether an integer is even or not."""
    return number % 2 == 0

There are several things to note here:

A function is defined with the def keyword.
After def comes the function name. A general convention in Python is to only use lowercase characters, and separate words with an underscore _. A function name generally starts with a verb.
The function name is followed by parentheses, with one or several variable names called the arguments.
These are the inputs of the function. There is a single argument here, named number. No type is specified for the argument. This is because Python is dynamically typed; you could pass a variable of any type. This function would work fine with floating point numbers, for example (the modulo operation works with floating point numbers in addition to integers). The body of the function is indented (and note the colon : at the end of the def statement). There is a docstring wrapped by triple quotes """. This is a particular form of comment that explains what the function does. It is not mandatory, but it is strongly recommended to write docstrings for the functions exposed to the user. The return keyword in the body of the function specifies the output of the function. Here, the output is a Boolean, obtained from the expression number % 2 == 0. It is possible to return several values; just use a comma to separate them (in this case, a tuple of Booleans would be returned). Once a function is defined, it can be called like this: In [24]: is_even(3) Out[24]: False In [25]: is_even(4) Out[25]: True Here, 3 and 4 are successively passed as arguments to the function. Positional and keyword arguments A Python function can accept an arbitrary number of arguments, called positional arguments. It can also accept optional named arguments, called keyword arguments. Here is an example: In [26]: def remainder(number, divisor=2): return number % divisor The second argument of this function, divisor, is optional. If it is not provided by the caller, it will default to the number 2, as shown here: In [27]: remainder(5) Out[27]: 1 There are two equivalent ways of specifying a keyword argument when calling a function. They are as follows: In [28]: remainder(5, 3) Out[28]: 2 In [29]: remainder(5, divisor=3) Out[29]: 2 In the first case, 3 is understood as the second argument, divisor. In the second case, the name of the argument is given explicitly by the caller. This second syntax is clearer and less error-prone than the first one. Functions can also accept arbitrary sets of positional and keyword arguments, using the following syntax: In [30]: def f(*args, **kwargs): print("Positional arguments:", args) print("Keyword arguments:", kwargs) In [31]: f(1, 2, c=3, d=4) Out[31]: Positional arguments: (1, 2) Keyword arguments: {'c': 3, 'd': 4} Inside the function, args is a tuple containing positional arguments, and kwargs is a dictionary containing keyword arguments. Passage by assignment When passing a parameter to a Python function, a reference to the object is actually passed (passage by assignment): If the passed object is mutable, it can be modified by the function If the passed object is immutable, it cannot be modified by the function Here is an example: In [32]: my_list = [1, 2] def add(some_list, value): some_list.append(value) add(my_list, 3) my_list Out[32]: [1, 2, 3] The add() function modifies an object defined outside it (in this case, the object my_list); we say this function has side-effects. A function with no side-effects is called a pure function: it doesn't modify anything in the outer context, and it deterministically returns the same result for any given set of inputs. Pure functions are to be preferred over functions with side-effects. Knowing this can help you spot out subtle bugs. There are further related concepts that are useful to know, including function scopes, naming, binding, and more. 
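Before the reference links below, here is a small sketch that makes the side-effect discussion above concrete. It is illustrative only (the function names are not from the text); it contrasts a function that mutates its argument with a pure function that returns a new list:

def add_with_side_effect(some_list, value):
    some_list.append(value)  # mutates the caller's list in place

def add_pure(some_list, value):
    return some_list + [value]  # builds and returns a new list

my_list = [1, 2]
add_with_side_effect(my_list, 3)
print(my_list)               # [1, 2, 3]: the original was modified
print(add_pure(my_list, 4))  # [1, 2, 3, 4]: a new list is returned
print(my_list)               # still [1, 2, 3]: the original is untouched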
Here are a couple of links: Passage by reference at https://docs.python.org/3/faq/programming.html#how-do-i-write-a-function-with-output-parameters-call-by-reference Naming, binding, and scope at https://docs.python.org/3.4/reference/executionmodel.html Errors Let's discuss errors in Python. As you learn, you will inevitably come across errors and exceptions. The Python interpreter will most of the time tell you what the problem is, and where it occurred. It is important to understand the vocabulary used by Python so that you can more quickly find and correct your errors. Let's see the following example: In [33]: def divide(a, b): return a / b In [34]: divide(1, 0) Out[34]: --------------------------------------------------------- ZeroDivisionError Traceback (most recent call last) <ipython-input-2-b77ebb6ac6f6> in <module>() ----> 1 divide(1, 0) <ipython-input-1-5c74f9fd7706> in divide(a, b) 1 def divide(a, b): ----> 2 return a / b ZeroDivisionError: division by zero Here, we defined a divide() function, and called it to divide 1 by 0. Dividing a number by 0 is an error in Python. Here, a ZeroDivisionError exception was raised. An exception is a particular type of error that can be raised at any point in a program. It is propagated from the innards of the code up to the command that launched the code. It can be caught and processed at any point. You will find more details about exceptions at https://docs.python.org/3/tutorial/errors.html, and common exception types at https://docs.python.org/3/library/exceptions.html#bltin-exceptions. The error message you see contains the stack trace, the exception type, and the exception message. The stack trace shows all function calls between the raised exception and the script calling point. The top frame, indicated by the first arrow ---->, shows the entry point of the code execution. Here, it is divide(1, 0), which was called directly in the Notebook. The error occurred while this function was called. The next and last frame is indicated by the second arrow. It corresponds to line 2 in our function divide(a, b). It is the last frame in the stack trace: this means that the error occurred there. Object-oriented programming Object-oriented programming (OOP) is a relatively advanced topic. Although we won't use it much in this book, it is useful to know the basics. Also, mastering OOP is often essential when you start to have a large code base. In Python, everything is an object. A number, a string, or a function is an object. An object is an instance of a type (also known as class). An object has attributes and methods, as specified by its type. An attribute is a variable bound to an object, giving some information about it. A method is a function that applies to the object. For example, the object 'hello' is an instance of the built-in str type (string). The type() function returns the type of an object, as shown here: In [35]: type('hello') Out[35]: str There are native types, like str or int (integer), and custom types, also called classes, that can be created by the user. In IPython, you can discover the attributes and methods of any object with the dot syntax and tab completion. For example, typing 'hello'.u and pressing Tab automatically shows us the existence of the upper() method: In [36]: 'hello'.upper() Out[36]: 'HELLO' Here, upper() is a method available to all str objects; it returns an uppercase copy of a string. A useful string method is format(). 
This simple and convenient templating system lets you generate strings dynamically, as shown in the following example: In [37]: 'Hello {0:s}!'.format('Python') Out[37]: Hello Python The {0:s} syntax means "replace this with the first argument of format(), which should be a string". The variable type after the colon is especially useful for numbers, where you can specify how to display the number (for example, .3f to display three decimals). The 0 makes it possible to replace a given value several times in a given string. You can also use a name instead of a position—for example 'Hello {name}!'.format(name='Python'). Some methods are prefixed with an underscore _; they are private and are generally not meant to be used directly. IPython's tab completion won't show you these private attributes and methods unless you explicitly type _ before pressing Tab. In practice, the most important thing to remember is that appending a dot . to any Python object and pressing Tab in IPython will show you a lot of functionality pertaining to that object. Functional programming Python is a multi-paradigm language; it notably supports imperative, object-oriented, and functional programming models. Python functions are objects and can be handled like other objects. In particular, they can be passed as arguments to other functions (also called higher-order functions). This is the essence of functional programming. Decorators provide a convenient syntax construct to define higher-order functions. Here is an example using the is_even() function from the previous Functions section: In [38]: def show_output(func): def wrapped(*args, **kwargs): output = func(*args, **kwargs) print("The result is:", output) return wrapped The show_output() function transforms an arbitrary function func() to a new function, named wrapped(), that displays the result of the function, as follows: In [39]: f = show_output(is_even) f(3) Out[39]: The result is: False Equivalently, this higher-order function can also be used with a decorator, as follows: In [40]: @show_output def square(x): return x * x In [41]: square(3) Out[41]: The result is: 9 You can find more information about Python decorators at https://en.wikipedia.org/wiki/Python_syntax_and_semantics#Decorators and at http://www.thecodeship.com/patterns/guide-to-python-function-decorators/. Python 2 and 3 Let's finish this section with a few notes about Python 2 and Python 3 compatibility issues. There are still some Python 2 code and libraries that are not compatible with Python 3. Therefore, it is sometimes useful to be aware of the differences between the two versions. One of the most obvious differences is that print is a statement in Python 2, whereas it is a function in Python 3. Therefore, print "Hello" (without parentheses) works in Python 2 but not in Python 3, while print("Hello") works in both Python 2 and Python 3. 
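One standard way to smooth over these differences is to enable the Python 3 behavior explicitly in Python 2 modules through the __future__ mechanism mentioned below. Here is a minimal sketch; whether you need it depends on which versions you target:

from __future__ import print_function, division

print("Hello")  # print() is a function in both versions with this import
print(3 / 2)    # 1.5 in both Python 2 and Python 3
print(3 // 2)   # 1 in both versions (integer division)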
There are several non-mutually exclusive options to write portable code that works with both versions:

__future__: A built-in module supporting backward-incompatible Python syntax
2to3: A built-in Python module to port Python 2 code to Python 3
six: An external lightweight library for writing compatible code

Here are a few references:

Official Python 2/3 wiki page at https://wiki.python.org/moin/Python2orPython3
The Porting to Python 3 book, by CreateSpace Independent Publishing Platform at http://www.python3porting.com/bookindex.html
2to3 at https://docs.python.org/3.4/library/2to3.html
six at https://pythonhosted.org/six/
__future__ at https://docs.python.org/3.4/library/__future__.html

The IPython Cookbook contains an in-depth recipe about choosing between Python 2 and 3, and how to support both.

Going beyond the basics
You now know the fundamentals of Python, the bare minimum that you will need in this book. As you can imagine, there is much more to say about Python. Following are a few further basic concepts that are often useful and that we cannot cover here, unfortunately. You are highly encouraged to have a look at them in the references given at the end of this section:

range and enumerate
pass, break, and continue, to be used in loops
Working with files
Creating and importing modules
The Python standard library provides a wide range of functionality (OS, network, file systems, compression, mathematics, and more)

Here are some slightly more advanced concepts that you might find useful if you want to strengthen your Python skills:

Regular expressions for advanced string processing
Lambda functions for defining small anonymous functions
Generators for controlling custom loops
Exceptions for handling errors
with statements for safely handling contexts
Advanced object-oriented programming
Metaprogramming for modifying Python code dynamically
The pickle module for persisting Python objects on disk and exchanging them across a network

Finally, here are a few references:

Getting started with Python: https://www.python.org/about/gettingstarted/
A Python tutorial: https://docs.python.org/3/tutorial/index.html
The Python Standard Library: https://docs.python.org/3/library/index.html
Interactive tutorial: http://www.learnpython.org/
Codecademy Python course: http://www.codecademy.com/tracks/python
Language reference (expert level): https://docs.python.org/3/reference/index.html
Python Cookbook, by David Beazley and Brian K. Jones, O'Reilly Media (advanced level, highly recommended if you want to become a Python expert)

Summary
In this article, we have seen how to launch the IPython console and Jupyter Notebook, the different aspects of the Notebook and its user interface, the structure of the notebook cell, keyboard shortcuts that are available in the Notebook interface, and the basics of Python.

First Principle and a Useful Way to Think

Packt
08 Oct 2015
8 min read
In this article, by Timothy Washington, author of the book Clojure for Finance, we will cover the following topics: Modeling the stock price activity Function evaluation First-Class functions Lazy evaluation Basic Clojure functions and immutability Namespace modifications and creating our first function (For more resources related to this topic, see here.) Modeling the stock price activity There are many types of banks. Commercial entities (large stores, parking areas, hotels, and so on) that collect and retain credit card information, are either quasi banks, or farm out credit operations to bank-like processing companies. There are more well-known consumer banks, which accept demand deposits from the public. There are also a range of other banks such as commercial banks, insurance companies and trusts, credit unions, and in our case, investment banks. As promised, this article will slowly build up a set of lagging price indicators that follow a moving stock price time series. In order to do that, I think it's useful to touch on stock markets, and to crudely model stock price activity. A stock (or equity) market, is a collection of buyers and sellers trading economic assets (usually companies). The stock (or shares) of those companies can be equities listed on an exchange (New York Stock Exchange, London Stock Exchange, and others), or may be those traded privately. In this exercise, we will do the following: Crudely model the stock price movement, which will give us a test bed for writing our lagging price indicators Introduce some basic features of the Clojure language Function evaluation The Clojure website has a cheatsheet (http://clojure.org/cheatsheet) with all of the language's core functions. The first function we'll look at is rand, a function that randomly gives you a number within a given range. So in your edgar/ project, launch a repl with the lein repl shell command. After a few seconds, you will enter repl (Read-Eval-Print-Loop). Again, Clojure functions are executed by being placed in the first position of a list. The function's arguments are placed directly afterwards. In your repl, evaluate (rand 35) or (rand 99) or (rand 43.21) or any number you fancy Run it many times to see that you can get any different floating point number, within 0 and the upper bound of the number you provided First-Class functions The next functions we'll look at are repeatedly and fn. repeatedly is a function that takes another function and returns an infinite (or length n if supplied) lazy sequence of calls to the latter function. This is our first encounter of a function that can take another function. We'll also encounter functions that return other functions. Described as First-Class functions, this falls out of lambda calculus and is one of the central features of functional programming. As such, we need to wrap our previous (rand 35) call in another function. fn is one of Clojure's core functions, and produces an anonymous, unnamed function. We can now supply this function to repeatedly. In your repl, if you evaluate (take 25 (repeatedly (fn [] (rand 35)))), you should see a long list of floating point numbers with the list's tail elided. Lazy evaluation We only took the first 25 of the (repeatedly (fn [] (rand 35))) result list, because the list (actually a lazy sequence) is infinite. Lazy evaluation (or laziness) is a common feature in functional programming languages. 
Being infinite, Clojure chooses to delay evaluating most of the list until it's needed by some other function that pulls out some values. Laziness benefits us by increasing performance and letting us more easily construct control flow. We can avoid needless calculation, repeated evaluations, and potential error conditions in compound expressions. Let's try to pull out some values with the take function. take itself returns another lazy sequence of the first n items of a collection. Evaluating (take 25 (repeatedly (fn [] (rand 35)))) will pull out the first 25 repeatedly calls to rand, which generates a float between 0 and 35.

Basic Clojure functions and immutability
There are many operations we can perform over our result list (or lazy sequence). One of the main approaches of functional programming is to take a data structure and perform operations on top of it to produce a new data structure, or some atomic result (a string, number, and so on). This may sound inefficient at first. But most FP languages employ something called immutability to make these operations efficient. Immutable data structures are the ones that cannot change once they've been created. This is feasible as most immutable FP languages use some kind of structural data sharing between an original and a modified version. The idea is that if we evaluate (conj [1 2 3] 4), the resulting [1 2 3 4] vector shares the original vector of [1 2 3]. The only additional resource that's assigned is for any novelty that's been introduced to the data structure (the 4). There's a more detailed explanation of (for example) Clojure's persistent vectors available online. Some of the core functions we'll use are:

conj: This conjoins an element to a collection—the collection decides where. So conjoining an element to a vector (conj [1 2 3] 4) versus conjoining an element to a list (conj '(1 2 3) 4) yield different results. Try it in your repl.
map: This passes a function over one or many lists, yielding another list. (map inc [1 2 3]) increments each element by 1.
reduce (or left fold): This passes a function over each element, accumulating one result. (reduce + (take 100 (repeatedly (fn [] (rand 35))))) sums the list.
filter: This constrains the input by some condition.
>=: This is a conditional function, which tests whether the first argument is greater than or equal to the second argument. Try (>= 4 9) and (>= 9 1).
fn: This is a function that creates a function. This unnamed or anonymous function can have any instructions you choose to put in there. So if we only want numbers above 12, we can put that assertion in a predicate function. Try entering the below expression into your repl:

(take 25 (filter (fn [x] (>= x 12)) (repeatedly (fn [] (rand 35)))))

Modifying the namespaces and creating our first function
We now have the basis for creating a function. It will return a lazy infinite sequence of floating point numbers, within an upper and lower bound. defn is a Clojure function, which takes an anonymous function, and binds a name to it in a given namespace. A Clojure namespace is an organizational tool for mapping human-readable names to things like functions, named data structures and such. Here, we're going to bind our function to the name generate-prices in our current namespace. You'll notice that our function is starting to span multiple lines. This will be a good time to author the code in your text editor of choice. I'll be using Emacs: Open your text editor, and add this code to the file called src/edgar/core.clj. Make sure that (ns edgar.core) is at the top of that file.
After adding the following code, you can then restart the repl. (load "edgar/core") uses the load function to load the Clojure code in your src/edgar/core.clj file:

(defn generate-prices [lower-bound upper-bound]
  (filter (fn [x] (>= x lower-bound))
          (repeatedly (fn [] (rand upper-bound)))))

The Read-Eval-Print-Loop
In our repl, we can pull in code in various namespaces, with the require function. This applies to the src/edgar/core.clj file we've just edited. That code will be in the edgar.core namespace: In your repl, evaluate (require '[edgar.core :as c]). c is just a handy alias we can use instead of the long name. You can then generate random prices within an upper and lower bound. Take the first 10 of them like this (take 10 (c/generate-prices 12 35)). You should see results akin to the following output. All elements should be within the range of 12 to 35:

(29.60706184716407 12.507593971664075 19.79939384292759 31.322074615579716 19.737852534147326 25.134649707849572 19.952195022152488 12.94569843904663 23.618693004455086 14.695872710062428)

There's a subtle abstraction in the preceding code that deserves attention. (require '[edgar.core :as c]) introduces the quote symbol. ' is the reader shorthand for the quote function. So the equivalent invocation would be (require (quote [edgar.core :as c])). Quoting a form tells the Clojure reader not to evaluate the subsequent expression (or form). So evaluating '(a b c) returns a list of three symbols, without trying to evaluate a. Even though those symbols haven't yet been assigned, that's okay, because that expression (or form) has not yet been evaluated. But that begs a larger question. What is the reader? Clojure (and all Lisps) are what's known as homoiconic. This means that Clojure code is also data. And data can be directly output and evaluated as code. The reader is the thing that parses our src/edgar/core.clj file (or (+ 1 1) input from the repl prompt), and produces the data structures that are evaluated. read and eval are the two essential processes by which Clojure code runs. The evaluation result is printed (or output), usually to the standard output device. Then we loop the process back to the read function. So, when the repl reads your src/edgar/core.clj file, it's directly transforming that text representation into data and evaluating it. A few things fall out of that. For example, it becomes simpler for Clojure programs to directly read, transform and write out other Clojure programs. The implications of that will become clearer when we look at macros. But for now, know that there are ways to modify or delay the evaluation process, in this case by quoting a form.

Summary
In this article, we learned about basic features of the Clojure language and how to model the stock price activity. Besides these, we also learned about function evaluation, First-Class functions, the lazy evaluation method, namespace modifications, and creating our first function.

Resources for Article:
Further resources on this subject:
Performance by Design [article]
Big Data [article]
The Observer Pattern [article]

Integrating Elasticsearch with the Hadoop ecosystem

Packt
07 Oct 2015
14 min read
In this article by Vishal Shukla, author of the book Elasticsearch for Hadoop, we will take a look at how ES-Hadoop can integrate with Pig and Spark with ease. Elasticsearch is great at getting insights into the indexed data. The Hadoop ecosystem does a great job in making Hadoop easily usable for different users by providing a comfortable interface. Some of the examples are Hive and Pig. Apart from these, Hadoop integrates well with other computing engines and platforms, such as Spark and Cascading. (For more resources related to this topic, see here.)

Pigging out Elasticsearch
For many use cases, Pig is one of the easiest ways to fiddle around with the data in the Hadoop ecosystem. Pig wins when it comes to ease of use and simple syntax for designing data flow pipelines without getting into complex programming. Assuming that you know Pig, we will cover how to move the data to and from Elasticsearch. If you don't know Pig yet, never mind. You can still carry on with the steps, and by the end of the article, you will at least know how to use Pig to perform data ingestion and reading with Elasticsearch.

Setting up Apache Pig for Elasticsearch
Let's start by setting up Apache Pig. At the time of writing this article, the latest Pig version available is 0.15.0. You can use the following steps to set up the same version:

First, download the Pig distribution using the following command:
$ sudo wget -O /usr/local/pig.tar.gz http://mirrors.sonic.net/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz

Then, extract Pig to the desired location and rename it to a convenient name:
$ cd /usr/local
$ sudo tar -xvf pig.tar.gz
$ sudo mv pig-0.15.0 pig

Now, export the required environment variables by appending the following two lines in the /home/eshadoop/.bashrc file:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin

You can either log out and relogin to see the newly set environment variables or source the environment configuration with the following command:
$ source ~/.bashrc

Now, start the job history server daemon with the following command:
$ mr-jobhistory-daemon.sh start historyserver

You should see the Pig console with the following command:
$ pig
grunt>

It's easy to forget to start the job history daemon once you restart your machine or VM. You may make this daemon run on start up, or you need to ensure this manually.

Now, we have Pig up and running. In order to use Pig with Elasticsearch, we must ensure that the ES-Hadoop JAR file is available in the Pig classpath. Let's take the ES-Hadoop JAR file and import it to HDFS using the following steps:

First, download the ES-Hadoop JAR used to develop the examples in this article, as shown in the following command:
$ wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/2.1.1/elasticsearch-hadoop-2.1.1.jar

Then, create a convenient local directory for the JAR as follows:
$ sudo mkdir /opt/lib

Now, import the JAR to HDFS:
$ hadoop fs -mkdir /lib
$ hadoop fs -put elasticsearch-hadoop-2.1.1.jar /lib/elasticsearch-hadoop-2.1.1.jar

Throughout this article, we will use a crime dataset that is tailored from the open dataset provided at https://data.cityofchicago.org/. This tailored dataset can be downloaded from http://www.packtpub.com/support, where all the code files required for this article are available. Once you have downloaded the dataset, import it to HDFS at /ch07/crime_data.csv.

Importing data to Elasticsearch
Let's import the crime dataset to Elasticsearch using Pig with ES-Hadoop.
This provides the EsStorage class as Pig Storage. In order to use the EsStorage class, you need to have a registered ES-Hadoop JAR with Pig. You can register the JAR located in the local filesystem, HDFS, or other shared filesystems. The REGISTER command registers a JAR file that contains UDFs (User-defined functions) with Pig, as shown in the following code: grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar; Then, load the CSV data file as a relation with the following code: grunt> SOURCE = load '/ch07/crimes_dataset.csv' using PigStorage(',') as (id:chararray, caseNumber:chararray, date:datetime, block:chararray, iucr:chararray, primaryType:chararray, description:chararray, location:chararray, arrest:boolean, domestic:boolean, lat:double,lon:double); This command reads the CSV fields and maps each token in the data to the respective field in the preceding command. The resulting relation, SOURCE, represents a relation with the Bag data structure that contains multiple Tuples. Now, generate the target Pig relation that has the structure that matches closely to the target Elasticsearch index mapping, as shown in the following code: grunt> TARGET = foreach SOURCE generate id, caseNumber, date, block, iucr, primaryType, description, location, arrest, domestic, TOTUPLE(lon, lat) AS geoLocation; Here, we need the nested object with the geoLocation name in the target Elasticsearch document. We can achieve this with a Tuple to represent the lat and lon fields. TOTUPLE() helps us to create this tuple. We then assigned the geoLocation alias for this tuple. Let's store the TARGET relationto the Elasticsearch index with the following code: grunt> STORE TARGET INTO 'esh_pig/crimes' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.timeout = 5m', 'es.index.auto.create = true', 'es.mapping.names=arrest:isArrest, domestic:isDomestic', 'es.mapping.id=id'); We can specify the target index and type to store indexed documents. The EsStorage class can accept multiple Elasticsearch configurations.es.mapping.names maps the Pig field name to Elasticsearch document's field name. You can use Pig's field id to assign a custom _id value for the Elasticsearch document using the es.mapping.id option. Similarly, you can set the _ttl and _timestamp metadata fields as well. Pig uses just one reducer in the default configuration. It is recommended to change this behavior to have a parallelism that matches the number of shards available, as shown in the following command: grunt> SET default_parallel 5; Pig also combines the input splits, irrespective of its size. This makes it efficient for small files by reducing the number of mappers. However, this will give performance issues for large files. You can disable this behavior in the Pig script, as shown in the following command: grunt> SET pig.splitCombination FALSE; Executing the preceding commands will create the Elasticsearch index and import crime data documents. If you observe the created documents in Elasticsearch, you can see the geoLocation value isan array in the [-87.74274476, 41.87404405]format. This is because by default, ES-Hadoop ignores the tuple field names and simply converts them as an ordered array. 
If you wish to make your geoLocation field look similar to the key/value-based object with the lat/lon keys, you can do so by including the following configuration in EsStorage: es.mapping.pig.tuple.use.field.names=true

Writing from the JSON source
If you have inputs as a well-formed JSON file, you can avoid conversion and transformations and directly pass the JSON document to Elasticsearch for indexing purposes. You may have the JSON data in Pig as chararray, bytearray, or in any other form that translates to well-formed JSON by calling the toString() method, as shown in the following code:
grunt> JSON_DATA = LOAD '/ch07/crimes.json' USING PigStorage() AS (json:chararray);
grunt> STORE JSON_DATA INTO 'esh_pig/crimes_json' USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');

Type conversions
Take a look at the type mapping of the esh_pig index in Elasticsearch. It maps the geoLocation type to double. This is done because Elasticsearch inferred the double type based on the field type we specified in Pig. To map geoLocation to geo_point, you must create the Elasticsearch mapping for it manually before executing the script. Although Elasticsearch provides a data type detection based on the type of field in the incoming document, it is always good to create the type mapping beforehand in Elasticsearch. This is a one-time activity that you should do. Then, you can run the MapReduce, Pig, Hive, Cascading, or Spark jobs multiple times. This will avoid any surprises in the type detection. For your reference, here is a list of some of the field types of Pig and Elasticsearch that map to each other. The table doesn't list no-brainer and absolutely intuitive type mappings:

Pig type -> Elasticsearch type
chararray -> string
bytearray -> binary
tuple -> array (default) or object
bag -> array
map -> object
bigdecimal -> not supported
biginteger -> not supported

Reading data from Elasticsearch
Reading data from Elasticsearch using Pig is as simple as writing a single command with the Elasticsearch query. Here is a snippet of how to print tuples for crimes related to theft:
grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar
grunt> ES = LOAD 'esh_pig/crimes' using org.elasticsearch.hadoop.pig.EsStorage('{"query" : { "term" : { "primaryType" : "theft" } } }');
grunt> dump ES;

Executing the preceding commands will print the tuples to the Pig console.

Giving Spark to Elasticsearch
Spark is a distributed computing system that provides a huge performance boost compared to Hadoop MapReduce. It works on an abstraction called RDD (Resilient Distributed Datasets), which can be created for any data residing in Hadoop. Without any surprises, ES-Hadoop provides easy integration with Spark by enabling the creation of RDDs from the data in Elasticsearch. Spark's increasing support for integrating with various data sources, such as HDFS, Parquet, Avro, S3, Cassandra, relational databases, and streaming data makes it special when it comes to data integration. This means that when you use ES-Hadoop with Spark, you can make all these sources integrate with Elasticsearch easily.
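Before walking through the Java-based Spark setup and jobs in the next sections, here is a compact sketch of what reading the index created in the Pig section can look like from PySpark, using the same org.elasticsearch.spark.sql data source covered later in this article. This is an illustrative sketch rather than code from the book: the file name, master setting, and paths are assumptions, and it presumes a Spark 1.4 installation with the ES-Hadoop JAR passed via --jars and the esh_pig/crimes index already populated.

# Submit with something like (paths are examples):
#   bin/spark-submit --jars /opt/lib/elasticsearch-hadoop-2.1.1.jar es_read_sketch.py
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("esh-pyspark-sketch").setMaster("local[4]")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Load the esh_pig/crimes index through the ES-Hadoop Spark SQL data source.
crimes = (sqlContext.read
          .format("org.elasticsearch.spark.sql")
          .option("es.nodes", "localhost")
          .load("esh_pig/crimes"))

# Show a few theft-related records.
crimes.filter(crimes.primaryType == "THEFT").show(5)

sc.stop()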
Setting up Spark
In order to set up Apache Spark to execute a job, you can perform the following steps:

First, download the Apache Spark distribution with the following command:
$ sudo wget -O /usr/local/spark.tgz http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz

Then, extract Spark to the desired location and rename it to a convenient name, as shown in the following command:
$ cd /usr/local
$ sudo tar -xvf spark.tgz
$ sudo mv spark-1.4.1-bin-hadoop2.4 spark

Importing data to Elasticsearch
To import the crime dataset to Elasticsearch with Spark, let's see how we can write a Spark job. We will continue using Java to write Spark jobs for consistency. Here are the driver program's snippets:

SparkConf conf = new SparkConf().setAppName("esh-spark").setMaster("local[4]");
conf.set("es.index.auto.create", "true");
JavaSparkContext context = new JavaSparkContext(conf);

Set up the SparkConf object to configure the Spark job. As always, you can also set most options (such as es.index.auto.create) and other configurations that we have seen throughout the article. Using this configuration, we created the JavaSparkContext object as follows:

JavaRDD<String> textFile = context.textFile("hdfs://localhost:9000/ch07/crimes_dataset.csv");

Read the crime data CSV file as a JavaRDD. Here, the RDD is still of the type String, representing each line:

JavaRDD<Crime> dataSplits = textFile.map(new Function<String, Crime>() {
    @Override
    public Crime call(String line) throws Exception {
        CSVParser parser = CSVParser.parse(line, CSVFormat.RFC4180);
        Crime c = new Crime();
        CSVRecord record = parser.getRecords().get(0);
        c.setId(record.get(0));
        ..
        ..
        String lat = record.get(10);
        String lon = record.get(11);
        Map<String, Double> geoLocation = new HashMap<>();
        geoLocation.put("lat", StringUtils.isEmpty(lat) ? null : Double.parseDouble(lat));
        geoLocation.put("lon", StringUtils.isEmpty(lon) ? null : Double.parseDouble(lon));
        c.setGeoLocation(geoLocation);
        return c;
    }
});

In the preceding snippet, we called the map() method on the JavaRDD to map each input line to a Crime object. Note that we created a simple JavaBean class called Crime that implements the Serializable interface and maps to the Elasticsearch document structure. Using CSVParser, we parsed each field into the Crime object. We mapped the nested geoLocation object by embedding a Map in the Crime object. This map is populated with the lat and lon fields. This map() method returns another JavaRDD that contains the Crime objects, as shown in the following code:

JavaEsSpark.saveToEs(dataSplits, "esh_spark/crimes");

Save the JavaRDD<Crime> to Elasticsearch with the JavaEsSpark class provided by Elasticsearch. For all the ES-Hadoop integrations, such as Pig, Hive, Cascading, Apache Storm, and Spark, you can use all the standard ES-Hadoop configurations and techniques. This includes dynamic/multiresource writes with a pattern similar to esh_spark/{primaryType}, and using JSON strings to directly import the data to Elasticsearch as well. To control the Elasticsearch document metadata that gets indexed, you can use the saveToEsWithMeta() method of JavaEsSpark. You can pass an instance of JavaPairRDD that contains Tuple2<Metadata, Object>, where Metadata represents a map that has the key/value pairs of the document metadata fields, such as id, ttl, timestamp, and version.

Using SparkSQL
ES-Hadoop also bridges Elasticsearch with the SparkSQL module. SparkSQL 1.3+ versions provide the DataFrame abstraction that represents a collection of Row objects.
We will not discuss the details of DataFrame here. ES-Hadoop lets you persist your DataFrame instance to Elasticsearch transparently. Let's see how we can do this with the following code:

SQLContext sqlContext = new SQLContext(context);
DataFrame df = sqlContext.createDataFrame(dataSplits, Crime.class);

Create an SQLContext instance using the JavaSparkContext instance. Using the SQLContext instance, you can create a DataFrame by calling the createDataFrame() method and passing the existing JavaRDD<T> and Class<T>, where T is a JavaBean class that implements the Serializable interface. Note that the class passed is required to infer a schema for the DataFrame. If you wish to use a non-JavaBean-based RDD, you can create the schema manually. The article source code contains the implementations of both approaches for your reference. Take a look at the following code:

JavaEsSparkSQL.saveToEs(df, "esh_sparksql/crimes_reflection");

Once you have the DataFrame instance, you can save it to Elasticsearch with the JavaEsSparkSQL class, as shown in the preceding code.

Reading data from Elasticsearch
Here is the snippet of SparkEsReader that finds crimes related to theft:

JavaRDD<Map<String, Object>> esRDD = JavaEsSpark.esRDD(context, "esh_spark/crimes",
        "{\"query\" : { \"term\" : { \"primaryType\" : \"theft\" } } }").values();
for (Map<String, Object> item : esRDD.collect()) {
    System.out.println(item);
}

We used the same JavaEsSpark class to create an RDD with documents that match the Elasticsearch query.

Using SparkSQL
ES-Hadoop provides an org.elasticsearch.spark.sql data source provider to read the data from Elasticsearch using SparkSQL, as shown in the following code:

Map<String, String> options = new HashMap<>();
options.put("pushdown", "true");
options.put("es.nodes", "localhost");

DataFrame df = sqlContext.read()
    .options(options)
    .format("org.elasticsearch.spark.sql")
    .load("esh_sparksql/crimes_reflection");

The preceding code snippet uses the org.elasticsearch.spark.sql data source to load data from Elasticsearch. You can set the pushdown option to true to push the query execution down to Elasticsearch. This greatly increases its efficiency as the query execution is collocated where the data resides, as shown in the following code:

df.registerTempTable("crimes");
DataFrame theftCrimes = sqlContext.sql("SELECT * FROM crimes WHERE primaryType='THEFT'");
for (Row row : theftCrimes.javaRDD().collect()) {
    System.out.println(row);
}

We registered a temporary table with the DataFrame and executed the SQL query on the SQLContext. Note that we need to collect the final results locally to print them in the driver class.

Summary
In this article, we looked at the various Hadoop ecosystem technologies. We set up Pig with ES-Hadoop and developed the script to interact with Elasticsearch. You also learned how to use ES-Hadoop to integrate Elasticsearch with Spark and empower it with the powerful SQL engine SparkSQL.

Resources for Article:
Further resources on this subject:
Extending ElasticSearch with Scripting [Article]
Elasticsearch Administration [Article]
Downloading and Setting Up ElasticSearch [Article]

Introduction to Data Analysis and Libraries

Packt
07 Oct 2015
13 min read
In this article by Martin Czygan and Phuong Vothihong, the authors of the book Getting Started with Python Data Analysis, Data is raw information that can exist in any form, which is either usable or not. We can easily get data everywhere in our life; for example, the price of gold today is $ 1.158 per ounce. It does not have any meaning, except describing the gold price. This also shows that data is useful based on context. (For more resources related to this topic, see here.) With relational data connection, information appears and allows us to expand our knowledge beyond the range of our senses. When we possess gold price data gathered overtime, one information we might have is that the price has continuously risen from $1.152 to $1.158 for three days. It is used by someone who tracks gold prices. Knowledge helps people to create value in their lives and work. It is based on information that is organized, synthesized, or summarized to enhance comprehension, awareness, or understanding. It represents a state or potential for action and decisions. When the gold price continuously increases for three days, it will lightly decrease on the next day; this is useful knowledge. The following figure illustrates the steps from data to knowledge; we call this process the data analysis process and we will introduce it in the next section: In this article, we will cover the following topics: Data analysis and process Overview of libraries in data analysis using different programming languages Common Python data analysis libraries Data analysis and process Data is getting bigger and more diversified every day. Therefore, analyzing and processing data to advance human knowledge or to create value are big challenges. To tackle these challenges, you will need domain knowledge and a variety of skills, drawing from areas such as computer science, artificial intelligence (AI) and machine learning (ML), statistics and mathematics, and knowledge domain, as shown in the following figure: Let's us go through the Data analysis and it's domain knowledge: Computer science: We need this knowledge to provide abstractions for efficient data processing. A basic Python programming experience is required. We will introduce Python libraries used in data analysis. Artificial intelligence and machine learning: If computer science knowledge helps us to program data analysis tools, artificial intelligence and machine learning help us to model the data and learn from it in order to build smart products. Statistics and mathematics: We cannot extract useful information from raw data if we do not use statistical techniques or mathematical functions. Knowledge domain: Besides technology and general techniques, it is important to have an insight into the specific domain. What do the data fields mean? What data do we need to collect? Based on the expertise, we explore and analyze raw data by applying the above techniques, step by step. Data analysis is a process composed of the following steps: Data requirements: We have to define what kind of data will be collected based on the requirements or problem analysis. For example, if we want to detect a user's behavior while reading news on the internet, we should be aware of visited article links, date and time, article categories, and the user's time spent on different pages. Data collection: Data can be collected from a variety of sources: mobile, personal computer, camera, or recording device. 
It may also be obtained through different ways: communication, event, and interaction between person and person, person and device, or device and device. Data appears whenever and wherever in the world. The problem is, how can we find and gather it to solve our problem? This is the mission of this step. Data processing: Data that is initially obtained must be processed or organized for analysis. This process is performance-sensitive: How fast can we create, insert, update, or query data? For building a real product that has to process big data, we should consider this step carefully. What kind of database should we use to store data? What kind of data structure, such as analysis, statistics, or visualization, is suitable for our purposes? Data cleaning: After being processed and organized, the data may still contain duplicates or errors. Therefore, we need a cleaning step to reduce those situations and increase the quality of the results in the following steps. Common tasks include record matching, deduplication, or column segmentation. Depending on the type of data, we can apply several types of data cleaning. For example, a user's history of a visited news website might contain a lot of duplicate rows, because the user might have refreshed certain pages many times. For our specific issue, these rows might not carry any meaning when we explore the user's behavior. So, we should remove them before saving it to our database. Another situation we may encounter is click fraud on news—someone just wants to improve their website ranking or sabotage the website. In this case, the data will not help us to explore a user's behavior. We can use thresholds to check whether a visit page event comes from a real person or from malicious software. Exploratory data analysis: Now, we can start to analyze data through a variety of techniques referred to as exploratory data analysis. We may detect additional problems in data cleaning or discover requests for further data. Therefore, these steps may be iterative and repeated throughout the whole data analysis process. Data visualization techniques are also used to examine the data in graphs or charts. Visualization often facilitates the understanding of data sets, especially, if they are large or high-dimensional. Modelling and algorithms: A lot of mathematical formulas and algorithms may be applied to detect or predict useful knowledge from the raw data. For example, we can use similarity measures to cluster users who have exhibited similar news reading behavior and recommend articles of interest to them next time. Alternatively, we can detect users' gender based on their news reading behavior by applying classification models such as Support Vector Machine (SVM) or linear regression. Depending on the problem, we may use different algorithms to get an acceptable result. It can take a lot of time to evaluate the accuracy of the algorithms and to choose the best one to implement for a certain product. Data product: The goal of this step is to build data products that receive data input and generate output according to the problem requirements. We will apply computer science knowledge to implement our selected algorithms as well as manage the data storage. Overview of libraries in data analysis There are numerous data analysis libraries that help us to process and analyze data. They use different programming languages and have different advantages as well as disadvantages of solving various data analysis problems. 
Now, we introduce some common libraries that may be useful for you. They should give you an overview of libraries in the field. However, the rest of this focuses on Python-based libraries. Some of the libraries that use the Java language for data analysis are as follows: Weka: This is the library that I got familiar with, the first time I learned about data analysis. It has a graphical user interface that allows you to run experiments on a small dataset. This is great if you want to get a feel for what is possible in the data processing space. However, if you build a complex product, I think it is not the best choice because of its performance: sketchy API design, non-optimal algorithms, and little documentation (http://www.cs.waikato.ac.nz/ml/weka/). Mallet: This is another Java library that is used for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications on text. There is an add-on package to Mallet, called GRMM, that contains support for inference in general, graphical models, and training of Conditional random fields (CRF) with arbitrary graphical structure. In my experience, the library performance as well as the algorithms are better than Weka. However, its only focus is on text processing problems. The reference page is at http://mallet.cs.umass.edu/. Mahout: This is Apache's machine learning framework built on top of Hadoop; its goal is to build a scalable machine learning library. It looks promising, but comes with all the baggage and overhead of Hadoop. The Homepage is at http://mahout.apache.org/. Spark: This is a relatively new Apache project; supposedly up to a hundred times faster than Hadoop. It is also a scalable library that consists of common machine learning algorithms and utilities. Development can be done in Python as well as in any JVM language. The reference page is at https://spark.apache.org/docs/1.5.0/mllib-guide.html. Here are a few libraries that are implemented in C++: Vowpal Wabbit: This library is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. It has been used to learn a tera-feature (1012) dataset on 1000 nodes in one hour. More information can be found in the publication at http://arxiv.org/abs/1110.4198. MultiBoost: This package is a multiclass, multilabel, and multitask classification boosting software implemented in C++. If you use this software, you should refer to the paper published in 2012, in the Journal of Machine Learning Research, MultiBoost: A Multi-purpose Boosting Package, D.Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl. MLpack: This is also a C++ machine learning library, developed by the Fundamental Algorithmic and Statistical Tools Laboratory (FASTLab) at Georgia Tech. It focusses on scalability, speed, and ease-of-use and was presented at the BigLearning workshop of NIPS 2011. Its homepage is at http://www.mlpack.org/about.html. Caffe: The last C++ library we want to mention is Caffe. This is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. You can find more information about it at http://caffe.berkeleyvision.org/. Other libraries for data processing and analysis are as follows: Statsmodels: This is a great Python library for statistical modelling and is mainly used for predictive and exploratory analysis. 
Modular toolkit for data processing (MDP):This is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures (http://mdp-toolkit.sourceforge.net/index.html). Orange: This is an open source data visualization and analysis for novices and experts. It is packed with features for data analysis and has add-ons for bioinformatics and text mining. It contains an implementation of self-organizing maps, which sets it apart from the other projects as well (http://orange.biolab.si/). Mirador: This is a tool for the visual exploration of complex datasets supporting Mac and Windows. It enables users to discover correlation patterns and derive new hypotheses from data (http://orange.biolab.si/). RapidMiner: This is another GUI-based tool for data mining, machine learning, and predictive analysis (https://rapidminer.com/). Theano: This bridges the gap between Python and lower-level languages. Theano gives very significant performance gains, particularly for large matrix operations and is, therefore, a good choice for deep learning models. However, it is not easy to debug because of the additional compilation layer. Natural language processing toolkit (NLTK): This is written in Python with very unique and salient features. Here, I could not list all libraries for data analysis. However, I think the above libraries are enough to take a lot of your time to learn and build data analysis applications. Python libraries in data analysis Python is a multi-platform, general purpose programming language that can run on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and the .NET virtual machines as well. It has a powerful standard library. In addition, it has many libraries for data analysis: Pylearn2, Hebel, Pybrain, Pattern, MontePython, and MILK. We will cover some common Python data analysis libraries such as Numpy, Pandas, Matplotlib, PyMongo, and scikit-learn. Now, to help you getting started, I will briefly present an overview of each library for those who are less familiar with the scientific Python stack. NumPy One of the fundamental packages used for scientific computing with Python is Numpy. Among other things, it contains the following: A powerful N-dimensional array object Sophisticated (broadcasting) functions for performing array computations Tools for integrating C/C++ and Fortran code Useful linear algebra operations, Fourier transforms, and random number capabilities. Besides this, it can also be used as an efficient multidimensional container of generic data. Arbitrary data types can be defined and integrated with a wide variety of databases. Pandas Pandas is a Python package that supports rich data structures and functions for analyzing data and is developed by the PyData Development Team. It is focused on the improvement of Python's data libraries. 
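Before listing the main pieces of Pandas below, here is a minimal sketch that places the NumPy array object described above next to the labelled structures Pandas adds on top of it. The values are illustrative only and reuse the gold-price figures from the beginning of this article:

import numpy as np
import pandas as pd

# NumPy: an N-dimensional array object with broadcasting
prices = np.array([1.152, 1.155, 1.158])
print(prices * 1000)  # broadcasting multiplies every element at once

# Pandas: labelled data structures built on top of NumPy arrays
series = pd.Series(prices, index=["day1", "day2", "day3"], name="gold_price")
frame = pd.DataFrame({"gold_price": series})
print(frame.describe())  # summary statistics per column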
Pandas consists of the following things:

A set of labelled array data structures, the primary of which are Series, DataFrame, and Panel
Index objects enabling both simple axis indexing and multilevel/hierarchical axis indexing
An integrated group by engine for aggregating and transforming datasets
Date range generation and custom date offsets
Input/output tools that load and save data from flat files or the PyTables/HDF5 format
Memory-optimized versions of the standard data structures
Moving window statistics and static and moving window linear/panel regression

Because of these features, Pandas is an ideal tool for systems that need complex data structures or high-performance time series functions, such as financial data analysis applications.

Matplotlib

Matplotlib is the single most used Python package for 2D graphics. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats: line plots, contour plots, scatter plots, or Basemap plots. It comes with a set of default settings but allows customizing all kinds of properties; at the same time, we can easily create a chart using the defaults for almost every property in Matplotlib.

PyMongo

MongoDB is a type of NoSQL database. It is highly scalable, robust, and perfect to work with JavaScript-based web applications, because we can store data as JSON documents and use flexible schemas. PyMongo is a Python distribution containing tools for working with MongoDB. Many tools have also been written on top of PyMongo to add more features, such as MongoKit, Humongolus, MongoAlchemy, and Ming.

scikit-learn

scikit-learn is an open source machine learning library for the Python programming language. It supports various machine learning models, such as classification, regression, and clustering algorithms, and it interoperates with the Python numerical and scientific libraries NumPy and SciPy. The latest scikit-learn version is 0.16.1, published in April 2015.

Summary

In this article, we presented three main points. First, we figured out the relationship between raw data, information, and knowledge. Second, because of the contribution data analysis makes to our lives, we continued with an overview of data analysis and its processing steps. Finally, we introduced a few common libraries that are useful for practical data analysis applications; among those, we will focus on the Python libraries for data analysis.

Resources for Article:

Further resources on this subject:

Exploiting Services with Python [Article]
Basics of Jupyter Notebook and Python [Article]
How to do Machine Learning with Python [Article]
Fingerprint detection using OpenCV 3

Packt
07 Oct 2015
11 min read
In this article by Joseph Howse, Quan Hua, Steven Puttemans, and Utkarsh Sinha, the authors of OpenCV Blueprints, we delve into the aspect of fingerprint detection using OpenCV. (For more resources related to this topic, see here.)

Fingerprint identification, how is it done?

We have already discussed the use of the first biometric, which is the face of the person trying to log in to the system. However, since we mentioned that using a single biometric can be quite risky, we suggest adding secondary biometric checks to the system, like the fingerprint of a person. There are many off-the-shelf fingerprint scanners that are quite cheap and return the scanned image. However, you will still have to write your own registration software for these scanners, which can be done using OpenCV. Examples of such fingerprint images can be found below.

Examples of a single individual's thumb fingerprint in different scanning positions

This dataset can be downloaded from the FVC2002 competition website released by the University of Bologna. The website (http://bias.csr.unibo.it/fvc2002/databases.asp) contains 4 databases of fingerprints available for public download, with the following structure:

Four fingerprint capturing devices, DB1 - DB4
For each device, the prints of 10 individuals are available
For each person, 8 different positions of prints were recorded

We will use this publicly available dataset to build our system upon. We will focus on the first capturing device, using up to 4 fingerprints of each individual for training the system and making an average descriptor of the fingerprint. Then we will use the other 4 fingerprints to evaluate our system and make sure that the person is still recognized by our system. You could apply exactly the same approach on the data grabbed from the other devices if you would like to investigate the difference between a system that captures almost binary images and one that captures grayscale images. However, we will provide techniques for doing the binarization yourself.

Implement the approach in OpenCV 3

The complete fingerprint software for processing fingerprints derived from a fingerprint scanner can be found at https://github.com/OpenCVBlueprints/OpenCVBlueprints/tree/master/chapter_6/source_code/fingerprint/fingerprint_process/. In this subsection, we will describe how you can implement this approach in the OpenCV interface. We will start by grabbing the image from the fingerprint system and applying binarization. This will enable us to remove any undesired noise from the image, as well as help us make the contrast better between the skin and the wrinkled surface of the finger.

// Start by reading in an image
Mat input = imread("/data/fingerprints/image1.png", CV_LOAD_GRAYSCALE);
// Binarize the image, through local thresholding
Mat input_binary;
threshold(input, input_binary, 0, 255, CV_THRESH_BINARY | CV_THRESH_OTSU);

The Otsu thresholding will automatically choose the best generic threshold for the image to obtain a good contrast between foreground and background information. This is because the image contains a bimodal distribution of pixel values (which means that the image has a two-peak histogram). For such an image, we can approximately take a value in the middle of those peaks as the threshold value (for images that are not bimodal, binarization won't be accurate). Otsu allows us to avoid using a fixed threshold value, making the system more general for any capturing device.
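For reference, and as standard background rather than something taken from the original article, Otsu's method can be described as choosing the threshold t that maximizes the between-class variance of the resulting foreground/background split:

\sigma_b^2(t) = \omega_0(t)\,\omega_1(t)\,[\mu_0(t) - \mu_1(t)]^2

where \omega_0(t) and \omega_1(t) are the fractions of pixels below and above t, and \mu_0(t) and \mu_1(t) are their mean intensities. For a bimodal histogram, this maximum lies between the two peaks, which is exactly the behavior we rely on here.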
However, we do acknowledge that if you have only a single capturing device, then playing around with a fixed threshold value could result in a better image for that specific setup. The result of the thresholding can be seen below. In order to make the thinning from the next skeletization step as effective as possible, we need the inverse binary image.

Comparison between grayscale and binarized fingerprint image

Once we have a binary image, we are actually already set to calculate our feature points and feature point descriptors. However, in order to improve the process a bit more, we suggest skeletizing the image. This will create more unique and stronger interest points. The following piece of code applies skeletization on top of the binary image. The skeletization is based on the Zhang-Suen line thinning approach. Special thanks to @bsdNoobz of the OpenCV Q&A forum, who supplied this iteration approach.

#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>

using namespace std;
using namespace cv;

// Perform a single thinning iteration, which is repeated until the skeletization is finalized
void thinningIteration(Mat& im, int iter)
{
    Mat marker = Mat::zeros(im.size(), CV_8UC1);
    for (int i = 1; i < im.rows-1; i++)
    {
        for (int j = 1; j < im.cols-1; j++)
        {
            uchar p2 = im.at<uchar>(i-1, j);
            uchar p3 = im.at<uchar>(i-1, j+1);
            uchar p4 = im.at<uchar>(i, j+1);
            uchar p5 = im.at<uchar>(i+1, j+1);
            uchar p6 = im.at<uchar>(i+1, j);
            uchar p7 = im.at<uchar>(i+1, j-1);
            uchar p8 = im.at<uchar>(i, j-1);
            uchar p9 = im.at<uchar>(i-1, j-1);

            int A = (p2 == 0 && p3 == 1) + (p3 == 0 && p4 == 1) +
                    (p4 == 0 && p5 == 1) + (p5 == 0 && p6 == 1) +
                    (p6 == 0 && p7 == 1) + (p7 == 0 && p8 == 1) +
                    (p8 == 0 && p9 == 1) + (p9 == 0 && p2 == 1);
            int B = p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9;
            int m1 = iter == 0 ? (p2 * p4 * p6) : (p2 * p4 * p8);
            int m2 = iter == 0 ? (p4 * p6 * p8) : (p2 * p6 * p8);

            if (A == 1 && (B >= 2 && B <= 6) && m1 == 0 && m2 == 0)
                marker.at<uchar>(i,j) = 1;
        }
    }
    im &= ~marker;
}

// Function for thinning any given binary image within the range of 0-255. If not, you should first make sure that your image has this range preset and configured!
void thinning(Mat& im)
{
    // Enforce the range to be in between 0 - 255
    im /= 255;

    Mat prev = Mat::zeros(im.size(), CV_8UC1);
    Mat diff;
    do {
        thinningIteration(im, 0);
        thinningIteration(im, 1);
        absdiff(im, prev, diff);
        im.copyTo(prev);
    }
    while (countNonZero(diff) > 0);

    im *= 255;
}

The code above can then simply be applied to our previous steps by calling the thinning function on top of the binary image we generated earlier. The code for this is:

// Apply thinning algorithm
Mat input_thinned = input_binary.clone();
thinning(input_thinned);

This will result in the following output:

Comparison between binarized and thinned fingerprint image using skeletization techniques

Once we have this skeleton image, the next step is to look for crossing points on the ridges of the fingerprint, which are called minutiae points. We can do this with a keypoint detector that looks at a large change in local contrast, like the Harris corner detector. Since the Harris corner detector is able to detect both strong corners and edges, it is ideal for the fingerprint problem, where the most important minutiae are short edges and bifurcations, the positions where edges come together. More information about minutiae points and Harris corner detection can be found in the following publications:

Ross, Arun A., Jidnya Shah, and Anil K. Jain.
"Toward reconstructing fingerprints from minutiae points." Defense and Security. International Society for Optics and Photonics, 2005. Harris, Chris, and Mike Stephens. "A combined corner and edge detector." Alvey vision conference. Vol. 15. 1988. Calling the Harris Corner operation on a skeletonized and binarized image in OpenCV is quite straightforward. The Harris corners are stored as positions corresponding in the image with their cornerness response value. If we want to detect points with a certain cornerness, than we should simply threshold the image. Mat harris_corners, harris_normalised; harris_corners = Mat::zeros(input_thinned.size(), CV_32FC1); cornerHarris(input_thinned, harris_corners, 2, 3, 0.04, BORDER_DEFAULT); normalize(harris_corners, harris_normalised, 0, 255, NORM_MINMAX, CV_32FC1, Mat()); We now have a map with all the available corner responses rescaled to the range of [0 255] and stored as float values. We can now manually define a threshold, that will generate a good amount of keypoints for our application. Playing around with this parameter could improve performance in other cases. This can be done by using the following code snippet: float threshold = 125.0; vector<KeyPoint> keypoints; Mat rescaled; convertScaleAbs(harris_normalised, rescaled); Mat harris_c(rescaled.rows, rescaled.cols, CV_8UC3); Mat in[] = { rescaled, rescaled, rescaled }; int from_to[] = { 0,0, 1,1, 2,2 }; mixChannels( in, 3, &harris_c, 1, from_to, 3 ); for(int x=0; x<harris_normalised.cols; x++){ for(int y=0; y<harris_normalised.rows; y++){ if ( (int)harris_normalised.at<float>(y, x) > threshold ){ // Draw or store the keypoint location here, just like you decide. In our case we will store the location of the keypoint circle(harris_c, Point(x, y), 5, Scalar(0,255,0), 1); circle(harris_c, Point(x, y), 1, Scalar(0,0,255), 1); keypoints.push_back( KeyPoint (x, y, 1) ); } } } Comparison between thinned fingerprint and Harris corner response, as well as the selected Harris corners. Now that we have a list of keypoints we will need to create some of formal descriptor of the local region around that keypoint to be able to uniquely identify it among other keypoints. Chapter 3, Recognizing facial expressions with machine learning, discusses in more detail the wide range of keypoints out there. In this article, we will mainly focus on the process. Feel free to adapt the interface with other keypoint detectors and descriptors out there, for better or for worse performance. Since we have an application where the orientation of the thumb can differ (since it is not a fixed position), we want a keypoint descriptor that is robust at handling these slight differences. One of the mostly used descriptors for that is the SIFT descriptor, which stands for scale invariant feature transform. However SIFT is not under a BSD license and can thus pose problems to use in commercial software. A good alternative in OpenCV is the ORB descriptor. In OpenCV you can implement it in the following way. Ptr<Feature2D> orb_descriptor = ORB::create(); Mat descriptors; orb_descriptor->compute(input_thinned, keypoints, descriptors); This will enable us to calculate only the descriptors using the ORB approach, since we already retrieved the location of the keypoints using the Harris corner approach. At this point we can retrieve a descriptor for each detected keypoint of any given fingerprint. The descriptors matrix will contain a row for each keypoint containing the representation. 
Let us now start from the case where we have only a single reference image for each fingerprint. In that case, we will have a database containing a set of feature descriptors for the training persons in the database. We then have a single new entry, consisting of multiple descriptors for the keypoints found at registration time. We now have to match these descriptors to the descriptors stored in the database to see which one has the best match. The simplest way is to perform brute force matching using the Hamming distance criterion between descriptors of different keypoints.

// Imagine we have a vector of single entry descriptors as a database
// We will still need to fill those once we compare everything, by using the code snippets above
vector<Mat> database_descriptors;
Mat current_descriptors;

// Create the matcher interface
Ptr<DescriptorMatcher> matcher = DescriptorMatcher::create("BruteForce-Hamming");

// Now loop over the database and start the matching
vector< vector< DMatch > > all_matches;
for(int entry=0; entry<database_descriptors.size(); entry++){
    vector< DMatch > matches;
    matcher->match(database_descriptors[entry], current_descriptors, matches);
    all_matches.push_back(matches);
}

We now have all the matches stored as DMatch objects. This means that for each matching couple, we will have the original keypoint, the matched keypoint, and a floating point score between both matches, representing the distance between the matched points. The idea of finding the best match seems pretty straightforward. We take a look at the number of matches that have been returned by the matching process and weigh them by their Euclidean distance in order to add some certainty. We then look for the matching process that yielded the biggest score. This will be our best match, and the match we want to return as the selected one from the database. If you want to avoid an imposter being assigned the best matching score, you can again add a manual threshold on top of the scoring, so that matches that are not good enough are ignored. However, it is possible, and should be taken into consideration, that if you set this threshold too high, people whose fingerprint has changed slightly will be rejected by the system, for example when someone cuts their finger and thus changes their pattern too drastically.

Fingerprint matching process visualized

To summarize, we saw how to detect fingerprints and implement the approach using OpenCV 3.

Resources for Article:

Further resources on this subject:

Making subtle color shifts with curves [article]
Tracking Objects in Videos [article]
Hand Gesture Recognition Using a Kinect Depth Sensor [article]
Creating Interactive Spreadsheets using Tables and Slicers

Packt
06 Oct 2015
10 min read
In this article by Hernan D Rojas, author of the book Data Analysis and Business Modeling with Excel 2013, we introduce additional material to the advanced Excel developer for the presentation stage of the data analysis life cycle. What you will learn in this article is how to leverage Excel's new features to add interactivity to your spreadsheets. (For more resources related to this topic, see here.)

What are slicers?

Slicers are essentially buttons that automatically filter your data. Excel has always been able to filter data, but slicers are more practical and visually appealing. Let's compare the two in the following steps:

First, fire up Excel 2013, and create a new spreadsheet. Manually enter the data, as shown in the following figure:
Highlight cells A1 through E11, and press Ctrl + T to convert the data to an Excel table. Converting your data to a table is the first step that you need to take in order to introduce slicers in your spreadsheet.
Let's filter our data using the default filtering capabilities that we are already familiar with. Filter the Type column and only select the rows that have the value equal to SUV, as shown in the following figure.
Click on the OK button to apply the filter to the table. You will now be left with four rows that have the Type column equal to SUV.

Using a typical Excel filter, we were able to filter our data and show only the SUV cars. We can then continue to filter by other columns, such as MPG (miles per gallon) and Price. How can we accomplish the same results using slicers? Continue reading this article to find out.

How to create slicers?

In this section, we will go through the simple but powerful steps that are required to build slicers. After we create our first slicer, make sure that you compare and contrast the old way of filtering versus the new way of filtering data.

Remove the filter that we just applied to our table by clicking on the option named Clear Filter From "Type", as shown in the following figure:
With your Excel table selected, click on the TABLE TOOLS tab.
Click on the Insert Slicer button.
In the Insert Slicers dialog box, select the Type checkbox, and click on the OK button, as shown in the following screenshot:
You should now have a slicer that looks similar to the one in the following figure. Notice that you can resize and move the slicer anywhere you want in the spreadsheet.
Click on the Sedan filter in the slicer that we built in the previous step. Wow! The data is filtered, and only the rows where the Type column is equal to Sedan are shown in the results.
Click on the Sport filter and see what happens. The data is now filtered where the Type column is equal to Sport. Notice that the previous filter of Sedan was removed as soon as we clicked on the Sport filter.
What if we want to filter the data by both Sport and Sedan? We can just highlight both filters with our mouse, or click on Sedan, press Ctrl, and then click on the Sport filter. The end result will look like this:
To clear the filter, just click on the Clear Filter button.

Do you see the advantage of slicers over filters? Yes, of course, they are simply better. Filtering between Sedan, Sport, or SUV is very easy and convenient. It certainly takes fewer keystrokes, and the feedback is instant. Think about the end users interacting with your spreadsheet. At the touch of a button, they can answer questions that arise in their heads. This is what you call an interactive spreadsheet or an interactive dashboard.
Styling slicers

There are not many options to style slicers, but Excel does give you a decent number of color schemes that you can experiment with:

With the Type slicer selected, navigate to the SLICER TOOLS tab, as shown in the following figure:
Click on the various slicer styles available to get a feel for what Excel offers.

Adding multiple slicers

You are able to add multiple slicers and multiple charts in one spreadsheet. Why would we do this? Well, this is the beginning of dashboard creation. Let's expand on the example we have just been working on, and see how we can turn raw data into an interactive dashboard:

Let's start by creating slicers for # of Passengers and MPG, as shown in the following figure:
Rename Sheet1 as Data, and create a new sheet called Dashboard, as shown here:
Move the three slicers by cutting and pasting them from the Data sheet to the Dashboard sheet.
Create a line chart using the columns Company and MPG, as shown in the following figure:
Create a bar chart using the columns Type and MPG.
Create another bar chart with the columns Company and # of Passengers, as shown in the following figure. These types of charts are technically called column charts, but you can get away with calling them bar charts.
Now, move the three charts from the Data tab to the Dashboard tab. Right-click on the bar chart, and select the Move Chart… option.
In the Move Chart dialog box, change the Object in: parameter from Data to Dashboard, and then click on the OK button.
Move the other two charts to the Dashboard tab so that there are no more charts in the Data tab.
Rearrange the charts and slicers so that they look as close as possible to the ones in the following figure. As you can see, this tab is starting to look like a dashboard.
The Type slicer will look better if Sedan, Sport, and SUV are laid out horizontally. Select the Type slicer, and click on the SLICER TOOLS menu option.
Change the Columns parameter from 1 to 3, as shown in the following figure. This is how we are able to change the layout or shape of the slicer.
Resize the Type slicer so that it looks like the one in the following figure:

Clearing filters

You can click on one or more filters in the dashboard that we just created. Very cool! Every time we select a filter, all three charts that we created get updated. This again is called adding interactivity to your spreadsheets. It allows the end users of your dashboard to interact with your data and perform their own analysis. If you notice, there is really no good way of removing multiple filters at once. For example, if you select Sedans that have an MPG greater than or equal to 30, how would you remove all of the filters? You would have to clear the filters from the Type slicer and then from the MPG slicer. This can be a little tedious for your end user, and you will want to avoid it at any cost. The next steps will show you how to create a button using VBA that will clear all of the filters in a flash:

Press Alt + F11, and create a sub procedure called Clear_Slicer, as shown in the following figure. This code will basically find all of the filters that you have selected and then clear them for you one at a time. The next step is to bind this code to a button:

Sub Clear_Slicer()
    ' Declare Variables
    Dim cache As SlicerCache
    ' Loop through each filter
    For Each cache In ActiveWorkbook.SlicerCaches
        ' clear filter
        cache.ClearManualFilter
    Next cache
End Sub

Select the DEVELOPER tab, and click on the Insert button.
In the pop-up menu called Form Controls, select the Button option. Now, click anywhere on the sheet, and you will get a dialog box that looks like the following figure. This is where we are going to assign a macro to the button. This means that whenever you click on the button we are creating, Excel will run the macro of our choice. Since we have already created a macro called Clear_Slicer, it makes sense to select this macro and then click on the OK button.
Change the text of the button to Clear All Filters, and resize it so that it looks like this:
Adjust the properties of the button by right-clicking on the button and selecting the Format Control… option. Here, you can change the font size and the color of your button label.
Now, select a bunch of filters, and click on our new shiny button.

Yes, that was pretty cool. The most important part is that it is now even easier to "reset" your dashboard and start a brand new analysis. What do I mean by starting a brand new analysis? In general, when a user initially starts using your dashboard, he or she will click on the filters aimlessly. The users do this just to figure out how to mechanically use the dashboard. Then, after they get the hang of it, they want to start with a clean slate and perform some data analysis. If we did not have the Clear All Filters button, the users would have to figure out how to clear every slicer one at a time to start over. The worst case scenario is when the user does not realize which filters are turned on and which are turned off. Now, do not laugh at this situation, or assume that your end user is not as smart as you are. This just means that you need to lower the learning curve of your dashboard and make it easy to use. With the addition of the clear button, the end user can think of a question, use the slicers to answer it, click on the clear button, and start the process all over again. These little details are what will separate you from the average Excel developer.

Summary

The aim of this article was to give you ideas and tools to present your data artistically. Whether you like it or not, sometimes a better looking analysis will trump a better but less attractive one. Excel gives you the tools to never be on the short end of the stick and to always be able to present a visually stunning analysis. You now know how to create Excel slicers, and you learned how to bind them to your data. Users of your spreadsheet can now slice and dice your data to answer multiple questions. Executives like flashy visualizations, and when you combine them with a strong analysis, you have a very powerful combination. In this article, we also went through a variety of strategies to customize slicers and chart elements. These little changes made to your dashboard will make it stand out and help you get your message across. Excel, as always, is an invaluable tool that gives you everything necessary to overcome any data challenge you might come across. As I tell all my students, the key to becoming better is simply to practice, practice, and practice.

Resources for Article:

Further resources on this subject:

Introduction to Stata and Data Analytics [article]
Getting Started with Apache Spark DataFrames [article]
Big Data Analysis (R and Hadoop) [article]
Hand Gesture Recognition Using a Kinect Depth Sensor

Packt
06 Oct 2015
26 min read
In this article by Michael Beyeler, author of the book OpenCV with Python Blueprints, the goal is to develop an app that detects and tracks simple hand gestures in real time using the output of a depth sensor, such as that of a Microsoft Kinect 3D sensor or an Asus Xtion. The app will analyze each captured frame to perform the following tasks:

Hand region segmentation: The user's hand region will be extracted in each frame by analyzing the depth map output of the Kinect sensor, which is done by thresholding, applying some morphological operations, and finding connected components
Hand shape analysis: The shape of the segmented hand region will be analyzed by determining contours, convex hull, and convexity defects
Hand gesture recognition: The number of extended fingers will be determined based on the hand contour's convexity defects, and the gesture will be classified accordingly (with no extended finger corresponding to a fist, and five extended fingers corresponding to an open hand)

Gesture recognition is an ever popular topic in computer science. This is because it not only enables humans to communicate with machines (human-machine interaction or HMI), but also constitutes the first step for machines to begin understanding human body language. With affordable sensors, such as Microsoft Kinect or Asus Xtion, and open source software such as OpenKinect and OpenNI, it has never been easier to get started in the field yourself. So what shall we do with all this technology?

The beauty of the algorithm that we are going to implement in this article is that it works well for a number of hand gestures, yet is simple enough to run in real time on a generic laptop. And if we want, we can easily extend it to incorporate more complicated hand pose estimations. The end product looks like this: no matter how many fingers of my left hand I extend, the algorithm correctly segments the hand region (white), draws the corresponding convex hull (the green line surrounding the hand), finds all convexity defects that belong to the spaces between fingers (large green points) while ignoring others (small red points), and infers the correct number of extended fingers (the number in the bottom-right corner), even for a fist.

This article assumes that you have a Microsoft Kinect 3D sensor installed. Alternatively, you may install Asus Xtion or any other depth sensor for which OpenCV has built-in support. First, install OpenKinect and libfreenect from http://www.openkinect.org/wiki/Getting_Started. Then, you need to build (or rebuild) OpenCV with OpenNI support. The GUI used in this article will again be designed with wxPython, which can be obtained from http://www.wxpython.org/download.php.

Planning the app

The final app will consist of the following modules and scripts:

gestures: A module that consists of an algorithm for recognizing hand gestures. We separate this algorithm from the rest of the application so that it can be used as a standalone module without the need for a GUI.
gestures.HandGestureRecognition: A class that implements the entire process flow of hand gesture recognition. It accepts a single-channel depth image (acquired from the Kinect depth sensor) and returns an annotated RGB color image with an estimated number of extended fingers.
gui: A module that provides a wxPython GUI application to access the capture device and display the video feed. In order to have it access the Kinect depth sensor instead of a generic camera, we will have to extend some of the base class functionality.
gui.BaseLayout: A generic layout from which more complicated layouts can be built.

Setting up the app

Before we can get to the nitty-gritty of our gesture recognition algorithm, we need to make sure that we can access the Kinect sensor and display a stream of depth frames in a simple GUI.

Accessing the Kinect 3D sensor

Accessing Microsoft Kinect from within OpenCV is not much different from accessing a computer's webcam or camera device. The easiest way to integrate a Kinect sensor with OpenCV is by using an OpenKinect module called freenect. For installation instructions, see the preceding information box. The following code snippet grants access to the sensor using cv2.VideoCapture:

import cv2
import freenect

device = cv2.cv.CV_CAP_OPENNI
capture = cv2.VideoCapture(device)

On some platforms, the first call to cv2.VideoCapture fails to open a capture channel. In this case, we provide a workaround by opening the channel ourselves:

if not(capture.isOpened(device)):
    capture.open(device)

If you want to connect to your Asus Xtion, the device variable should be assigned the cv2.cv.CV_CAP_OPENNI_ASUS value instead. In order to give our app a fair chance to run in real time, we will limit the frame size to 640 x 480 pixels:

capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640)
capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480)

If you are using OpenCV 3, the constants you are looking for might be called cv2.CAP_PROP_FRAME_WIDTH and cv2.CAP_PROP_FRAME_HEIGHT.

The read() method of cv2.VideoCapture is inappropriate when we need to synchronize a set of cameras or a multihead camera, such as a Kinect. In this case, we should use the grab() and retrieve() methods instead. An even easier way when working with OpenKinect is to use the sync_get_depth() and sync_get_video() methods. For the purpose of this article, we will need only the Kinect's depth map, which is a single-channel (grayscale) image in which each pixel value is the estimated distance from the camera to a particular surface in the visual scene. The latest frame can be grabbed via this code:

depth, timestamp = freenect.sync_get_depth()

The preceding code returns both the depth map and a timestamp. We will ignore the latter for now. By default, the map is in 11-bit format, which is inadequate to be visualized with cv2.imshow right away. Thus, it is a good idea to convert the image to 8-bit precision first. In order to reduce the range of depth values in the frame, we will clip the maximal distance to a value of 1,023 (or 2**10-1).
This will get rid of values that correspond either to noise or distances that are far too large to be of interest to us:

np.clip(depth, 0, 2**10-1, depth)
depth >>= 2

Then, we convert the image into 8-bit format and display it:

depth = depth.astype(np.uint8)
cv2.imshow("depth", depth)

Running the app

In order to run our app, we will need to execute a main function routine that accesses the Kinect, generates the GUI, and executes the main loop of the app:

import numpy as np
import wx
import cv2
import freenect

from gui import BaseLayout
from gestures import HandGestureRecognition

def main():
    device = cv2.cv.CV_CAP_OPENNI
    capture = cv2.VideoCapture()
    if not(capture.isOpened()):
        capture.open(device)

    capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640)
    capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480)

We will design a suitable layout (KinectLayout) for the current project:

    # start graphical user interface
    app = wx.App()
    layout = KinectLayout(None, -1, 'Kinect Hand Gesture Recognition', capture)
    layout.Show(True)
    app.MainLoop()

The Kinect GUI

The layout chosen for the current project (KinectLayout) is as plain as it gets. It should simply display the live stream of the Kinect depth sensor at a comfortable frame rate of 10 frames per second. Therefore, there is no need to further customize BaseLayout:

class KinectLayout(BaseLayout):
    def _create_custom_layout(self):
        pass

The only parameter that needs to be initialized this time is the recognition class. This will be useful in just a moment:

    def _init_custom_layout(self):
        self.hand_gestures = HandGestureRecognition()

Instead of reading a regular camera frame, we need to acquire a depth frame via the freenect method sync_get_depth(). This can be achieved by overriding the following method:

    def _acquire_frame(self):

As mentioned earlier, by default, this function returns a single-channel depth image with 11-bit precision and a timestamp. However, we are not interested in the timestamp, and we simply pass on the frame if the acquisition was successful:

        frame, _ = freenect.sync_get_depth()
        # return success if frame size is valid
        if frame is not None:
            return (True, frame)
        else:
            return (False, frame)

The rest of the visualization pipeline is handled by the BaseLayout class. We only need to make sure that we provide a _process_frame method. This method accepts a depth image with 11-bit precision, processes it, and returns an annotated 8-bit RGB color image. Conversion to a regular grayscale image is the same as mentioned in the previous subsection:

    def _process_frame(self, frame):
        # clip max depth to 1023, convert to 8-bit grayscale
        np.clip(frame, 0, 2**10 - 1, frame)
        frame >>= 2
        frame = frame.astype(np.uint8)

The resulting grayscale image can then be passed to the hand gesture recognizer, which will return the estimated number of extended fingers (num_fingers) and the annotated RGB color image mentioned earlier (img_draw):

        num_fingers, img_draw = self.hand_gestures.recognize(frame)

In order to simplify the segmentation task of the HandGestureRecognition class, we will instruct the user to place their hand in the center of the screen.
To provide a visual aid for this, let's draw a rectangle around the image center and highlight the center pixel of the image in orange:

        height, width = frame.shape[:2]
        cv2.circle(img_draw, (width/2, height/2), 3, [255, 102, 0], 2)
        cv2.rectangle(img_draw, (width/3, height/3), (width*2/3, height*2/3), [255, 102, 0], 2)

In addition, we print num_fingers on the screen:

        cv2.putText(img_draw, str(num_fingers), (30, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255))
        return img_draw

Tracking hand gestures in real time

The bulk of the work is done by the HandGestureRecognition class, especially by its recognize method. This class starts off with a few parameter initializations, which will be explained and used later:

class HandGestureRecognition:
    def __init__(self):
        # maximum depth deviation for a pixel to be considered within range
        self.abs_depth_dev = 14
        # cut-off angle (deg): everything below this is a convexity point that belongs to two extended fingers
        self.thresh_deg = 80.0

The recognize method is where the real magic takes place. This method handles the entire process flow, from the raw grayscale image all the way to a recognized hand gesture. It implements the following procedure:

It extracts the user's hand region by analyzing the depth map (img_gray) and returning a hand region mask (segment):

    def recognize(self, img_gray):
        segment = self._segment_arm(img_gray)

It performs contour analysis on the hand region mask (segment). Then, it returns the largest contour area found in the image (contours) and any convexity defects (defects):

        [contours, defects] = self._find_hull_defects(segment)

Based on the contours found and the convexity defects, it detects the number of extended fingers (num_fingers) in the image. Then, it annotates the output image (img_draw) with contours, defect points, and the number of extended fingers:

        img_draw = cv2.cvtColor(img_gray, cv2.COLOR_GRAY2RGB)
        [num_fingers, img_draw] = self._detect_num_fingers(contours, defects, img_draw)

It returns the estimated number of extended fingers (num_fingers) as well as the annotated output image (img_draw):

        return (num_fingers, img_draw)

Hand region segmentation

The automatic detection of an arm, and later the hand region, could be designed to be arbitrarily complicated, maybe by combining information about the shape and color of an arm or hand. However, using skin color as a determining feature to find hands in visual scenes might fail terribly in poor lighting conditions or when the user is wearing gloves. Instead, we choose to recognize the user's hand by its shape in the depth map. Allowing hands of all sorts to be present in any region of the image would unnecessarily complicate the mission of this article, so we make two simplifying assumptions:

We will instruct the user of our app to place their hand in front of the center of the screen, orienting their palm roughly parallel to the orientation of the Kinect sensor so that it is easier to identify the corresponding depth layer of the hand.
We will also instruct the user to sit roughly 1 to 2 meters away from the Kinect, and to slightly extend their arm in front of their body so that the hand will end up in a slightly different depth layer than the arm. However, the algorithm will still work even if the full arm is visible.

In this way, it will be relatively straightforward to segment the image based on the depth layer alone. Otherwise, we would have to come up with a hand detection algorithm first, which would unnecessarily complicate our mission.
If you feel adventurous, feel free to do this on your own.

Finding the most prominent depth of the image center region

Once the hand is placed roughly in the center of the screen, we can start finding all image pixels that lie on the same depth plane as the hand. For this, we simply need to determine the most prominent depth value of the center region of the image. The simplest approach would be as follows: look only at the depth value of the center pixel:

width, height = depth.shape
center_pixel_depth = depth[width/2, height/2]

Then, create a mask in which all pixels at a depth of center_pixel_depth are white and all others are black:

import numpy as np
depth_mask = np.where(depth == center_pixel_depth, 255, 0).astype(np.uint8)

However, this approach will not be very robust, because chances are that:

Your hand is not placed perfectly parallel to the Kinect sensor
Your hand is not perfectly flat
The Kinect sensor values are noisy

Therefore, different regions of your hand will have slightly different depth values. The _segment_arm method takes a slightly better approach, that is, looking at a small neighborhood in the center of the image and determining the median (meaning the most prominent) depth value. First, we find the center (for example, 21 x 21 pixels) region of the image frame:

def _segment_arm(self, frame):
    """ segments the arm region based on depth """
    center_half = 10  # half-width of 21 is 21/2-1
    lowerHeight = self.height/2 - center_half
    upperHeight = self.height/2 + center_half
    lowerWidth = self.width/2 - center_half
    upperWidth = self.width/2 + center_half
    center = frame[lowerHeight:upperHeight, lowerWidth:upperWidth]

We can then reshape the depth values of this center region into a one-dimensional vector and determine the median depth value, med_val:

    med_val = np.median(center)

We can now compare med_val with the depth value of all pixels in the image and create a mask in which all pixels whose depth values are within a particular range [med_val-self.abs_depth_dev, med_val+self.abs_depth_dev] are white and all other pixels are black. However, for reasons that will be clear in a moment, let's paint the pixels gray instead of white:

    frame = np.where(abs(frame - med_val) <= self.abs_depth_dev, 128, 0).astype(np.uint8)

The result will look like this:

Applying morphological closing to smoothen the segmentation mask

A common problem with segmentation is that a hard threshold typically results in small imperfections (that is, holes, as in the preceding image) in the segmented region. These holes can be alleviated using morphological opening and closing. Opening removes small objects from the foreground (assuming that the objects are bright on a dark background), whereas closing removes small holes (dark regions). This means that we can get rid of the small black regions in our mask by applying morphological closing (dilation followed by erosion) with a small 3 x 3 pixel kernel:

    kernel = np.ones((3, 3), np.uint8)
    frame = cv2.morphologyEx(frame, cv2.MORPH_CLOSE, kernel)

The result looks a lot smoother, as follows: Notice, however, that the mask still contains regions that do not belong to the hand or arm, such as what appears to be one of my knees on the left and some furniture on the right. These objects just happen to be on the same depth layer as my arm and hand. If possible, we could now combine the depth information with another descriptor, maybe a texture-based or skeleton-based hand classifier, that would weed out all non-skin regions.
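As a quick aside on the closing operation used above, and purely as an illustrative sketch rather than part of the app's code, the difference between opening and closing is easy to see on a tiny synthetic mask:

import numpy as np
import cv2

# A synthetic 9x9 mask: a white blob with a one-pixel hole, plus an isolated white speck
mask = np.zeros((9, 9), np.uint8)
mask[2:7, 2:7] = 255      # foreground blob
mask[4, 4] = 0            # small hole inside the blob
mask[0, 8] = 255          # small isolated object

kernel = np.ones((3, 3), np.uint8)

# Closing (dilation followed by erosion) fills the small hole
closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

# Opening (erosion followed by dilation) removes the isolated speck
opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

print(closed[4, 4])  # 255: the hole is gone
print(opened[0, 8])  # 0: the speck is gone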
Finding connected components in a segmentation mask

An easier approach is to realize that, most of the time, hands are not connected to knees or furniture. We already know that the center region belongs to the hand, so we can simply apply cv2.floodFill to find all the connected image regions. Before we do this, we want to be absolutely certain that the seed point for the flood fill belongs to the right mask region. This can be achieved by assigning a grayscale value of 128 to the seed point. But we also want to make sure that the center pixel does not, by any coincidence, lie within a cavity that the morphological operation failed to close. So, let's set a small 7 x 7 pixel region to a grayscale value of 128 instead:

    small_kernel = 3
    frame[self.height/2-small_kernel : self.height/2+small_kernel, self.width/2-small_kernel : self.width/2+small_kernel] = 128

Because flood filling (as well as morphological operations) is potentially dangerous, the Python interface of later OpenCV versions requires specifying a mask that avoids flooding the entire image. This mask has to be 2 pixels wider and taller than the original image and has to be used in combination with the cv2.FLOODFILL_MASK_ONLY flag. It can be very helpful in constraining the flood filling to a small region of the image or a specific contour so that we need not connect two neighboring regions that should have never been connected in the first place. It's better to be safe than sorry, right? Ah, screw it! Today, we feel courageous! Let's make the mask entirely black:

    mask = np.zeros((self.height+2, self.width+2), np.uint8)

Then we can apply the flood fill to the center pixel (seed point) and paint all the connected regions white:

    flood = frame.copy()
    cv2.floodFill(flood, mask, (self.width/2, self.height/2), 255, flags=4 | (255 << 8))

At this point, it should be clear why we decided to start with a gray mask earlier. We now have a mask that contains white regions (arm and hand), gray regions (neither arm nor hand but other things in the same depth plane), and black regions (all others). With this setup, it is easy to apply a simple binary threshold to highlight only the relevant regions of the pre-segmented depth plane:

    ret, flooded = cv2.threshold(flood, 129, 255, cv2.THRESH_BINARY)

This is what the resulting mask looks like:

The resulting segmentation mask can now be returned to the recognize method, where it will be used as an input to _find_hull_defects as well as a canvas for drawing the final output image (img_draw).

Hand shape analysis

Now that we (roughly) know where the hand is located, we aim to learn something about its shape.

Determining the contour of the segmented hand region

The first step involves determining the contour of the segmented hand region. Luckily, OpenCV comes with a pre-canned version of such an algorithm, cv2.findContours. This function acts on a binary image and returns a set of points that are believed to be part of the contour. Because there might be multiple contours present in the image, it is possible to retrieve an entire hierarchy of contours:

def _find_hull_defects(self, segment):
    contours, hierarchy = cv2.findContours(segment, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

Furthermore, because we do not know which contour we are looking for, we have to make an assumption to clean up the contour result.
Since it is possible that some small cavities are left over even after the morphological closing—but we are fairly certain that our mask contains only the segmented area of interest—we will assume that the largest contour found is the one that we are looking for. Thus, we simply traverse the list of contours, calculate the contour area (cv2.contourArea), and store only the largest one (max_contour):

    max_contour = max(contours, key=cv2.contourArea)

Finding the convex hull of a contour area

Once we have identified the largest contour in our mask, it is straightforward to compute the convex hull of the contour area. The convex hull is basically the envelope of the contour area. If you think of all the pixels that belong to the contour area as a set of nails sticking out of a board, then the convex hull is the shape formed by a tight rubber band that surrounds all the nails. We can get the convex hull directly from our largest contour (max_contour):

    hull = cv2.convexHull(max_contour, returnPoints=False)

Because we now want to look at convexity defects in this hull, we are instructed by the OpenCV documentation to set the returnPoints optional flag to False. The convex hull drawn in green around a segmented hand region looks like this:

Finding convexity defects of a convex hull

As is evident from the preceding screenshot, not all points on the convex hull belong to the segmented hand region. In fact, all the fingers and the wrist cause severe convexity defects, that is, points of the contour that are far away from the hull. We can find these defects by looking at both the largest contour (max_contour) and the corresponding convex hull (hull):

    defects = cv2.convexityDefects(max_contour, hull)

The output of this function (defects) is a 4-tuple that contains start_index (the point of the contour where the defect begins), end_index (the point of the contour where the defect ends), farthest_pt_index (the point within the defect farthest from the convex hull), and fixpt_depth (the distance between the farthest point and the convex hull). We will make use of this information in just a moment when we reason about fingers. But for now, our job is done. The extracted contour (max_contour) and convexity defects (defects) can be passed to recognize, where they will be used as inputs to _detect_num_fingers:

    return (max_contour, defects)

Hand gesture recognition

What remains to be done is classifying the hand gesture based on the number of extended fingers. For example, if we find five extended fingers, we assume the hand to be open, whereas no extended fingers imply a fist. All that we are trying to do is count from zero to five and make the app recognize the corresponding number of fingers. This is actually trickier than it might seem at first. For example, people in Europe might count to three by extending their thumb, index finger, and middle finger. If you do that in the US, people there might get horrendously confused, because people do not tend to use their thumbs when signaling the number two. This might lead to frustration, especially in restaurants (trust me). If we could find a way to generalize these two scenarios—maybe by appropriately counting the number of extended fingers—we would have an algorithm that could teach simple hand gesture recognition not only to a machine but also (maybe) to an average waitress. As you might have guessed, the answer has to do with convexity defects. As mentioned earlier, extended fingers cause defects in the convex hull.
However, the inverse is not true; that is, not all convexity defects are caused by fingers! There might be additional defects caused by the wrist, as well as by the overall orientation of the hand or the arm. How can we distinguish between these different causes for defects?

Distinguishing between different causes for convexity defects

The trick is to look at the angle between the point within the defect that is farthest from the convex hull (farthest_pt_index) and the start and end points of the defect (start_index and end_index, respectively), as illustrated in the following screenshot: In this screenshot, the orange markers serve as a visual aid to center the hand in the middle of the screen, and the convex hull is outlined in green. Each red dot corresponds to the point farthest from the convex hull (farthest_pt_index) for every convexity defect detected. If we compare a typical angle that belongs to two extended fingers (such as θj) to an angle that is caused by general hand geometry (such as θi), we notice that the former is much smaller than the latter. This is obviously because humans can spread their fingers only a little, thus creating a narrow angle made by the farthest defect point and the neighboring fingertips. Therefore, we can iterate over all convexity defects and compute the angle between the said points. For this, we will need a utility function that calculates the angle (in radians) between two arbitrary, list-like vectors, v1 and v2:

def angle_rad(v1, v2):
    return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))

This method uses the cross product to compute the angle, rather than the standard way. The standard way of calculating the angle between two vectors v1 and v2 is by calculating their dot product and dividing it by the norm of v1 and the norm of v2. However, this method has two imperfections:

You have to manually avoid division by zero if either the norm of v1 or the norm of v2 is zero
The method returns relatively inaccurate results for small angles

Similarly, we provide a simple function to convert an angle from degrees to radians:

def deg2rad(angle_deg):
    return angle_deg/180.0*np.pi

Classifying hand gestures based on the number of extended fingers

What remains to be done is actually classifying the hand gesture based on the number of extended fingers. The _detect_num_fingers method will take as input the detected contour (contours), the convexity defects (defects), and a canvas to draw on (img_draw):

    def _detect_num_fingers(self, contours, defects, img_draw):

Based on these parameters, it will then determine the number of extended fingers. However, we first need to define a cut-off angle that can be used as a threshold to classify convexity defects as being caused by extended fingers or not. Except for the angle between the thumb and the index finger, it is rather hard to get anything close to 90 degrees, so anything close to that number should work. We do not want the cut-off angle to be too high, because that might lead to misclassifications:

        self.thresh_deg = 80.0

For simplicity, let's focus on the special cases first. If we do not find any convexity defects, it means that we possibly made a mistake during the convex hull calculation, or there are simply no extended fingers in the frame, so we return 0 as the number of detected fingers:

        if defects is None:
            return [0, img_draw]

But we can take this idea even further.
Due to the fact that arms are usually slimmer than hands or fists, we can assume that the hand geometry will always generate at least two convexity defects (which usually belong to the wrist). So if there are no additional defects, it implies that there are no extended fingers:

        if len(defects) <= 2:
            return [0, img_draw]

Now that we have ruled out all special cases, we can begin counting real fingers. If there are a sufficient number of defects, we will find a defect between every pair of fingers. Thus, in order to get the number right (num_fingers), we should start counting at 1:

        num_fingers = 1

Then we can start iterating over all convexity defects. For each defect, we will extract the four elements and draw its hull for visualization purposes:

        for i in range(defects.shape[0]):
            # each defect point is a 4-tuple
            start_idx, end_idx, farthest_idx, _ = defects[i, 0]
            start = tuple(contours[start_idx][0])
            end = tuple(contours[end_idx][0])
            far = tuple(contours[farthest_idx][0])

            # draw the hull
            cv2.line(img_draw, start, end, [0, 255, 0], 2)

Then we will compute the angle between the two edges from far to start and from far to end. If the angle is smaller than self.thresh_deg degrees, it means that we are dealing with a defect that is most likely caused by two extended fingers. In this case, we want to increment the number of detected fingers (num_fingers), and we draw the point in green. Otherwise, we draw the point in red:

            # if angle is below a threshold, defect point belongs
            # to two extended fingers
            if angle_rad(np.subtract(start, far), np.subtract(end, far)) < deg2rad(self.thresh_deg):
                # increment number of fingers
                num_fingers = num_fingers + 1
                # draw point as green
                cv2.circle(img_draw, far, 5, [0, 255, 0], -1)
            else:
                # draw point as red
                cv2.circle(img_draw, far, 5, [255, 0, 0], -1)

After iterating over all convexity defects, we pass the number of detected fingers and the assembled output image to the recognize method:

        return (min(5, num_fingers), img_draw)

This will make sure that we do not exceed the common number of fingers per hand. The result can be seen in the following screenshots: Interestingly, our app is able to detect the correct number of extended fingers in a variety of hand configurations. Defect points between extended fingers are easily classified as such by the algorithm, and others are successfully ignored.
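To tie the pieces together, here is a condensed, standalone sketch of the defect-angle counting logic described above. It is not the book's HandGestureRecognition class, just a minimal function assuming you already have a binary hand mask; the 80-degree cut-off mirrors the threshold used in this article:

import cv2
import numpy as np

def count_extended_fingers(mask, thresh_deg=80.0):
    """Estimate the number of extended fingers from a binary hand mask."""
    # findContours returns 2 or 3 values depending on the OpenCV version
    contours, _ = cv2.findContours(mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2:]
    if not contours:
        return 0
    contour = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull)
    if defects is None or len(defects) <= 2:
        return 0

    fingers = 1
    for i in range(defects.shape[0]):
        start_idx, end_idx, far_idx, _ = defects[i, 0]
        start = contour[start_idx][0]
        end = contour[end_idx][0]
        far = contour[far_idx][0]
        # angle at the defect point between the two edges toward the fingertips
        v1, v2 = start - far, end - far
        angle = np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))
        if angle < np.deg2rad(thresh_deg):
            fingers += 1
    return min(5, fingers)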
Summary

This article showed a relatively simple and yet surprisingly robust way of recognizing a variety of hand gestures by counting the number of extended fingers. The algorithm first shows how a task-relevant region of the image can be segmented using depth information acquired from a Microsoft Kinect 3D sensor, and how morphological operations can be used to clean up the segmentation result. By analyzing the shape of the segmented hand region, the algorithm comes up with a way to classify hand gestures based on the types of convexity defects found in the image. Once again, mastering our use of OpenCV to perform a desired task did not require us to produce a large amount of code. Instead, we were challenged to gain an important insight that made us use the built-in functionality of OpenCV in the most effective way possible. Gesture recognition is a popular but challenging field in computer science, with applications in a large number of areas, such as human-computer interaction, video surveillance, and even the video game industry. You can now use your advanced understanding of segmentation and structure analysis to build your own state-of-the-art gesture recognition system.

Resources for Article:

Further resources on this subject:

Tracking Faces with Haar Cascades [article]
Our First Machine Learning Method - Linear Classification [article]
Solving problems with Python: Closest good restaurant [article]