
How-To Tutorials - Data


Configuring Brokers

Packt
21 Oct 2015
18 min read
In this article by Saurabh Minni, author of Apache Kafka Cookbook, we will cover the following topics:

Configuring basic settings
Configuring threads and performance
Configuring log settings
Configuring replica settings
Configuring the ZooKeeper settings
Configuring other miscellaneous parameters

(For more resources related to this topic, see here.)

This article explains the configuration of a Kafka broker. Before we get started with Kafka, it is critical to configure it to suit us best. The best part about Kafka is that it is highly configurable. Most of the time you will be good to go with the default settings, but when dealing with scale and performance you might want to get your hands on a configuration that suits your application best.

Configuring basic settings

Let's configure the basic settings for your Apache Kafka broker.

Getting ready

I assume you already have Kafka installed. Make a copy of the server.properties file from the config folder. Now, let's get cracking with your favorite editor.

How to do it...

Open your server.properties file:

1. The first configuration that you need to change is broker.id:
    broker.id=0
2. Next, give a host name to your machine:
    host.name=localhost
3. You also need to set the port number to listen on:
    port=9092
4. Lastly, set the directory for data persistence:
    log.dirs=/disk1/kafka-logs

How it works…

With these basic configuration parameters in place, your Kafka broker is ready to be set up. All you need to do is pass this new configuration file as a parameter when you start the broker. Some of the important configurations used in the configuration file are explained here:

broker.id: This should be a nonnegative integer ID. It must be unique within a cluster, as it is used as the broker's name for all intents and purposes. This also allows the broker to be moved to a different host and/or port without additional changes on the consumer's side. Its default value is 0.
host.name: The default value for this is null. If it is not specified, Kafka will bind to all the interfaces on the system. If it is specified, it will bind only to that particular address. If you want clients to connect only to a particular interface, it is a good idea to specify the host name.
port: This defines the port number that the Kafka broker will listen on to accept client connections.
log.dirs: This tells the broker the directory where it should store files for the persistence of messages. You can specify multiple directories here as comma-separated locations. The default value for this is /tmp/kafka-logs.

There's more…

Kafka also lets you specify two more parameters, which are very interesting:

advertised.host.name: This is the hostname that is given out to producers, consumers, and other brokers to connect to. Usually, this is the same as host.name and you need not specify it.
advertised.port: This specifies the port that other producers, consumers, and brokers need to connect to. If not specified, it uses the one mentioned in the port configuration parameter.

The real use case of the preceding parameters is when you make use of bridged connections, where your internal host.name and port number might be different from the ones that external parties need to connect to.
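To make the "pass this new configuration file as a parameter" step concrete, here is a minimal sketch of how the broker is typically started with an edited properties file in a standard Kafka distribution (the paths are assumptions and depend on where Kafka is installed):

    # start a broker using the copied and edited configuration file
    bin/kafka-server-start.sh /path/to/my-server.properties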
Configuring threads and performance

Most of the time, these are settings you need not modify. However, when you want to extract every last bit of performance from your machines, they come in handy.

Getting ready

You have your broker properties file ready and are set to edit it in your favorite editor.

How to do it...

Open your server.properties file and set the following values:

1. Change message.max.bytes:
    message.max.bytes=1000000
2. Set the number of network threads:
    num.network.threads=3
3. Set the number of IO threads:
    num.io.threads=8
4. Set the number of threads that perform background processing:
    background.threads=10
5. Set the maximum number of requests to be queued up:
    queued.max.requests=500
6. Set the send socket buffer size:
    socket.send.buffer.bytes=102400
7. Set the receive socket buffer size:
    socket.receive.buffer.bytes=102400
8. Set the maximum request size:
    socket.request.max.bytes=104857600
9. Set the number of partitions:
    num.partitions=1

How it works…

These network and performance configurations should be set to a level that is optimal for your application, and you might need to experiment a little to arrive at an optimal configuration. Here are some explanations of these configurations:

message.max.bytes: This sets the maximum size of the message that the server can receive. It should be set so as to prevent any producer from inadvertently sending extra-large messages and swamping the consumers. The default value is 1000000.
num.network.threads: This sets the number of threads that handle network requests. If you have a very high volume of incoming requests, you may need to change this value; otherwise, the default of 3 is fine for most use cases.
num.io.threads: This sets the number of threads that are spawned for IO operations. It should be set to at least the number of disks present. The default value is 8.
background.threads: This sets the number of threads that run various background jobs, such as deleting old log files. The default value is 10 and you might not need to change it.
queued.max.requests: This sets the size of the queue that holds pending requests while they wait to be processed by the IO threads. If the queue is full, the network threads stop accepting any more requests. If you have erratic loads in your application, set this to a value at which it does not throttle you.
socket.send.buffer.bytes: This sets the SO_SNDBUF buffer size used for socket connections.
socket.receive.buffer.bytes: This sets the SO_RCVBUF buffer size used for socket connections.
socket.request.max.bytes: This sets the maximum request size the server can receive. It should be smaller than the Java heap size you have set.
num.partitions: This sets the default number of partitions for any topic created without an explicit partition count.

There's more

You might also need to configure your Java installation for maximum performance. This includes settings for heap, socket size, and so on.
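As a hedged example of the Java-side tuning mentioned above: with the stock start scripts, the broker's heap can usually be controlled through the KAFKA_HEAP_OPTS environment variable before launching the broker. The sizes below are placeholders for illustration, not recommendations:

    # give the broker a 4 GB heap (values are illustrative only)
    export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
    bin/kafka-server-start.sh /path/to/my-server.properties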
Configuring log settings

Log settings are perhaps the most important configurations you need to change based on your system requirements.

Getting ready

Just open the server.properties file in your favorite editor.

How to do it...

Open your server.properties file. Here are the default values:

1. Change the log.segment.bytes value:
    log.segment.bytes=1073741824
2. Set the log.roll.{ms,hours} value:
    log.roll.{ms,hours}=168 hours
3. Set the log.cleanup.policy value:
    log.cleanup.policy=delete
4. Set the log.retention.{ms,minutes,hours} value:
    log.retention.{ms,minutes,hours}=168 hours
5. Set the log.retention.bytes value:
    log.retention.bytes=-1
6. Set the log.retention.check.interval.ms value:
    log.retention.check.interval.ms=30000
7. Set the log.cleaner.enable value:
    log.cleaner.enable=false
8. Set the log.cleaner.threads value:
    log.cleaner.threads=1
9. Set the log.cleaner.backoff.ms value:
    log.cleaner.backoff.ms=15000
10. Set the log.index.size.max.bytes value:
    log.index.size.max.bytes=10485760
11. Set the log.index.interval.bytes value:
    log.index.interval.bytes=4096
12. Set the log.flush.interval.messages value:
    log.flush.interval.messages=Long.MaxValue
13. Set the log.flush.interval.ms value:
    log.flush.interval.ms=Long.MaxValue

How it works…

Here is the explanation of the log settings:

log.segment.bytes: This defines the maximum segment size in bytes. Once a segment reaches this size, a new segment file is created. A topic is stored as a bunch of segment files in a directory. This can also be set on a per-topic basis. Its default value is 1 GB.
log.roll.{ms,hours}: This sets the time period after which a new segment file is created even if the current one has not reached the size limit. This can also be set on a per-topic basis. Its default value is 7 days.
log.cleanup.policy: The value for this can be either delete or compact. With the delete option, log segments are deleted periodically when they reach the time threshold or size limit. With the compact option, log compaction is used to clean up obsolete records. This setting can be set on a per-topic basis.
log.retention.{ms,minutes,hours}: This sets the amount of time for which log segments are retained. It can be set on a per-topic basis, and the default is 7 days.
log.retention.bytes: This sets the maximum number of bytes of log per partition that are retained before segments are deleted. It can be set on a per-topic basis. When either the time or the size limit is reached, segments are deleted.
log.retention.check.interval.ms: This sets the time interval at which logs are checked for deletion to meet retention policies. The default value is 5 minutes.
log.cleaner.enable: For log compaction to be enabled, this has to be set to true.
log.cleaner.threads: This sets the number of threads that clean logs for compaction.
log.cleaner.backoff.ms: This defines the interval at which the cleaner checks whether any logs need cleaning.
log.index.size.max.bytes: This setting sets the maximum size allowed for the offset index of each log segment. It can be set on a per-topic basis as well.
log.index.interval.bytes: This defines the byte interval at which a new entry is added to the offset index. For each fetch request, the broker performs a linear scan over a certain number of bytes to find the correct position in the log to begin and end the fetch. Setting this to a larger value means larger index files (and a bit more memory usage) but less scanning.
log.flush.interval.messages: This is the number of messages kept in memory before they are flushed to disk. Though this does not guarantee durability, it gives finer control.
log.flush.interval.ms: This sets the time interval at which messages are flushed to disk.
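Several of these settings are described above as configurable on a per-topic basis. As a hedged illustration, in the Kafka versions contemporary with this article, per-topic overrides could be supplied when creating a topic with the kafka-topics.sh tool; the topic name and values here are hypothetical, and the exact flags vary between Kafka versions:

    # create a topic with its own retention and cleanup policy overrides
    bin/kafka-topics.sh --zookeeper localhost:2181 --create \
      --topic example-topic --partitions 1 --replication-factor 1 \
      --config retention.ms=86400000 --config cleanup.policy=compact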
There's more

Some other settings are listed at http://kafka.apache.org/documentation.html#brokerconfigs.

See also

More on log compaction is available at http://kafka.apache.org/documentation.html#compaction.

Configuring replica settings

You will also want to set up replication for reliability purposes. Let's see some of the important settings you need to handle for replication to work best for you.

Getting ready

Open the server.properties file in your favorite editor.

How to do it...

Open your server.properties file. Here are the default values for the settings:

1. Set the default.replication.factor value:
    default.replication.factor=1
2. Set the replica.lag.time.max.ms value:
    replica.lag.time.max.ms=10000
3. Set the replica.lag.max.messages value:
    replica.lag.max.messages=4000
4. Set the replica.fetch.max.bytes value:
    replica.fetch.max.bytes=1048576
5. Set the replica.fetch.wait.max.ms value:
    replica.fetch.wait.max.ms=500
6. Set the num.replica.fetchers value:
    num.replica.fetchers=1
7. Set the replica.high.watermark.checkpoint.interval.ms value:
    replica.high.watermark.checkpoint.interval.ms=5000
8. Set the fetch.purgatory.purge.interval.requests value:
    fetch.purgatory.purge.interval.requests=1000
9. Set the producer.purgatory.purge.interval.requests value:
    producer.purgatory.purge.interval.requests=1000
10. Set the replica.socket.timeout.ms value:
    replica.socket.timeout.ms=30000
11. Set the replica.socket.receive.buffer.bytes value:
    replica.socket.receive.buffer.bytes=65536

How it works…

Here is the explanation of the preceding settings:

default.replication.factor: This sets the default replication factor for automatically created topics.
replica.lag.time.max.ms: If the leader does not receive any fetch requests from a follower within this time period, the follower is moved out of the in-sync replicas and treated as dead.
replica.lag.max.messages: This is the maximum number of messages by which a follower can lag behind the leader before it is considered dead and not in sync.
replica.fetch.max.bytes: This sets the maximum number of bytes of data that a follower will fetch in a request from its leader.
replica.fetch.wait.max.ms: This sets the maximum amount of time for the leader to respond to a replica's fetch request.
num.replica.fetchers: This specifies the number of threads used to replicate messages from the leader. Increasing the number of threads increases the IO rate to a degree.
replica.high.watermark.checkpoint.interval.ms: This specifies the frequency with which each replica saves its high watermark to disk for recovery.
fetch.purgatory.purge.interval.requests: This sets the purge interval of the fetch request purgatory, which is where fetch requests are kept on hold until they can be serviced.
producer.purgatory.purge.interval.requests: This sets the purge interval of the producer request purgatory, which is where producer requests are kept on hold until they can be serviced.

There's more

Some other settings are listed at http://kafka.apache.org/documentation.html#brokerconfigs.
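Note that default.replication.factor only applies to automatically created topics; for explicitly created topics, the replication factor is given at creation time. A hedged sketch (the topic name is hypothetical, and the flags are those of the Kafka tooling of this era):

    # explicitly create a topic replicated across three brokers
    bin/kafka-topics.sh --zookeeper localhost:2181 --create \
      --topic orders --partitions 3 --replication-factor 3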
Configuring the ZooKeeper settings

ZooKeeper is used in Kafka for cluster management and to maintain the details of topics.

Getting ready

Just open the server.properties file in your favorite editor.

How to do it…

Open your server.properties file. Here are the default values for the settings:

1. Set the zookeeper.connect property:
    zookeeper.connect=127.0.0.1:2181,192.168.0.32:2181
2. Set the zookeeper.session.timeout.ms property:
    zookeeper.session.timeout.ms=6000
3. Set the zookeeper.connection.timeout.ms property:
    zookeeper.connection.timeout.ms=6000
4. Set the zookeeper.sync.time.ms property:
    zookeeper.sync.time.ms=2000

How it works…

Here is the explanation of these settings:

zookeeper.connect: This is where you specify the ZooKeeper connection string in the form of hostname:port. You can use comma-separated values to specify multiple ZooKeeper nodes. This ensures the reliability and continuity of the Kafka cluster even if a ZooKeeper node is down. ZooKeeper allows you to use a chroot path to make the data of a particular Kafka cluster available only under that path. This enables you to have the same ZooKeeper cluster support multiple Kafka clusters. Here is how to specify the connection string in this case:

    host1:port1,host2:port2,host3:port3/chroot/path

The preceding statement puts all the cluster data under the /chroot/path path. This path must be created prior to starting the Kafka cluster, and consumers must use the same string.

zookeeper.session.timeout.ms: This specifies the time within which, if a heartbeat is not received from the broker, it is considered dead. This value must be chosen carefully: if the interval is too long, a dead broker will not be detected in time, which leads to issues; if it is too short, a live broker might be considered dead.
zookeeper.connection.timeout.ms: This specifies the maximum time that the client waits to establish a connection with ZooKeeper.
zookeeper.sync.time.ms: This specifies the time period by which a ZooKeeper follower can be behind its leader.

See also

The ZooKeeper management details from the Kafka perspective are highlighted at http://kafka.apache.org/documentation.html#zk. You can find ZooKeeper at https://zookeeper.apache.org/.
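Putting the chroot idea into the broker configuration itself, the relevant line in server.properties might look like the following sketch (the host names and the chroot path are placeholders):

    zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka/cluster1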
Configuring other miscellaneous parameters

Besides the configurations mentioned previously, there are some other configurations that also need to be set.

Getting ready

Open the server.properties file in your favorite editor. We will look at the default values of the properties in the following section.

How to do it...

1. Set the auto.create.topics.enable property:
    auto.create.topics.enable=true
2. Set the controlled.shutdown.enable property:
    controlled.shutdown.enable=true
3. Set the controlled.shutdown.max.retries property:
    controlled.shutdown.max.retries=3
4. Set the controlled.shutdown.retry.backoff.ms property:
    controlled.shutdown.retry.backoff.ms=5000
5. Set the auto.leader.rebalance.enable property:
    auto.leader.rebalance.enable=true
6. Set the leader.imbalance.per.broker.percentage property:
    leader.imbalance.per.broker.percentage=10
7. Set the leader.imbalance.check.interval.seconds property:
    leader.imbalance.check.interval.seconds=300
8. Set the offset.metadata.max.bytes property:
    offset.metadata.max.bytes=4096
9. Set the max.connections.per.ip property:
    max.connections.per.ip=Int.MaxValue
10. Set the connections.max.idle.ms property:
    connections.max.idle.ms=600000
11. Set the unclean.leader.election.enable property:
    unclean.leader.election.enable=true
12. Set the offsets.topic.num.partitions property:
    offsets.topic.num.partitions=50
13. Set the offsets.topic.retention.minutes property:
    offsets.topic.retention.minutes=1440
14. Set the offsets.retention.check.interval.ms property:
    offsets.retention.check.interval.ms=600000
15. Set the offsets.topic.replication.factor property:
    offsets.topic.replication.factor=3
16. Set the offsets.topic.segment.bytes property:
    offsets.topic.segment.bytes=104857600
17. Set the offsets.load.buffer.size property:
    offsets.load.buffer.size=5242880
18. Set the offsets.commit.required.acks property:
    offsets.commit.required.acks=-1
19. Set the offsets.commit.timeout.ms property:
    offsets.commit.timeout.ms=5000

How it works…

An explanation of the settings is as follows:

auto.create.topics.enable: Setting this to true makes sure that if you fetch metadata for, or produce messages to, a nonexistent topic, it will be created automatically. Ideally, in a production environment, you should set this value to false.
controlled.shutdown.enable: Setting this to true makes sure that when shutdown is called on the broker, and it is the leader of any topic partitions, it gracefully moves leadership to another broker before shutting down. This increases the overall availability of the system.
controlled.shutdown.max.retries: This sets the maximum number of retries the broker makes to perform a controlled shutdown before performing an unclean one.
controlled.shutdown.retry.backoff.ms: This sets the backoff time between controlled shutdown retries.
auto.leader.rebalance.enable: If this is set to true, the broker will automatically try to balance partition leadership among the brokers by periodically giving leadership back to the preferred replica of each partition, if it is available.
leader.imbalance.per.broker.percentage: This sets the percentage of leader imbalance allowed per broker. The cluster will rebalance leadership if this ratio goes above the set value.
leader.imbalance.check.interval.seconds: This defines the time period for checking leader imbalance.
offset.metadata.max.bytes: This defines the maximum amount of metadata a client is allowed to store along with its offset commit.
max.connections.per.ip: This sets the maximum number of connections that the broker accepts from a given IP address.
connections.max.idle.ms: This sets the maximum time a socket connection can remain idle before the broker closes it.
unclean.leader.election.enable: Setting this to true allows replicas that are not in the in-sync replica set (ISR) to become the leader. This can lead to data loss.
It is, however, a last resort for keeping the cluster available.
offsets.topic.num.partitions: This sets the number of partitions for the offsets commit topic. It cannot be changed after deployment, so it is suggested that it be set to a higher value. The default is 50.
offsets.topic.retention.minutes: Offsets older than this retention period are marked for deletion. Actual deletion occurs when the log cleaner runs compaction on the offsets topic.
offsets.retention.check.interval.ms: This sets the time interval at which stale offsets are checked for.
offsets.topic.replication.factor: This sets the replication factor for the offsets commit topic. The higher the value, the higher the availability. If, at the time the offsets topic is created, the number of brokers is lower than the replication factor, the number of replicas created will be equal to the number of brokers.
offsets.topic.segment.bytes: This sets the segment size for the offsets topic. Keeping it relatively low leads to faster log compaction and loading.
offsets.load.buffer.size: This sets the buffer size used for reading offsets segments into the offset manager's cache.
offsets.commit.required.acks: This sets the number of acknowledgements required before an offset commit can be accepted.
offsets.commit.timeout.ms: This sets how long an offset commit is held before it fails, in case the required number of replicas have not received it.

See also

There are more broker configurations available. Read more about them at http://kafka.apache.org/documentation.html#brokerconfigs.

Summary

In this article, we discussed setting basic configurations for the Kafka broker, and configuring and managing threads, performance, logs, and replicas. We also discussed the ZooKeeper settings that are used for cluster management, and some miscellaneous parameters.

Resources for Article:

Further resources on this subject:
Writing Consumers [article]
Introducing Kafka [article]
Testing With Groovy [article]


Understanding Text Search and Hierarchies in SAP HANA

Packt
20 Oct 2015
9 min read
In this article by Vinay Singh, author of the book Real Time Analytics with SAP HANA, we cover Full Text Search and hierarchies in SAP HANA, and how to create and use them in our data models. After completing this article, you should be able to:

Create and use Full Text Search
Create hierarchies: level and parent-child hierarchies

(For more resources related to this topic, see here.)

Creating and using Full Text Search

Before we proceed with the creation and use of Full Text Search, let's quickly go through the basic terms associated with it. They are as follows:

Text Analysis: This is the process of analyzing unstructured text, extracting relevant information, and then transforming this information into structured information that can be leveraged in different ways. SAP HANA provides analysis rules for many industries in many languages, which offer additional possibilities to analyze strings or large text columns.
Full Text Search: This capability of HANA significantly speeds up searches within large amounts of text data. The primary function of Full Text Search is to optimize linguistic searches.
Fuzzy Search: This functionality enables you to find strings that match a pattern approximately (rather than exactly). It is a fault-tolerant search, meaning that a query returns records even if the search term contains additional or missing characters, or even spelling mistakes. It is an alternative to a non-fault-tolerant SQL statement.
The score() function: When using contains() in the where clause of a select statement, the score() function can be used to retrieve the score. This is a numeric value between 0.0 and 1.0. The score defines the similarity between the user input and the records returned by the search. A score of 0.0 means that there is no similarity. The higher the score, the more similar a record is to the search input.

Some of the applied uses of fuzzy search are:

A fault-tolerant check for duplicate records. It helps to prevent duplicate entries in systems by searching for similar existing entries.
Fault-tolerant search in text columns: for example, search documents for diode and find all documents that contain the term triode.
Fault-tolerant search in structured database content: search for rhyming words, for example coffee krispy biscuit, and find toffee crisp biscuits (the standard example given by SAP).

Let's see the use cases for text search:

Combining structured and unstructured data
Medicine and healthcare
Patents
Brand monitoring and the buying patterns of consumers
Real-time analytics on a large volume of data
Data from social media
Finance data
Sales optimization
Monitoring and production planning

The results of text analysis are stored in a table and can therefore be leveraged in all the HANA-supported scenarios:

Standard analytics: Create analytical views and calculation views on top. For example, companies mentioned in news articles over time.
Data mining and predictive analysis: Using R and Predictive Analysis Library (PAL) functions. For example, clustering, time series analysis, and so on.
Search-based applications: Create a search model and build a search UI with the HANA Info Access (InA) toolkit for HTML5. Text analysis results can be used to navigate and filter search results. For example, a people finder, or a search UI for internal documents.

The capabilities of HANA Full Text Search and text analysis are as follows:

Native full text search
Database text analysis
Graphical modeling of search models
The Info Access toolkit for HTML5 UIs
The benefits of full text search are:

Extract unstructured content with no additional cost
Combine structured and unstructured information for unified information access
Less data duplication and transfer
Harness the benefits of the InA (Info Access) toolkit for HTML5 applications

The following data types are supported by fuzzy search:

Short text
Text
VARCHAR
NVARCHAR
Date
Data with a full text index

Enabling the search option

Before we can use the search option in any attribute or analytical view, we need to enable this functionality in the SAP HANA Studio Preferences, as shown in the following screenshot:

We are now well prepared to move ahead with the creation and use of Full Text Search. Let's do this step by step:

1. Create the schema and set it as the current schema:
    CREATE SCHEMA DEMO; // it may already be present from our previous exercises
    SET SCHEMA DEMO;    // set the schema name
2. Create a column table including FUZZY SEARCH indexed columns:
    DROP TABLE DEMO.searchtbl_FUZZY;
    CREATE COLUMN TABLE DEMO.searchtbl_FUZZY (
      CUST_NAME   TEXT FUZZY SEARCH INDEX ON,
      CUST_COUNTY TEXT FUZZY SEARCH INDEX ON,
      CUST_DEPT   TEXT FUZZY SEARCH INDEX ON
    );
3. Prepare the fuzzy search logic (SQL):

Search for customers in counties that contain the word MAIN:

    SELECT score() AS score, *
    FROM searchtbl_FUZZY
    WHERE CONTAINS(cust_county, 'MAIN');

Search for customers in counties that contain the word West, with the fuzzy parameter set to 0.3:

    SELECT score() AS score, *
    FROM searchtbl_FUZZY
    WHERE CONTAINS(cust_county, 'West', FUZZY(0.3));

Perform a fuzzy search for customers working in a department whose name includes the word Department:

    SELECT highlighted(cust_dept), score() AS score, *
    FROM searchtbl_FUZZY
    WHERE CONTAINS(cust_dept, 'Department', FUZZY(0.5));

Perform a fuzzy search across all columns, looking for the word Customer:

    SELECT score() AS score, *
    FROM searchtbl_FUZZY
    WHERE CONTAINS(*, 'Customer', FUZZY(0.5));
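The recipe above creates the table but does not show any sample rows, so the queries would come back empty as written. Here is a hedged sketch of some test data (the values are invented purely for illustration) that would give the fuzzy queries something to match:

    INSERT INTO DEMO.searchtbl_FUZZY VALUES ('John Smith', 'West Mainland', 'Sales Department');
    INSERT INTO DEMO.searchtbl_FUZZY VALUES ('Jane Doe',   'Mainland',      'Customer Service');
    INSERT INTO DEMO.searchtbl_FUZZY VALUES ('Max Weber',  'Westland',      'IT Departmnt');

With rows like these, the FUZZY(0.3) search on West should also pick up near matches such as Westland, and the misspelled IT Departmnt should still be found by the Department query.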
Creating hierarchies

Hierarchies are created to maintain data in a structured format, such as maintaining customer or employee data based on their roles, or splitting data based on geographies. Hierarchical data is very useful for organizational purposes during decision making. Two types of hierarchies can be created in SAP HANA:

The level hierarchy
The parent-child hierarchy

Hierarchies are initially created in an attribute view and can later be combined in an analytic view or calculation view for consumption in a report, as per business requirements. Let's create both types of hierarchies in attribute views.

Creating a level hierarchy

Each level represents a position in the hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Each level above the base level contains aggregate values for the levels below it. Proceed as follows:

1. Create a new attribute view (for your own practice, I would suggest you create a new one). You can also use an existing one. Use the SNWD_PD EPM sample table.
2. In the output view, mark the following as output:
3. In the semantic node of the view, create a new hierarchy as shown in the following screenshot and fill in the details:
4. Save and activate the view. Now the hierarchy is ready to be used in an analytical view.
5. Add the client and node key again as output of the attribute view you just created, that is, AT_LEVEL_HIERARCY_DEMO, as we will use these two fields when we create an analytical view. It should look like the following screenshot.
6. Add the attribute view created in the preceding step and the SNWD_SO_I table to the data foundation:
7. Join client to client and product GUID to node key:
8. Save and activate.
9. Go to MS Excel: All Programs | Microsoft Office | Microsoft Excel 2010, then go to the Data tab | From Other Sources | From Data Connection Wizard. You will get a new popup for the Data Connection Wizard; choose Other/Advanced | SAP HANA MDX Provider:
10. You will be asked to provide the connection details; fill in the details and test the connection (these are the same details that you used while adding the system to SAP HANA Studio).
11. The Data Connection Wizard will now ask you to choose the analytical view (choose the one that you just created in the preceding step):
12. The preceding steps will take you to an Excel sheet, and you will see data according to the choices you made in the PivotTable field list:

Creating a parent-child hierarchy

The parent-child hierarchy is a simple, two-level hierarchy where the child element has an attribute containing the parent element. These two columns define the hierarchical relationships among the members of the dimension. The first column, called the member key column, identifies each dimension member. The other column, called the parent column, identifies the parent of each dimension member. The parent attribute determines the name of each level in the parent-child hierarchy and determines whether the data for parent members should be displayed.

Let's create a parent-child hierarchy using the following steps:

1. Create an attribute view.
2. Create a table that holds the parent-child information. The following is the sample code and the insert statements:

    CREATE COLUMN TABLE "DEMO"."CCTR_HIE"(
      "CC_CHILD"  NVARCHAR(4),
      "CC_PARENT" NVARCHAR(4));

    insert into "DEMO"."CCTR_HIE" values('','');
    insert into "DEMO"."CCTR_HIE" values('C11','c1');
    insert into "DEMO"."CCTR_HIE" values('C12','c1');
    insert into "DEMO"."CCTR_HIE" values('C13','c1');
    insert into "DEMO"."CCTR_HIE" values('C14','c2');
    insert into "DEMO"."CCTR_HIE" values('C21','c2');
    insert into "DEMO"."CCTR_HIE" values('C22','c2');
    insert into "DEMO"."CCTR_HIE" values('C31','c3');
    insert into "DEMO"."CCTR_HIE" values('C1','c');
    insert into "DEMO"."CCTR_HIE" values('C2','c');
    insert into "DEMO"."CCTR_HIE" values('C3','c');

3. Add the preceding table to the data foundation of the attribute view.
4. Make CC_CHILD the key attribute.
5. Now create a new hierarchy as shown in the following screenshot:
6. Save and activate the hierarchy.
7. Create a new analytical view, and add the HIE_PARENT_CHILD_DEMO view and the CCTR_COST table to the data foundation.
8. Join CCTR to CC_CHILD with a many-to-one relationship.
9. Make sure that, in the semantic node, COST is set as a measure.
10. Save and activate the analytical view, then preview the data.

As per the business need, we can use either of the two hierarchies along with an attribute view or analytical view.

Summary

In this article, we took a deep dive into Full Text Search, fuzzy logic, and hierarchy concepts. We learned how to create and use text search and fuzzy logic. The parent-child and level hierarchies were discussed in detail, with a hands-on approach to both.

Resources for Article:

Further resources on this subject:
Sabermetrics with Apache Spark [article]
Meeting SAP Lumira [article]
Achieving High-Availability on AWS Cloud [article]


QlikView Tips and Tricks

Packt
20 Oct 2015
6 min read
In this article by Andrew Dove and Roger Stone, authors of the book QlikView Unlocked, we will cover the following key topics:

A few coding tips
The surprising data sources
Include files
Change logs

(For more resources related to this topic, see here.)

A few coding tips

There are many ways to improve things in QlikView. Some are techniques and others are simply useful things to know or do. Here are a few of our favourite ones.

Keep the coding style constant

There's actually more to this than just being a tidy developer. So, always code your function names in the same way; it doesn't matter which style you use (unless you have installation standards that require a particular style). For example, you could use MonthStart(), monthstart(), or MONTHSTART(). They're all equally valid, but for consistency, choose one and stick to it.

Use MUST_INCLUDE rather than INCLUDE

This feature wasn't documented at all until quite a late service release of v11.2; however, it's very useful. If you use INCLUDE and the file you're trying to include can't be found, QlikView will silently ignore it. The consequences of this are unpredictable, ranging from strange behaviour to outright script failure. If you use MUST_INCLUDE, QlikView will complain that the included file is missing, and you can fix the problem before it causes other issues. Actually, it seems strange that INCLUDE doesn't do this, but Qlik must have its reasons. Nevertheless, always use MUST_INCLUDE to save yourself some time and effort.

Put version numbers in your code

QlikView doesn't have a versioning system as such, and we have yet to see one that works effectively with QlikView. So, this requires some discipline on the part of the developer. Devise a versioning system and always place the version number in a variable that is displayed somewhere in the application. It doesn't matter whether you update this number for every single change, but ensure that it's updated for every release to the user and that it ties in with your own release logs.

Do stringing in the script and not in screen objects

We would have included this anyway, but its place in the article was assured by a recent experience on a user site. They wanted four lines of address and a postcode strung together into a single field, with each part separated by a comma and a space. However, any field could contain nulls; so, to avoid addresses such as ',,,,' or ', Somewhere ,,,', there had to be a check for null on every field as the fields were strung together. The table only contained about 350 rows, but it took 56 seconds to refresh on screen when the work was done in an expression in a straight table. Moving the expression to the script and presenting just the resulting single field on screen took only 0.14 seconds. (That's right; it's about a seventh of a second.) Plus, it didn't adversely affect script performance. We can't think of a better example of improving screen performance.
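As a hedged sketch of what "stringing in the script" might look like for the address example above (the table, field names, and source file are invented for illustration, not taken from the original site):

    // Build the full address once, at load time, instead of in a chart expression
    Addresses:
    LOAD
        AddressId,
        // Prefix each non-empty part with ', ', then strip the leading separator with Mid()
        Mid(
            If(Len(Trim(Address1)) > 0, ', ' & Address1, '') &
            If(Len(Trim(Address2)) > 0, ', ' & Address2, '') &
            If(Len(Trim(Address3)) > 0, ', ' & Address3, '') &
            If(Len(Trim(Address4)) > 0, ', ' & Address4, '') &
            If(Len(Trim(Postcode)) > 0, ', ' & Postcode, ''),
            3) AS FullAddress
    FROM Addresses.qvd (qvd);

The null/empty check happens once per row here, so the chart only has to display FullAddress rather than evaluate five conditions per cell.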
The surprising data sources

QlikView will read database tables, spreadsheets, XML files, and text files, but did you know that it can also take data from a web page? If you need some standard data from the Internet, there's no need to create your own version. Just grab it from a web page! How about ISO country codes? Here's an example:

1. Open the script and click on Web files… below Data from Files, to the right of the bottom section of the screen. This will open the File Wizard: Source dialogue, as in the following screenshot.
2. Enter the URL where the table of data resides.
3. Then click on Next and, in this case, select @2 under Tables, as shown in the following screenshot.
4. Click on Finish and your script will look something similar to this:

    LOAD F1,
         Country,
         A2,
         A3,
         Number
    FROM [http://www.airlineupdate.com/content_public/codes/misc_codes/icao_nat.htm]
    (html, codepage is 1252, embedded labels, table is @2);

Now you've got a great lookup table in about 30 seconds; it will take another few seconds to clean it up for your own purposes. One small caveat, though: web pages can change address, content, and structure, so it's worth putting in some validation around this if you think there could be any volatility.

Include files

We have already said that you should use MUST_INCLUDE rather than INCLUDE, but we're always surprised that many developers never use include files at all. If the same code needs to be used in more than one place, it really should be in an include file. Suppose that you have several documents that use C:\QlikFiles\Finance\Budgets.xlsx and that the folder name is hard-coded in all of them. As soon as the file is moved to another location, you will have several modifications to make, and it's easy to miss changing a document because you may not even realise it uses the file. The solution is simple, very effective, and guaranteed to save you many reload failures. Instead of coding the full folder name, create something similar to this:

    LET vBudgetFolder='C:\QlikFiles\Finance\';

Put the line into an include file, for instance, FolderNames.inc. Then, code this into each script as follows:

    $(MUST_INCLUDE=FolderNames.inc)

Finally, when you want to refer to your Budgets.xlsx spreadsheet, code this:

    $(vBudgetFolder)Budgets.xlsx

Now, if the folder path has to change, you only need to change one line of code in the include file, and everything will work fine as long as you implement include files in all your documents. Note that this works just as well for folders containing QVD files and so on. You can also use this technique to include LOADs from QVDs or spreadsheets, because you should always aim to have just one version of the truth.

Change logs

Unfortunately, one of the things QlikView is not great at is version control. It can be really hard to see what has been done between versions of a document, and using the -prj folder feature can be extremely tedious and not necessarily helpful. So, this means that you, as the developer, need to maintain some discipline over version control. To do this, ensure that you have an area of comments that looks something similar to this right at the top of your script:

    // Demo.qvw
    //
    // Roger Stone - One QV Ltd - 04-Jul-2015
    //
    // PURPOSE
    // Sample code for QlikView Unlocked - Chapter 6
    //
    // CHANGE LOG
    // Initial version 0.1
    // - Pull in ISO table from Internet and local Excel data
    //
    // Version 0.2
    // Remove unused fields and rename incoming ISO table fields to
    // match local spreadsheet
    //

Ensure that you update this every time you make a change. You could make it even more helpful by explaining why the change was made, and not just what change was made. You should also comment the expressions in charts when they are changed.

Summary

In this article, we covered a few coding tips, the surprising data sources, include files, and change logs.

Resources for Article:

Further resources on this subject:
Qlik Sense's Vision [Article]
Securing QlikView Documents [Article]
Common QlikView script errors [Article]


SQL Server with PowerShell

Packt
19 Oct 2015
8 min read
In this article by Donabel Santos, author of the book SQL Server 2014 with PowerShell v5 Cookbook, we look at scripts and snippets of code that accomplish basic SQL Server tasks using PowerShell. She discusses simple tasks such as listing SQL Server instances and discovering SQL Server services to make you comfortable working with SQL Server programmatically. However, even as you explore how to create some common database objects using PowerShell, keep in mind that PowerShell will not always be the best tool for the task. There will be tasks that are best completed using T-SQL. It is still good to know what is possible in PowerShell and how to do it, so that you know you have alternatives depending on your requirements or situation.

For the recipes, we are going to use PowerShell ISE quite a lot. If you prefer running scripts from the PowerShell console rather than running the commands from the ISE, you can save the scripts in a .ps1 file and run them from the PowerShell console.

(For more resources related to this topic, see here.)

Listing SQL Server instances

In this recipe, we will list all SQL Server instances in the local network.

Getting ready

Log in to the server that has your SQL Server development instance as an administrator.

How to do it...

Let's look at the steps to list your SQL Server instances:

1. Open PowerShell ISE as administrator.
2. Let's use the Start-Service cmdlet to start the SQL Browser service:

    Import-Module SQLPS -DisableNameChecking

    # out of the box, the SQLBrowser is disabled. To enable:
    Set-Service SQLBrowser -StartupType Automatic

    # the SQL Browser must be installed and running for us
    # to discover SQL Server instances
    Start-Service "SQLBrowser"

3. Next, you need to create a ManagedComputer object to get access to instances. Type the following script and run it:

    $instanceName = "localhost"
    $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName

    # list server instances
    $managedComputer.ServerInstances

Your result should look similar to the one shown in the following screenshot. Notice that $managedComputer.ServerInstances gives you not only instance names, but also additional properties such as ServerProtocols, Urn, State, and so on.

4. Confirm that these are the same instances you see from SQL Server Management Studio.
5. Open SQL Server Management Studio. Go to Connect | Database Engine.
6. In the Server Name dropdown, click on Browse for More.
7. Select the Network Servers tab and check the instances listed. Your screen should look similar to this:

How it works...

All services in a Windows operating system are exposed and accessible using Windows Management Instrumentation (WMI). WMI is Microsoft's framework for listing, setting, and configuring any Microsoft-related resource. This framework follows Web-Based Enterprise Management (WBEM). The Distributed Management Task Force, Inc. (http://www.dmtf.org/standards/wbem) defines WBEM as follows:

A set of management and Internet standard technologies developed to unify the management of distributed computing environments. WBEM provides the ability for the industry to deliver a well-integrated set of standard-based management tools, facilitating the exchange of data across otherwise disparate technologies and platforms.
In order to access SQL Server WMI-related objects, you can create a WMI ManagedComputer instance:

    $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName

The ManagedComputer object has access to a ServerInstances property, which in turn lists all available instances in the local network. These instances, however, are only discoverable if the SQL Server Browser service is running. The SQL Server Browser is a Windows service that can provide information on installed instances in a box. You need to start this service if you want to list the SQL Server-related services.

There's more...

The Services property of the ManagedComputer object can also provide similar information, but you will have to filter for the server type SqlServer:

    # list server instances
    $managedComputer.Services |
        Where-Object Type -eq "SqlServer" |
        Select-Object Name, State, Type, StartMode, ProcessId

Your result should look like this:

Instead of creating a WMI instance by using the New-Object method, you can also use the Get-WmiObject cmdlet when creating your variable. Get-WmiObject, however, will not expose exactly the same properties exposed by the Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer object. To list instances using Get-WmiObject, you will need to discover what namespace is available in your environment:

    $hostName = "localhost"
    $namespace = Get-WMIObject -ComputerName $hostName -Namespace root\Microsoft\SQLServer -Class "__NAMESPACE" |
        Where-Object Name -like "ComputerManagement*"

    # see matching namespace objects
    $namespace

    # see namespace names
    $namespace | Select-Object -ExpandProperty "__NAMESPACE"
    $namespace | Select-Object -ExpandProperty "Name"

If you are using PowerShell v2, you will have to change the Where-Object cmdlet usage to use the curly braces {} and the $_ variable:

    Where-Object {$_.Name -like "ComputerManagement*"}

For SQL Server 2014, the namespace value is:

    ROOT\Microsoft\SQLServer\ComputerManagement12

This value can be derived from $namespace.__NAMESPACE and $namespace.Name. Once you have the namespace, you can use it with Get-WmiObject to retrieve the instances. We can use the SqlServiceType property to filter. According to MSDN (http://msdn.microsoft.com/en-us/library/ms179591.aspx), these are the values of SqlServiceType:

SqlServiceType - Description
1 - SQL Server Service
2 - SQL Server Agent Service
3 - Full-Text Search Engine Service
4 - Integration Services Service
5 - Analysis Services Service
6 - Reporting Services Service
7 - SQL Browser Service

Thus, to retrieve the SQL Server instances, we need to provide the full namespace ROOT\Microsoft\SQLServer\ComputerManagement12. We also need to filter for the SQL Server Service type, or SQLServiceType = 1. The code is as follows:

    Get-WmiObject -ComputerName $hostName -Namespace "$($namespace.__NAMESPACE)\$($namespace.Name)" -Class SqlService |
        Where-Object SQLServiceType -eq 1 |
        Select-Object ServiceName, DisplayName, SQLServiceType |
        Format-Table -AutoSize

Your result should look similar to the following screenshot:

Yet another way to list all the SQL Server instances in the local network is by using the System.Data.Sql.SqlDataSourceEnumerator class, instead of ManagedComputer.
This class has a static method called Instance.GetDataSources that will list all SQL Server instances:

    [System.Data.Sql.SqlDataSourceEnumerator]::Instance.GetDataSources() |
        Format-Table -AutoSize

When you execute it, your result should look similar to the following:

If you have multiple SQL Server versions, you can use the following code to display your instances:

    # list services using WMI
    foreach ($path in $namespace)
    {
        Write-Verbose "SQL Services in: $($path.__NAMESPACE)\$($path.Name)"

        Get-WmiObject -ComputerName $hostName `
            -Namespace "$($path.__NAMESPACE)\$($path.Name)" `
            -Class SqlService |
        Where-Object SQLServiceType -eq 1 |
        Select-Object ServiceName, DisplayName, SQLServiceType |
        Format-Table -AutoSize
    }

Discovering SQL Server services

In this recipe, we will enumerate all SQL Server services and list their statuses.

Getting ready

Check which SQL Server services are installed in your instance. Go to Start | Run and type services.msc. You should see a screen similar to this:

How to do it...

Let's assume you are running this script on the server box:

1. Open PowerShell ISE as administrator.
2. Add the following code and execute it:

    Import-Module SQLPS -DisableNameChecking

    # you can replace localhost with your instance name
    $instanceName = "localhost"
    $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName

    # list services
    $managedComputer.Services |
        Select-Object Name, Type, ServiceState, DisplayName |
        Format-Table -AutoSize

Your result will look similar to the one shown in the following screenshot. The items listed on your screen will vary depending on the features installed and running in your instance.

3. Confirm that these are the services that exist on your server; check your services window.

How it works...

Services that are installed on a system can be queried using WMI. Specific services for SQL Server are exposed through SMO's WMI ManagedComputer object. Some of the exposed properties are as follows:

ClientProtocols
ConnectionSettings
ServerAliases
ServerInstances
Services

There's more...

An alternative way to get SQL Server-related services is by using Get-WMIObject. We will need to pass in the host name as well as the SQL Server WMI provider for the ComputerManagement namespace. For SQL Server 2014, this value is ROOT\Microsoft\SQLServer\ComputerManagement12. The script to retrieve the services is provided here. Note that we are dynamically composing the WMI namespace. The code is as follows:

    $hostName = "localhost"
    $namespace = Get-WMIObject -ComputerName $hostName -Namespace root\Microsoft\SQLServer -Class "__NAMESPACE" |
        Where-Object Name -like "ComputerManagement*"

    Get-WmiObject -ComputerName $hostName -Namespace "$($namespace.__NAMESPACE)\$($namespace.Name)" -Class SqlService |
        Select-Object ServiceName

If you have multiple SQL Server versions installed and want to see just the most recent version's services, you can limit the query to the latest namespace by adding Select-Object -Last 1:

    $namespace = Get-WMIObject -ComputerName $hostName -Namespace root\Microsoft\SQLServer -Class "__NAMESPACE" |
        Where-Object Name -like "ComputerManagement*" |
        Select-Object -Last 1

Yet another alternative, but less accurate, way of listing possible SQL Server-related services is the following snippet of code:

    # alternative - but less accurate
    Get-Service *SQL*

This uses the Get-Service cmdlet and filters based on the service name. It is less accurate because it grabs all services that have SQL in the name, which may not necessarily be related to SQL Server.
For example, if you have MySQL installed, it will get picked up. Conversely, this will not pick up SQL Server-related services that do not have SQL in the name, such as ReportServer.

Summary

You will find that many of these tasks can be accomplished using PowerShell and SQL Server Management Objects (SMO). SMO is a library that exposes SQL Server classes that allow programmatic manipulation and automation of many database tasks. For some recipes, we will also explore alternative ways of accomplishing the same tasks using different native PowerShell cmdlets. Now that we have a gist of SQL Server 2014 with PowerShell, let's build a full-fledged e-commerce project with SQL Server 2014 with PowerShell v5 Cookbook.

Resources for Article:

Further resources on this subject:
Exploring Windows PowerShell 5.0 [article]
Working with PowerShell [article]
Installing/upgrading PowerShell [article]


An Overview of Oozie

Packt
19 Oct 2015
5 min read
In this article by Jagat Singh, the author of the book Apache Oozie Essentials, we will see a basic overview of Oozie and its concepts in brief.

(For more resources related to this topic, see here.)

Concepts

Oozie is a workflow scheduler system used to run Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graph (DAG) (https://en.wikipedia.org/wiki/Directed_acyclic_graph) representations of actions. Actions tell the job what to do. Oozie supports running jobs of various types, such as Java, MapReduce, Pig, Hive, Sqoop, Spark, and DistCp. The output of one action can be consumed by the next action to create a chained sequence. Oozie has a client-server architecture: we install the server for storing the jobs, and we submit our jobs to the server using the client. Let's get an idea of a few basic Oozie concepts.

Workflow

A workflow tells Oozie what to do. It is a collection of actions arranged in the required dependency graph. So, as part of a workflow definition, we write some actions and call them in a certain order. These actions come in various types for the tasks we can do as part of a workflow: for example, a Hadoop filesystem action, Pig action, Hive action, MapReduce action, Spark action, and so on.

Coordinator

A coordinator tells Oozie when to do something. Coordinators let us run interdependent workflows as data pipelines based on some starting criteria. Most Oozie jobs are triggered at a given scheduled time interval, or when an input dataset is present to trigger the job. The following definitions are important for coordinators:

Nominal time: The scheduled time at which the job should execute. For example, we process press releases every day at 8:00 PM.
Actual time: The real time when the job ran. In some cases, if the input data does not arrive, the job might start late. This type of data-dependent job triggering is indicated by a done-flag (more on this later). The done-flag gives the signal to start the job execution.

The general skeleton template of a coordinator is shown in the following figure:

Bundles

Bundles tell Oozie which things to do together as a group. For example, a set of coordinators that can be run together to satisfy a given business requirement can be combined as a bundle.

Book case study

One of the main use cases of Hadoop is ETL data processing. Suppose we work for a large consulting company and have won a project to set up a Big Data cluster inside a customer's data center. At a high level, the requirement is to set up an environment that will satisfy the following flow:

We get data from various sources into Hadoop (file-based loads and Sqoop-based loads).
We preprocess it with various scripts (Pig, Hive, MapReduce).
We insert that data into Hive tables for use by analysts and data scientists.
Data scientists write machine learning models (Spark).

We will be using Oozie as our processing scheduling system to do all of the above. In our architecture, we have one landing server, which sits outside the cluster as its front door. All source systems send files to us via scp, and we regularly (nightly, to keep it simple) push them to HDFS using the hadoop fs -copyFromLocal command. This script is cron-driven. It has very simple business logic: it runs every night at 8:00 PM and moves all the files it sees on the landing server into HDFS.

From there, Oozie takes over:

Oozie picks up the files and cleans them with a Pig script, replacing all the comma (,) delimiters with pipes (|). We will write the same code using both Pig and MapReduce.
We then push the processed files into a Hive table.
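To make the "what" part concrete, here is a rough sketch of what a minimal Oozie workflow definition for the Pig cleaning step just described might look like. All names, paths, and the exact schema version are assumptions for illustration; the node types used here (start, action, kill, end) are covered in the Node types section below.

    <workflow-app name="clean-landing-files-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="clean-with-pig"/>

        <!-- Run the Pig script that swaps comma delimiters for pipes -->
        <action name="clean-with-pig">
            <pig>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <script>clean_delimiters.pig</script>
                <param>INPUT=${inputDir}</param>
                <param>OUTPUT=${outputDir}</param>
            </pig>
            <ok to="end"/>
            <error to="fail"/>
        </action>

        <kill name="fail">
            <message>Pig cleaning failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
        </kill>

        <end name="end"/>
    </workflow-app>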
For a different source system, a database-backed MySQL table, we do a nightly Sqoop extract when the load on the database is light. We extract all the records generated on the previous business day, and the output of that is also inserted into Hive tables. Analysts and data scientists then write their magical Hive scripts and Spark machine learning models on top of those Hive tables. We will use Oozie to schedule all of these regular tasks.

Node types

A workflow is composed of nodes; the logical DAG of nodes represents the 'what' part of the work done by Oozie. Each node does its specified work and, on success, moves to one node or, on failure, moves to another node. For example, on success it goes to the OK node, and on failure it goes to the Kill node. Nodes in an Oozie workflow are of the following types.

Control flow nodes

These nodes are responsible for defining the start, the end, and the control flow of what to do inside the workflow. They can be one of the following:

Start node
End node
Kill node
Decision node
Fork and Join nodes

Action nodes

Action nodes represent the actual processing tasks, which are executed when called. These come in various types: for example, the Pig action, Hive action, and MapReduce action.

Summary

In this article, we looked at the concepts of Oozie in brief. We also learned about the types of nodes in Oozie.

Resources for Article:

Further resources on this subject:
Introduction to Hadoop [article]
Hadoop and HDInsight in a Heartbeat [article]
Cloudera Hadoop and HP Vertica [article]


Introducing Test-driven Machine Learning

Packt
14 Oct 2015
19 min read
In this article by Justin Bozonier, the author of the book Test Driven Machine Learning, we will see how to develop complex software (sometimes rooted in randomness) in small, controlled steps. It will also guide you on how to begin developing solutions to machine learning problems using test-driven development (from here on, written as TDD). Mastering TDD is not something the book will achieve. Instead, the book will help you begin your journey and expose you to guiding principles, which you can use to creatively solve challenges as you encounter them. We will answer the following three questions in this article:

What are TDD and behavior-driven development (BDD)?
How do we apply these concepts to machine learning, and to making inferences and predictions?
How does this work in practice?

(For more resources related to this topic, see here.)

After having answers to these questions, we will be ready to move on to tackling real problems. The book is about applying these concepts to solve machine learning problems, and this article is the largest theoretical explanation that we will have, with the remainder of the theory being described by example. Because of the focus on application, you will learn much more than just the theory of TDD and BDD. To read more about the theory and the ideals, search the Internet for articles written by the following people:

Kent Beck: The father of TDD
Dan North: The father of BDD
Martin Fowler: The father of refactoring; he has also created a large knowledge base on these topics
James Shore: One of the authors of The Art of Agile Development, he has a deep theoretical understanding of TDD and explains its practical value quite well

These concepts are incredibly simple and yet can take a lifetime to master. When applied to machine learning, we must find new ways to control and/or measure the random processes inherent in the algorithms. This will come up in this article as well as in others. In the next section, we will develop a foundation for TDD and begin to explore its application.

Test-driven development

Kent Beck wrote in his seminal book on the topic that TDD consists of only two specific rules:

Don't write a line of new code unless you first have a failing automated test.
Eliminate duplication.

This, as he noted, fairly quickly leads us to a mantra, really the mantra of TDD: Red, Green, Refactor. If this is a bit abstract, let me restate it: TDD is a software development process that enables a programmer to write code that specifies the intended behavior before writing any software to actually implement that behavior. The key value of TDD is that at each step of the way, you have working software as well as an itemized set of specifications. TDD is a software development process that requires the following:

Writing code to detect the intended behavioral change.
A rapid iteration cycle that produces working software after each iteration.
A clear definition of what a bug is: if a test is not failing but a bug is found, it is not a bug; it is a new feature.

Another point that Kent makes is that, ultimately, this technique is meant to reduce fear in the development process. Each test is a checkpoint along the way to your goal. If you stray too far from the path and wind up in trouble, you can simply delete any tests that shouldn't apply, and then work your code back to a state where the rest of your tests pass. There's a lot of trial and error inherent in TDD, but the same applies to machine learning.
The software that you design using TDD will also be modular enough to be able to have different components swapped in and out of your pipeline. You might be thinking that just thinking through test cases is equivalent to TDD. If you are like the most people, what you write is different from what you might verbally say, and very different from what you think. By writing the intent of our code before we write our code, it applies a pressure to our software design that prevents you from writing "just in case" code. By this I mean the code that we write just because we aren't sure if there will be a problem. Using TDD, we think of a test case, prove that it isn't supported currently, and then fix it. If we can't think of a test case, we don't add code. TDD can and does operate at many different levels of the software under development. Tests can be written against functions and methods, entire classes, programs, web services, neural networks, random forests, and whole machine learning pipelines. At each level, the tests are written from the perspective of the prospective client. How does this relate to machine learning? Lets take a step back and reframe what I just said. In the context of machine learning, tests can be written against functions, methods, classes, mathematical implementations, and the entire machine learning algorithms. TDD can even be used to explore technique and methods in a very directed and focused manner, much like you might use a REPL (an interactive shell where you can try out snippets of code) or the interactive (I)Python session. The TDD cycle The TDD cycle consists of writing a small function in the code that attempts to do something that we haven't programmed yet. These small test methods will have three main sections; the first section is where we set up our objects or test data; another section is where we invoke the code that we're testing; and the last section is where we validate that what happened is what we thought would happen. You will write all sorts of lazy code to get your tests to pass. If you are doing it right, then someone who is watching you should be appalled at your laziness and tiny steps. After the test goes green, you have an opportunity to refactor your code to your heart's content. In this context, refactor refers to changing how your code is written, but not changing how it behaves. Lets examine more deeply the three steps of TDD: Red, Green, and Refactor. Red First, create a failing test. Of course, this implies that you know what failure looks like in order to write the test. At the highest level in machine learning, this might be a baseline test where baseline is a better than random test. It might even be predicts random things, or even simpler always predicts the same thing. Is this terrible? Perhaps, it is to some who are enamored with the elegance and artistic beauty of his/her code. Is it a good place to start, though? Absolutely. A common issue that I have seen in machine learning is spending so much time up front, implementing the one true algorithm that hardly anything ever gets done. Getting to outperform pure randomness, though, is a useful change that can start making your business money as soon as it's deployed. Green After you have established a failing test, you can start working to get it green. If you start with a very high-level test, you may find that it helps to conceptually break that test up into multiple failing tests that are the lower-level concerns. 
I'll dive deeper into this later on in this article but for now, just know that you want to get your test passing as soon as possible; lie, cheat, and steal to get there. I promise that cheating actually makes your software's test suite that much stronger. Resist the urge to write the software in an ideal fashion. Just slap something together. You will be able to fix the issues in the next step. Refactor You got your test to pass through all the manners of hackery. Now, you get to refactor your code. Note that it is not to be interpreted loosely. Refactor specifically means to change your software without affecting its behavior. If you add the if clauses, or any other special handling, you are no longer refactoring. Then you write the software without tests. One way where you will know for sure that you are no longer refactoring is that you've broken previously passing tests. If this happens, we back up our changes until our tests pass again. It may not be obvious but this isn't all that it takes for you to know that you haven't changed behavior. Read Refactoring: Improving the Design of Existing Code, Martin Fowler for you to understand how much you should really care for refactoring. By the way of his illustration in this book, refactoring code becomes a set of forms and movements not unlike karate katas. This is a lot of general theory, but what does a test actually look like? How does this process flow in a real problem? Behavior-driven development BDD is the addition of business concerns to the technical concerns more typical of TDD. This came about as people became more experienced with TDD. They started noticing some patterns in the challenges that they were facing. One especially influential person, Dan North, proposed some specific language and structure to ease some of these issues. Some issues he noticed were the following: People had a hard time understanding what they should test next. Deciding what to name a test could be difficult. How much to test in a single test always seemed arbitrary. Now that we have some context, we can define what exactly BDD is. Simply put, it's about writing our tests in such a way that they will tell us the kind of behavior change they affect. A good litmus test might be asking oneself if the test you are writing would be worth explaining to a business stakeholder. How this solves the previous may not be completely obvious, but it may help to illustrate what this looks like in practice. It follows a structure of given, when, then. Committing to this style completely can require specific frameworks or a lot of testing ceremony. As a result, I loosely follow this in my tests as you will see soon. Here's a concrete example of a test description written in this style Given an empty dataset when the classifier is trained, it should throw an invalid operation exception. This sentence probably seems like a small enough unit of work to tackle, but notice that it's also a piece of work that any business user, who is familiar with the domain that you're working in, would understand and have an opinion on. You can read more about Dan North's point of view in this article on his website at dannorth.net/introducing-bdd/. The BDD adherents tend to use specialized tools to make the language and test result reports be as accessible to business stakeholders as possible. In my experience and from my discussions with others, this extra elegance is typically used so little that it doesn't seem worthwhile. 
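To make that given, when, then sentence concrete, here is a minimal sketch of what such a test could look like in Python. The Classifier class and the InvalidOperation exception are stand-ins invented purely for illustration (in real TDD they would not exist until the test demanded them); they are included only so the snippet is self-contained:

class InvalidOperation(Exception):
    """Stand-in exception used only for this illustration."""

class Classifier:
    """A hypothetical classifier, with just enough behavior for the test to exercise."""
    def train(self, dataset):
        if not dataset:
            raise InvalidOperation("Cannot train on an empty dataset.")

def given_an_empty_dataset_when_the_classifier_is_trained_test():
    # given
    classifier = Classifier()
    empty_dataset = []
    # when / then
    try:
        classifier.train(empty_dataset)
        assert False, "Then it should raise an invalid operation exception."
    except InvalidOperation:
        pass  # this is exactly the behavior the sentence specifies

Notice that the test name alone reads almost like the business-level sentence, which is the point of the BDD style.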
The approach you will learn in the book will take a simplicity first approach to make it as easy as possible for someone with zero background to get up to speed. With this in mind, lets work through an example. Our first test Let's start with an example of what a test looks like in Python. The main reason for using this is that while it is a bit of a pain to install a library, this library, in particular, will make everything that we do much simpler. The default unit test solution in Python requires a heavier set up. On top of this, by using nose, we can always mix in tests that use the built-in solution where we find that we need the extra features. First, install it like this: pip install nose If you have never used pip before, then it is time for you to know that it is a very simple way to install new Python libraries. Now, as a hello world style example, lets pretend that we're building a class that will guess a number using the previous guesses to inform it. This is the first simplest example to get us writing some code. We will use the TDD cycle that we discussed previously, and write our first test in painstaking detail. After we get through our first test and have something concrete to discuss, we will talk about the anatomy of the test that we wrote. First, we must write a failing test. The simplest failing test that I can think of is the following: def given_no_information_when_asked_to_guess_test(): number_guesser = NumberGuesser() result = number_guesser.guess() assert result is None, "Then it should provide no result." The context for assert is in the test name. Reading the test name and then the assert name should do a pretty good job of describing what is being tested. Notice that in my test, I instantiate a NumberGuesser object. You're not missing any steps; this class doesn't exist yet. This seems roughly like how I'd want to use it. So, it's a great place to start with. Since it doesn't exist, wouldn't you expect this test to fail? Lets test this hypothesis. To run the test, first make sure your test file is saved so that it ends in _tests.py. From the directory with the previous code, just run the following: nosetests When I do this, I get the following result: Here's a lot going on here, but the most informative part is near the end. The message is saying that NumberGuesser does not exist yet, which is exactly what I expected since we haven't actually written the code yet. Throughout the book, we'll reduce the detail of the stack traces that we show. For now, we'll keep things detailed to make sure that we're on the same page. At this point, we're in a red state in the TDD cycle. Use the following steps to create our first successful test: Now, create the following class in a file named NumberGuesser.py: class NumberGuesser: """Guesses numbers based on the history of your input"" Import the new class at the top of my test file with a simple import NumberGuesser statement. I rerun nosetests, and get the following: TypeError: 'module' object is not callable Oh whoops! I guess that's not the right way to import the class. This is another very tiny step, but what is important is that we are making forward progress through constant communication with our tests. We are going through extreme detail because I can't stress this point enough; bear with me for the time being. 
Change the import statement to the following: from NumberGuesser import NumberGuesser Rerun nosetests and you will see the following: AttributeError: NumberGuesser instance has no attribute 'guess' The error message has changed, and is leading to the next thing that needs to be changed. From here, just implement what we think we need for the test to pass: class NumberGuesser: """Guesses numbers based on the history of your input""" def guess(self): return None On rerunning the nosetests, we'll get the following result: That's it! Our first successful test! Some of these steps seem so tiny so as to not being worthwhile. Indeed, overtime, you may decide that you prefer to work on a different level of detail. For the sake of argument, we'll be keeping our steps pretty small if only to illustrate just how much TDD keeps us on track and guides us on what to do next. We all know how to write the code in very large, uncontrolled steps. Learning to code surgically requires intentional practice, and is worth doing explicitly. Lets take a step back and look at what this first round of testing took. Anatomy of a test Starting from a higher level, notice how I had a dialog with Python. I just wrote the test and Python complained that the class that I was testing didn't exist. Next, I created the class, but then Python complained that I didn't import it correctly. So then, I imported it correctly, and Python complained that my guess method didn't exist. In response, I implemented the way that my test expected, and Python stopped complaining. This is the spirit of TDD. You have a conversation between you and your system. You can work in steps as little or as large as you're comfortable with. What I did previously could've been entirely skipped over, and the Python class could have been written and imported correctly the first time. The longer you go without talking to the system, the more likely you are to stray from the path to getting things working as simply as possible. Lets zoom in a little deeper and dissect this simple test to see what makes it tick. Here is the same test, but I've commented it, and broken it into sections that you will see recurring in every test that you write: def given_no_information_when_asked_to_guess_test(): # given number_guesser = NumberGuesser() # when guessed_number = number_guesser.guess() # then assert guessed_number is None, 'there should be no guess.' Given This section sets up the context for the test. In the previous test, you acquired that I didn't provide any prior information to the object. In many of our machine learning tests, this will be the most complex portion of our test. We will be importing certain sets of data, sometimes making a few specific issues in the data and testing our software to handle the details that we would expect. When you think about this section of your tests, try to frame it as Given this scenario… In our test, we might say Given no prior information for NumberGuesser… When This should be one of the simplest aspects of our test. Once you've set up the context, there should be a simple action that triggers the behavior that you want to test. When you think about this section of your tests, try to frame it as When this happens… In our test we might say When NumberGuesser guesses a number… Then This section of our test will check on the state of our variables and any return result if applicable. 
Again, this section should also be fairly straight-forward, as there should be only a single action that causes a change into your object under the test. The reason for this is that if it takes two actions to form a test, then it is very likely that we will just want to combine the two into a single action that we can describe in terms that are meaningful in our domain. A key example maybe loading the training data from a file and training a classifier. If we find ourselves doing this a lot, then why not just create a method that loads data from a file for us? In the book, you will find examples where we'll have the helper functions help us determine whether our results have changed in certain ways. Typically, we should view these helper functions as code smells. Remember that our tests are the first applications of our software. Anything that we have to build in addition to our code, to understand the results, is something that we should probably (there are exceptions to every rule) just include in the code we are testing. Given, When, Then is not a strong requirement of TDD, because our previous definition of TDD only consisted of two things (all that the code requires is a failing test first and an eliminate duplication). It's a small thing to be passionate about and if it doesn't speak to you, just translate this back into Arrange, act, assert in your head. At the very least, consider it as well as why these specific, very deliberate words are used. Applied to machine learning At this point, you maybe wondering how TDD will be used in machine learning, and whether we use it on regression or classification problems. In every machine learning algorithm, there exists a way to quantify the quality of what you're doing. In the linear regression; it's your adjusted R2 value; in classification problems, it's an ROC curve (and the area beneath it) or a confusion matrix, and more. All of these are testable quantities. Of course, none of these quantities have a built-in way of saying that the algorithm is good enough. We can get around this by starting our work on every problem by first building up a completely naïve and ignorant algorithm. The scores that we get for this will basically represent a plain, old, and random chance. Once we have built an algorithm that can beat our random chance scores, we just start iterating, attempting to beat the next highest score that we achieve. Benchmarking algorithms are an entire field onto their own right that can be delved in more deeply. In the book, we will implement a naïve algorithm to get a random chance score, and we will build up a small test suite that we can then use to pit this model against another. This will allow us to have a conversation with our machine learning models in the same manner as we had with Python earlier. For a professional machine learning developer, it's quite likely that an ideal metric to test is a profitability model that compares risk (monetary exposure) to expected value (profit). This can help us keep a balanced view of how much error and what kind of error we can tolerate. In machine learning, we will never have a perfect model, and we can search for the rest of our lives for the best model. By finding a way to work your financial assumptions into the model, we will have an improved ability to decide between the competing models. Summary In this article, you were introduced to TDD as well as BDD. With these concepts introduced, you have a basic foundation with which to approach machine learning. 
We saw that specifying behavior in the form of sentences makes for an easier-to-read set of specifications for your software. Building off that foundation, we started to delve into testing at a higher level. We did this by establishing concepts that we can use to quantify classifiers: the ROC curve and the AUC metric. Now that we've seen that different models can be quantified, it follows that they can be compared. Putting all of this together, we have everything we need to explore machine learning with a test-driven methodology. Resources for Article: Further resources on this subject: Optimization in Python [article] How to do Machine Learning with Python [article] Modeling complex functions with artificial neural networks [article]
Transactions and Operators

Packt
13 Oct 2015
14 min read
In this article by Emilien Kenler and Federico Razzoli, authors of the book MariaDB Essentials, we will look briefly at transactions and operators. (For more resources related to this topic, see here.) Understanding transactions A transaction is a sequence of SQL statements that are grouped into a single logical operation. Its purpose is to guarantee the integrity of data. If a transaction fails, no change will be applied to the databases. If a transaction succeeds, all the statements will succeed. Take a look at the following example:

START TRANSACTION;
SELECT quantity FROM product WHERE id = 42;
UPDATE product SET quantity = quantity - 10 WHERE id = 42;
UPDATE customer SET money = money - (SELECT price FROM product WHERE id = 42) WHERE id = 512;
INSERT INTO product_order (product_id, quantity, customer_id) VALUES (42, 10, 512);
COMMIT;

We haven't yet discussed some of the statements used in this example. However, they are not important for understanding transactions. This sequence of statements occurs when a customer (whose id is 512) orders a product (whose id is 42). As a consequence, we need to execute the following suboperations in our database: check whether the desired quantity of products is available (if not, we should not proceed); decrease the available quantity of items for the product that is being bought; decrease the amount of money in the online account of our customer; and register the order so that the product is delivered to our customer. These suboperations form a more complex operation. When a session is executing this operation, we do not want other connections to interfere. Consider the following scenario: Connection A checks how many products with the ID 42 are available. Only one is available, but it is enough. Immediately after, connection B checks the availability of the same product. It finds that one is available. Connection A decreases the quantity of the product. Now, it is 0. Connection B decreases the same number. Now, it is -1. Both connections create an order. Two people will pay for the same product; however, only one is available. This is something we definitely want to avoid. However, there is another situation that we want to avoid. Imagine that the server crashes immediately after the customer's money is deducted. The order will not be written to the database, so the customer will end up paying for something he will not receive. Fortunately, transactions prevent both these situations. They protect our database writes in two ways: During a transaction, relevant data is locked or copied. In both these cases, two connections will not be able to modify the same rows at the same time. The writes will not be made effective until the COMMIT command is issued. This means that if the server crashes during the transaction, all the suboperations will be rolled back. We will not have inconsistent data (such as a payment for a product that will not be delivered). In this example, the transaction starts when we issue the START TRANSACTION command. Then, any number of operations can be performed. The COMMIT command makes the changes effective. This does not mean that if a statement fails with an error, the transaction is always aborted. In many cases, the application will receive an error and will be free to decide whether the transaction should be aborted or not. To abort the current transaction, an application can execute the ROLLBACK command. A transaction can consist of only one statement.
This perfectly makes sense because the server could crash in the middle of the statement's execution. The autocommit mode In many cases, we don't want to group multiple statements in a transaction. When a transaction consists of only one statement, sending the START TRANSACTION and COMMIT statements can be annoying. For this reason, MariaDB has the autocommit mode. By default, the autocommit mode is ON. Unless a START TRANSACTION command is explicitly used, the autocommit mode causes an implicit commit after each statement. Thus, every statement is executed in a separated transaction by default. When the autocommit mode is OFF, a new transaction implicitly starts after each commit, and the COMMIT command needs be issued explicitly. To turn the autocommit ON or OFF, we can use the @@autocommit server variable as follows: follows: MariaDB [mwa]> SET @@autocommit = OFF; Query OK, 0 rows affected (0.00 sec) MariaDB [mwa]> SELECT @@autocommit; +--------------+ | @@autocommit | +--------------+ | 0 | +--------------+ 1 row in set (0.00 sec) Transaction's limitations in MariaDB Transaction handling is not implemented in the core of MariaDB; instead, it is left to the storage engines. Many storage engines, such as MyISAM or MEMORY, do not implement it at all. Some of the transactional storage engines are: InnoDB; XtraDB; TokuDB. In a sense, Aria tables are partially transactional. Although Aria ignores commands such as START TRANSACTION, COMMIT, and ROLLBACK, each statement is somewhat a transaction. In fact, if it writes, modifies, or deletes multiple rows, the operation completely succeeds or fails, which is similar to a transaction. Only statements that modify data can be used in a transaction. Statements that modify a table structure (such as ALTER TABLE) implicitly commit the current transaction. Sometimes, we may not be sure if a transaction is active or not. Usually, this happens because we are not sure if autocommit is set to ON or not or because we are not sure if the latest statement implicitly committed a transaction. In these cases, the @in_transaction variable can help us. Its value is 1 if a transaction is active and 0 if it is not. Here is an example: MariaDB [mwa]> START TRANSACTION; Query OK, 0 rows affected (0.00 sec) MariaDB [mwa]> SELECT @@in_transaction; +------------------+ | @@in_transaction | +------------------+ | 1 | +------------------+ 1 row in set (0.00 sec) MariaDB [mwa]> DROP TABLE IF EXISTS t; Query OK, 0 rows affected, 1 warning (0.00 sec) MariaDB [mwa]> SELECT @@in_transaction; +------------------+ | @@in_transaction | +------------------+ | 0 | +------------------+ 1 row in set (0.00 sec) InnoDB is optimized to execute a huge number of short transactions. If our databases are busy and performance is important to us, we should try to avoid big transactions in terms of the number of statements and execution time. This is particularly true if we have several concurrent connections that read the same tables. Working with operators In our examples, we have used several operators, such as equals (=), less-than and greater-than (<, >), and so on. Now, it is time to discuss operators in general and list the most important ones. In general, an operator is a sign that takes one or more operands and returns a result. Several groups of operators exist in MariaDB. In this article, we will discuss the main types: Comparison operators; String operators; Logical operators; Arithmetic operators. 
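Before going through each group in detail, here is a quick, hedged sketch showing how several of these operator types can appear together in a single query; the name column is assumed for illustration and is not part of the product table used earlier in this article:

SELECT id, quantity, price
FROM product
WHERE price BETWEEN 100 AND 200      -- comparison operators
  AND name LIKE 'red%'               -- a string operator
  AND NOT quantity = 0               -- a logical operator wrapping a comparison
  AND price * quantity >= 500;       -- arithmetic operators

Each of these families is described in the sections that follow.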
Comparison operators A comparison operator checks whether there is a certain relation between its operands. If the relationship exists, the operator returns 1; otherwise, it returns 0. For example, let's take the equality operator, which is probably the most used one:

1 = 1   -- returns 1: the equality relationship exists
1 = 0   -- returns 0: no equality relationship here

In MariaDB, 1 and 0 are used in many contexts to indicate whether something is true or false. In fact, MariaDB does not have a Boolean data type, so TRUE and FALSE are merely used as aliases for 1 and 0:

TRUE = 1      -- returns 1
FALSE = 0     -- returns 1
TRUE = FALSE  -- returns 0

In a WHERE clause, a result of 0 or NULL prevents a row from being shown. All numeric results other than 0, including negative numbers, are regarded as true in this context. Non-numeric values other than NULL need to be converted to numbers in order to be evaluated by the WHERE clause. Non-numeric strings are converted to 0, whereas numeric strings are treated as numbers. Dates are converted to nonzero numbers. Consider the following example:

WHERE 1   -- is redundant; it shows all the rows
WHERE 0   -- prevents all the rows from being shown

Now, let's take a look at the following MariaDB comparison operators:

| Operator | Description | Example |
| --- | --- | --- |
| = | Equality | A = B |
| != | Inequality | A != B |
| <> | Synonym for != | A <> B |
| < | Less than | A < B |
| > | Greater than | A > B |
| <= | Less than or equal to | A <= B |
| >= | Greater than or equal to | A >= B |
| IS NULL | The operand is NULL | A IS NULL |
| IS NOT NULL | The operand is not NULL | A IS NOT NULL |
| <=> | The operands are equal, or both are NULL | A <=> B |
| BETWEEN ... AND | The left operand is within a range of values | A BETWEEN B AND C |
| NOT BETWEEN ... AND | The left operand is outside the specified range | A NOT BETWEEN B AND C |
| IN | The left operand is one of the items in a given list | A IN (B, C, D) |
| NOT IN | The left operand is not in the given list | A NOT IN (B, C, D) |

Here are a couple of examples:

SELECT id FROM product WHERE price BETWEEN 100 AND 200;
DELETE FROM product WHERE id IN (100, 101, 102);

Special attention should be paid to NULL values. Almost all the preceding operators return NULL if any of their operands is NULL. The reason is quite clear: as NULL represents an unknown value, any operation involving a NULL operand returns an unknown result. However, there are some operators specifically designed to work with NULL values. IS NULL and IS NOT NULL check whether the operand is NULL. The <=> operator is a shortcut for the following code:

a = b OR (a IS NULL AND b IS NULL)

String operators MariaDB supports certain comparison operators that are specifically designed to work with string values. This does not mean that other operators do not work well with strings. For example, A = B works perfectly if A and B are strings. However, some particular comparisons only make sense with text values. Let's take a look at them. The LIKE operator and its variants This operator is often used to check whether a string starts with a given sequence of characters, if it ends with that sequence, or if it contains the sequence. More generally, LIKE checks whether a string follows a given pattern.
Its syntax is: <string_value> LIKE <pattern> The pattern is a string that can contain the following wildcard characters: _ (underscore) means: This specifies any character %: This denotes any sequence of 0 or more characters There is also a way to include these characters without their special meaning: the _ and % sequences represent the a_ and a% characters respectively. For example, take a look at the following expressions: my_text LIKE 'h_' my_text LIKE 'h%' The first expression returns 1 for 'hi', 'ha', or 'ho', but not for 'hey'. The second expression returns 1 for all these strings, including 'hey'. By default, LIKE is case insensitive, meaning that 'abc' LIKE 'ABC' returns 1. Thus, it can be used to perform a case insensitive equality check. To make LIKE case sensitive, the following BINARY keyword can be used: my_text LIKE BINARY your_text The complement of LIKE is NOT LIKE, as shown in the following code: <string_value> NOT LIKE <pattern> Here are the most common uses for LIKE: my_text LIKE 'my%' -- does my_text start with 'my'? my_text LIKE '%my' -- does my_text end with 'my'? my_text LIKE '%my%' -- does my_text contain 'my'? More complex uses are possible for LIKE. For example, the following expression can be used to check whether mail is a valid e-mail address: mail LIKE '_%@_%.__%' The preceding code snippet checks whether mail contains at least one character, a '@' character, at least one character, a dot, at least two characters in this order. In most cases, an invalid e-mail address will not pass this test. Using regular expressions with the REGEXP operator and its variants Regular expressions are string patterns that contain a meta character with special meanings in order to perform match operations and determine whether a given string matches the given pattern or not. The REGEXP operator is somewhat similar to LIKE. It checks whether a string matches a given pattern. However, REGEXP uses regular expressions with the syntax defined by the POSIX standard. Basically, this means that: Many developers, but not all, already know their syntax REGEXP uses a very expressive syntax, so the patterns can be much more complex and detailed REGEXP is much slower than LIKE; this should be preferred when possible The regular expressions syntax is a complex topic, and it cannot be covered in this article. Developers can learn about regular expressions at www.regular-expressions.info. The complement of REGEXP is NOT REGEXP. Logical operators Logical operators can be used to combine truth expressions that form a compound expression that can be true, false, or NULL. Depending on the truth values of its operands, a logical operator can return 1 or 0. MariaDB supports the following logical operators: NOT; AND; OR; XOR The NOT operator NOT is the only logical operator that takes one operand. It inverts its truth value. If the operand is true, NOT returns 0, and if the operand is false, NOT returns 1. If the operand is NULL, NOT returns NULL. Here is an example: NOT 1 -- returns 0 NOT 0 -- returns 1 NOT 1 = 1 -- returns 0 NOT 1 = NULL -- returns NULL NOT 1 <=> NULL -- returns 0 The AND operator AND returns 1 if both its operands are true and 0 in all other cases. Here is an example: 1 AND 1 -- returns 1 0 AND 1 -- returns 0 0 AND 0 -- returns 0 The OR operator OR returns 1 if at least one of its operators is true or 0 if both the operators are false. Here is an example: 1 OR 1 -- returns 1 0 OR 1 -- returns 1 0 OR 0 -- returns 0 The XOR operator XOR stands for eXclusive OR. 
It is the least used logical operator. It returns 1 if exactly one of its operands is true, and 0 if both the operands are true or both are false. Take a look at the following example:

1 XOR 1  -- returns 0
1 XOR 0  -- returns 1
0 XOR 1  -- returns 1
0 XOR 0  -- returns 0

A XOR B is the equivalent of the following expression: (A OR B) AND NOT (A AND B) Or: (NOT A AND B) OR (A AND NOT B) Arithmetic operators MariaDB supports the operators that are necessary to execute all the basic arithmetic operations. The supported arithmetic operators are + for addition, - for subtraction, * for multiplication, and / for division. Depending on the MariaDB configuration, remember that a division by 0 raises an error or returns NULL. In addition, two more operators are useful for divisions: DIV returns the integer part of a division, without any decimal part or remainder; MOD (or %) returns the remainder of a division. Here is an example:

MariaDB [(none)]> SELECT 20 DIV 3 AS int_part, 20 MOD 3 AS modulus;
+----------+---------+
| int_part | modulus |
+----------+---------+
|        6 |       2 |
+----------+---------+
1 row in set (0.00 sec)

Operators precedence MariaDB does not blindly evaluate an expression from left to right. Every operator has a given precedence. An operator that is evaluated before another one is said to have a higher precedence. In general, arithmetic and string operators have a higher priority than logical operators. The precedence of arithmetic operators reflects their precedence in common mathematical expressions. It is very important to remember the precedence of logical operators (from the highest to the lowest): NOT, AND, XOR, OR. MariaDB supports many operators, and we did not discuss all of them. Also, the exact precedence can vary slightly depending on the MariaDB configuration. The complete precedence can be found in the MariaDB KnowledgeBase, at https://mariadb.com/kb/en/mariadb/documentation/functions-and-operators/operator-precedence/. Parentheses can be used to force MariaDB to follow a certain order. They are also useful when we do not remember the exact precedence of the operators that we will use, as shown in the following code:

(NOT (a AND b)) OR c OR d

Summary In this article, you learned the basics of transactions and operators. Resources for Article: Further resources on this subject: Set Up MariaDB [Article] Installing MariaDB on Windows and Mac OS X [Article] Building a Web Application with PHP and MariaDB – Introduction to caching [Article]
Securing Your Data

Packt
12 Oct 2015
6 min read
In this article by Tyson Cadenhead, author of Socket.IO Cookbook, we will explore several topics related to security in Socket.IO applications. These topics will cover the gambit, from authentication and validation to how to use the wss:// protocol for secure WebSockets. As the WebSocket protocol opens innumerable opportunities to communicate more directly between the client and the server, people often wonder if Socket.IO is actually as secure as something such as the HTTP protocol. The answer to this question is that it depends entirely on how you implement it. WebSockets can easily be locked down to prevent malicious or accidental security holes, but as with any API interface, your security is only as tight as your weakest link. In this article, we will cover the following topics: Locking down the HTTP referrer Using secure WebSockets (For more resources related to this topic, see here.) Locking down the HTTP referrer Socket.IO is really good at getting around cross-domain issues. You can easily include the Socket.IO script from a different domain on your page, and it will just work as you may expect it to. There are some instances where you may not want your Socket.IO events to be available on every other domain. Not to worry! We can easily whitelist only the http referrers that we want so that some domains will be allowed to connect and other domains won't. How To Do It… To lock down the HTTP referrer and only allow events to whitelisted domains, follow these steps: Create two different servers that can connect to our Socket.IO instance. We will let one server listen on port 5000 and the second server listen on port 5001: var express = require('express'), app = express(), http = require('http'), socketIO = require('socket.io'), server, server2, io; app.get('/', function (req, res) { res.sendFile(__dirname + '/index.html'); }); server = http.Server(app); server.listen(5000); server2 = http.Server(app); server2.listen(5001); io = socketIO(server); When the connection is established, check the referrer in the headers. If it is a referrer that we want to give access to, we can let our connection perform its tasks and build up events as normal. If a blacklisted referrer, such as the one on port 5001 that we created, attempts a connection, we can politely decline and perhaps throw an error message back to the client, as shown in the following code: io.on('connection', function (socket) { switch (socket.request.headers.referer) { case 'http://localhost:5000/': socket.emit('permission.message', 'Okay, you're cool.'); break; default: returnsocket.emit('permission.message', 'Who invited you to this party?'); break; } }); On the client side, we can listen to the response from the server and react as appropriate using the following code: socket.on('permission.message', function (data) { document.querySelector('h1').innerHTML = data; }); How It Works… The referrer is always available in the socket.request.headers object of every socket, so we will be able to inspect it there to check whether it was a trusted source. In our case, we will use a switch statement to whitelist our domain on port 5000, but we could really use any mechanism at our disposal to perform the task. For example, if we need to dynamically whitelist domains, we can store a list of them in our database and search for it when the connection is established. Using secure WebSockets WebSocket communications can either take place over the ws:// protocol or the wss:// protocol. 
In similar terms, they can be thought of as the HTTP and HTTPS protocols in the sense that one is secure and one isn't. Secure WebSockets are encrypted by the transport layer, so they are safer to use when you handle sensitive data. In this recipe, you will learn how to force our Socket.IO communications to happen over the wss:// protocol for an extra layer of encryption. Getting Ready… In this recipe, we will need to create a self-signing certificate so that we can serve our app locally over the HTTPS protocol. For this, we will need an npm package called Pem. This allows you to create a self-signed certificate that you can provide to your server. Of course, in a real production environment, we would want a true SSL certificate instead of a self-signed one. To install Pem, simply call npm install pem –save. As our certificate is self-signed, you will probably see something similar to the following screenshot when you navigate to your secure server: Just take a chance by clicking on the Proceed to localhost link. You'll see your application load using the HTTPS protocol. How To Do It… To use the secure wss:// protocol, follow these steps: First, create a secure server using the built-in node HTTPS package. We can create a self-signed certificate with the pem package so that we can serve our application over HTTPS instead of HTTP, as shown in the following code: var https = require('https'), pem = require('pem'), express = require('express'), app = express(), socketIO = require('socket.io'); // Create a self-signed certificate with pem pem.createCertificate({ days: 1, selfSigned: true }, function (err, keys) { app.get('/', function(req, res){ res.sendFile(__dirname + '/index.html'); }); // Create an https server with the certificate and key from pem var server = https.createServer({ key: keys.serviceKey, cert: keys.certificate }, app).listen(5000); vario = socketIO(server); io.on('connection', function (socket) { var protocol = 'ws://'; // Check the handshake to determine if it was secure or not if (socket.handshake.secure) { protocol = 'wss://'; } socket.emit('hello.client', { message: 'This is a message from the server. It was sent using the ' + protocol + ' protocol' }); }); }); In your client-side JavaScript, specify secure: true when you initialize your WebSocket as follows: var socket = io('//localhost:5000', { secure: true }); socket.on('hello.client', function (data) { console.log(data); }); Now, start your server and navigate to https://localhost:5000. Proceed to this page. You should see a message in your browser developer tools that shows, This is a message from the server. It was sent using the wss:// protocol. How It Works… The protocol of our WebSocket is actually set automatically based on the protocol of the page that it sits on. This means that a page that is served over the HTTP protocol will send the WebSocket communications over ws:// by default, and a page that is served by HTTPS will default to using the wss:// protocol. However, by setting the secure option to true, we told the WebSocket to always serve through wss:// no matter what. Summary In this article, we gave you an overview of the topics related to security in Socket.IO applications. Resources for Article: Further resources on this subject: Using Socket.IO and Express together[article] Adding Real-time Functionality Using Socket.io[article] Welcome to JavaScript in the full stack [article]
Basics of Jupyter Notebook and Python

Packt Editorial Staff
11 Oct 2015
28 min read
In this article by Cyrille Rossant, coming from his book, Learning IPython for Interactive Computing and Data Visualization - Second Edition, we will see how to use IPython console, Jupyter Notebook, and we will go through the basics of Python. Originally, IPython provided an enhanced command-line console to run Python code interactively. The Jupyter Notebook is a more recent and more sophisticated alternative to the console. Today, both tools are available, and we recommend that you learn to use both. [box type="note" align="alignleft" class="" width=""]The first chapter of the book, Chapter 1, Getting Started with IPython, contains all installation instructions. The main step is to download and install the free Anaconda distribution at https://www.continuum.io/downloads (the version of Python 3 64-bit for your operating system).[/box] Launching the IPython console To run the IPython console, type ipython in an OS terminal. There, you can write Python commands and see the results instantly. Here is a screenshot: IPython console The IPython console is most convenient when you have a command-line-based workflow and you want to execute some quick Python commands. You can exit the IPython console by typing exit. [box type="note" align="alignleft" class="" width=""]Let's mention the Qt console, which is similar to the IPython console but offers additional features such as multiline editing, enhanced tab completion, image support, and so on. The Qt console can also be integrated within a graphical application written with Python and Qt. See http://jupyter.org/qtconsole/stable/ for more information.[/box] Launching the Jupyter Notebook To run the Jupyter Notebook, open an OS terminal, go to ~/minibook/ (or into the directory where you've downloaded the book's notebooks), and type jupyter notebook. This will start the Jupyter server and open a new window in your browser (if that's not the case, go to the following URL: http://localhost:8888). Here is a screenshot of Jupyter's entry point, the Notebook dashboard: The Notebook dashboard [box type="note" align="alignleft" class="" width=""]At the time of writing, the following browsers are officially supported: Chrome 13 and greater; Safari 5 and greater; and Firefox 6 or greater. Other browsers may work also. Your mileage may vary.[/box] The Notebook is most convenient when you start a complex analysis project that will involve a substantial amount of interactive experimentation with your code. Other common use-cases include keeping track of your interactive session (like a lab notebook), or writing technical documents that involve code, equations, and figures. In the rest of this section, we will focus on the Notebook interface. [box type="note" align="alignleft" class="" width=""]Closing the Notebook server To close the Notebook server, go to the OS terminal where you launched the server from, and press Ctrl + C. You may need to confirm with y.[/box] The Notebook dashboard The dashboard contains several tabs which are as follows: Files: shows all files and notebooks in the current directory Running: shows all kernels currently running on your computer Clusters: lets you launch kernels for parallel computing A notebook is an interactive document containing code, text, and other elements. A notebook is saved in a file with the .ipynb extension. This file is a plain text file storing a JSON data structure. A kernel is a process running an interactive session. When using IPython, this kernel is a Python process. 
There are kernels in many languages other than Python. [box type="note" align="alignleft" class="" width=""]We follow the convention to use the term notebook for a file, and Notebook for the application and the web interface.[/box] In Jupyter, notebooks and kernels are strongly separated. A notebook is a file, whereas a kernel is a process. The kernel receives snippets of code from the Notebook interface, executes them, and sends the outputs and possible errors back to the Notebook interface. Thus, in general, the kernel has no notion of the Notebook. A notebook is persistent (it's a file), whereas a kernel may be closed at the end of an interactive session and it is therefore not persistent. When a notebook is re-opened, it needs to be re-executed. In general, no more than one Notebook interface can be connected to a given kernel. However, several IPython consoles can be connected to a given kernel. The Notebook user interface To create a new notebook, click on the New button, and select Notebook (Python 3). A new browser tab opens and shows the Notebook interface as follows: A new notebook Here are the main components of the interface, from top to bottom: The notebook name, which you can change by clicking on it. This is also the name of the .ipynb file. The Menu bar gives you access to several actions pertaining to either the notebook or the kernel. To the right of the menu bar is the Kernel name. You can change the kernel language of your notebook from the Kernel menu. The Toolbar contains icons for common actions. In particular, the dropdown menu showing Code lets you change the type of a cell. Following is the main component of the UI: the actual Notebook. It consists of a linear list of cells. We will detail the structure of a cell in the following sections. Structure of a notebook cell There are two main types of cells: Markdown cells and code cells, and they are described as follows: A Markdown cell contains rich text. In addition to classic formatting options like bold or italics, we can add links, images, HTML elements, LaTeX mathematical equations, and more. A code cell contains code to be executed by the kernel. The programming language corresponds to the kernel's language. We will only use Python in this book, but you can use many other languages. You can change the type of a cell by first clicking on a cell to select it, and then choosing the cell's type in the toolbar's dropdown menu showing Markdown or Code. Markdown cells Here is a screenshot of a Markdown cell: A Markdown cell The top panel shows the cell in edit mode, while the bottom one shows it in render mode. The edit mode lets you edit the text, while the render mode lets you display the rendered cell. We will explain the differences between these modes in greater detail in the following section. Code cells Here is a screenshot of a complex code cell: Structure of a code cell This code cell contains several parts, as follows: The Prompt number shows the cell's number. This number increases every time you run the cell. Since you can run cells of a notebook out of order, nothing guarantees that code numbers are linearly increasing in a given notebook. The Input area contains a multiline text editor that lets you write one or several lines of code with syntax highlighting. The Widget area may contain graphical controls; here, it displays a slider. 
The Output area can contain multiple outputs, here: Standard output (text in black) Error output (text with a red background) Rich output (an HTML table and an image here) The Notebook modal interface The Notebook implements a modal interface similar to some text editors such as vim. Mastering this interface may represent a small learning curve for some users. Use the edit mode to write code (the selected cell has a green border, and a pen icon appears at the top right of the interface). Click inside a cell to enable the edit mode for this cell (you need to double-click with Markdown cells). Use the command mode to operate on cells (the selected cell has a gray border, and there is no pen icon). Click outside the text area of a cell to enable the command mode (you can also press the Esc key). Keyboard shortcuts are available in the Notebook interface. Type h to show them. We review here the most common ones (for Windows and Linux; shortcuts for Mac OS X may be slightly different). Keyboard shortcuts available in both modes Here are a few keyboard shortcuts that are always available when a cell is selected: Ctrl + Enter: run the cell Shift + Enter: run the cell and select the cell below Alt + Enter: run the cell and insert a new cell below Ctrl + S: save the notebook Keyboard shortcuts available in the edit mode In the edit mode, you can type code as usual, and you have access to the following keyboard shortcuts: Esc: switch to command mode Ctrl + Shift + -: split the cell Keyboard shortcuts available in the command mode In the command mode, keystrokes are bound to cell operations. Don't write code in command mode or unexpected things will happen! For example, typing dd in command mode will delete the selected cell! Here are some keyboard shortcuts available in command mode: Enter: switch to edit mode Up or k: select the previous cell Down or j: select the next cell y / m: change the cell type to code cell/Markdown cell a / b: insert a new cell above/below the current cell x / c / v: cut/copy/paste the current cell dd: delete the current cell z: undo the last delete operation Shift + =: merge the cell below h: display the help menu with the list of keyboard shortcuts Spending some time learning these shortcuts is highly recommended. References Here are a few references: Main documentation of Jupyter at http://jupyter.readthedocs.org/en/latest/ Jupyter Notebook interface explained at http://jupyter-notebook.readthedocs.org/en/latest/notebook.html A crash course on Python If you don't know Python, read this section to learn the fundamentals. Python is a very accessible language and is even taught to school children. If you have ever programmed, it will only take you a few minutes to learn the basics. Hello world Open a new notebook and type the following in the first cell: In [1]: print("Hello world!") Out[1]: Hello world! Here is a screenshot: "Hello world" in the Notebook [box type="note" align="alignleft" class="" width=""]Prompt string Note that the convention chosen in this article is to show Python code (also called the input) prefixed with In [x]: (which shouldn't be typed). This is the standard IPython prompt. Here, you should just type print("Hello world!") and then press Shift + Enter.[/box] Congratulations! You are now a Python programmer. Variables Let's use Python as a calculator. In [2]: 2 * 2 Out[2]: 4 Here, 2 * 2 is an expression statement. This operation is performed, the result is returned, and IPython displays it in the notebook cell's output. 
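One handy IPython detail worth a quick aside here: the underscore variable _ holds the last displayed output, so you can keep computing from the previous result without naming it:

In [ ]: 2 * 2
Out[ ]: 4

In [ ]: _ + 1   # _ refers to the previous output, 4
Out[ ]: 5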
[box type="note" align="alignleft" class="" width=""]Division In Python 3, 3 / 2 returns 1.5 (floating-point division), whereas it returns 1 in Python 2 (integer division). This can be source of errors when porting Python 2 code to Python 3. It is recommended to always use the explicit 3.0 / 2.0 for floating-point division (by using floating-point numbers) and 3 // 2 for integer division. Both syntaxes work in Python 2 and Python 3. See http://python3porting.com/differences.html#integer-division for more details.[/box] Other built-in mathematical operators include +, -, ** for the exponentiation, and others. You will find more details at https://docs.python.org/3/reference/expressions.html#the-power-operator. Variables form a fundamental concept of any programming language. A variable has a name and a value. Here is how to create a new variable in Python: In [3]: a = 2 And here is how to use an existing variable: In [4]: a * 3 Out[4]: 6 Several variables can be defined at once (this is called unpacking): In [5]: a, b = 2, 6 There are different types of variables. Here, we have used a number (more precisely, an integer). Other important types include floating-point numbers to represent real numbers, strings to represent text, and booleans to represent True/False values. Here are a few examples: In [6]: somefloat = 3.1415 sometext = 'pi is about' # You can also use double quotes. print(sometext, somefloat) # Display several variables. Out[6]: pi is about 3.1415 Note how we used the # character to write comments. Whereas Python discards the comments completely, adding comments in the code is important when the code is to be read by other humans (including yourself in the future). String escaping String escaping refers to the ability to insert special characters in a string. For example, how can you insert ' and ", given that these characters are used to delimit a string in Python code? The backslash is the go-to escape character in Python (and in many other languages too). Here are a few examples: In [7]: print("Hello "world"") print("A list:n* item 1n* item 2") print("C:pathonwindows") print(r"C:pathonwindows") Out[7]: Hello "world" A list: * item 1 * item 2 C:pathonwindows C:pathonwindows The special character n is the new line (or line feed) character. To insert a backslash, you need to escape it, which explains why it needs to be doubled as . You can also disable escaping by using raw literals with a r prefix before the string, like in the last example above. In this case, backslashes are considered as normal characters. This is convenient when writing Windows paths, since Windows uses backslash separators instead of forward slashes like on Unix systems. A very common error on Windows is forgetting to escape backslashes in paths: writing "C:path" may lead to subtle errors. You will find the list of special characters in Python at https://docs.python.org/3.4/reference/lexical_analysis.html#string-and-bytes-literals. Lists A list contains a sequence of items. You can concisely instruct Python to perform repeated actions on the elements of a list. Let's first create a list of numbers as follows: In [8]: items = [1, 3, 0, 4, 1] Note the syntax we used to create the list: square brackets [], and commas , to separate the items. 
The built-in function len() returns the number of elements in a list: In [9]: len(items) Out[9]: 5 [box type="note" align="alignleft" class="" width=""]Python comes with a set of built-in functions, including print(), len(), max(), functional routines like filter() and map(), and container-related routines like all(), any(), range(), and sorted(). You will find the full list of built-in functions at https://docs.python.org/3.4/library/functions.html.[/box] Now, let's compute the sum of all elements in the list. Python provides a built-in function for this: In [10]: sum(items) Out[10]: 9 We can also access individual elements in the list, using the following syntax: In [11]: items[0] Out[11]: 1 In [12]: items[-1] Out[12]: 1 Note that indexing starts at 0 in Python: the first element of the list is indexed by 0, the second by 1, and so on. Also, -1 refers to the last element, -2, to the penultimate element, and so on. The same syntax can be used to alter elements in the list: In [13]: items[1] = 9 items Out[13]: [1, 9, 0, 4, 1] We can access sublists with the following syntax: In [14]: items[1:3] Out[14]: [9, 0] Here, 1:3 represents a slice going from element 1 included (this is the second element of the list) to element 3 excluded. Thus, we get a sublist with the second and third element of the original list. The first-included/last-excluded asymmetry leads to an intuitive treatment of overlaps between consecutive slices. Also, note that a sublist refers to a dynamic view of the original list, not a copy; changing elements in the sublist automatically changes them in the original list. Python provides several other types of containers: Tuples are immutable and contain a fixed number of elements: In [15]: my_tuple = (1, 2, 3) my_tuple[1] Out[15]: 2 Dictionaries contain key-value pairs. They are extremely useful and common: In [16]: my_dict = {'a': 1, 'b': 2, 'c': 3} print('a:', my_dict['a']) Out[16]: a: 1 In [17]: print(my_dict.keys()) Out[17]: dict_keys(['c', 'a', 'b']) There is no notion of order in a dictionary. However, the native collections module provides an OrderedDict structure that keeps the insertion order (see https://docs.python.org/3.4/library/collections.html). Sets, like mathematical sets, contain distinct elements: In [18]: my_set = set([1, 2, 3, 2, 1]) my_set Out[18]: {1, 2, 3} A Python object is mutable if its value can change after it has been created. Otherwise, it is immutable. For example, a string is immutable; to change it, a new string needs to be created. A list, a dictionary, or a set is mutable; elements can be added or removed. By contrast, a tuple is immutable, and it is not possible to change the elements it contains without recreating the tuple. See https://docs.python.org/3.4/reference/datamodel.html for more details. Loops We can run through all elements of a list using a for loop: In [19]: for item in items: print(item) Out[19]: 1 9 0 4 1 There are several things to note here: The for item in items syntax means that a temporary variable named item is created at every iteration. This variable contains the value of every item in the list, one at a time. Note the colon : at the end of the for statement. Forgetting it will lead to a syntax error! The statement print(item) will be executed for all items in the list. Note the four spaces before print: this is called the indentation. You will find more details about indentation in the next subsection. 
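As an extra sketch (illustrative, not from the original text), the same for loop pattern can accumulate a result instead of just printing each element; this is the hand-written equivalent of the built-in sum() function we used earlier:

In [ ]: total = 0
        for item in items:
            total = total + item
        total
Out[ ]: 15

The result is 15 because items is currently [1, 9, 0, 4, 1] after the modifications made above.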
Python supports a concise syntax to perform a given operation on all elements of a list, as follows:

In [20]: squares = [item * item for item in items]
squares
Out[20]: [1, 81, 0, 16, 1]

This is called a list comprehension. A new list is created here; it contains the squares of all numbers in the list. This concise syntax leads to highly readable and Pythonic code.

Indentation
Indentation refers to the spaces that may appear at the beginning of some lines of code. This is a particular aspect of Python's syntax. In most programming languages, indentation is optional and is generally used to make the code visually clearer. But in Python, indentation also has a syntactic meaning. Particular indentation rules need to be followed for Python code to be correct. In general, there are two ways to indent some text: by inserting a tab character (also referred to as \t), or by inserting a number of spaces (typically, four). It is recommended to use spaces instead of tab characters. Your text editor should be configured such that the Tab key on the keyboard inserts four spaces instead of a tab character. In the Notebook, indentation is automatically configured properly; so you shouldn't worry about this issue. The question only arises if you use another text editor for your Python code. Finally, what is the meaning of indentation? In Python, indentation delimits coherent blocks of code, for example, the contents of a loop, a conditional branch, a function, and other objects. Where other languages such as C or JavaScript use curly braces to delimit such blocks, Python uses indentation.

Conditional branches
Sometimes, you need to perform different operations on your data depending on some condition. For example, let's display all even numbers in our list:

In [21]: for item in items:
    if item % 2 == 0:
        print(item)
Out[21]: 0
4

Again, here are several things to note:

An if statement is followed by a boolean expression.
If a and b are two integers, the modulo operator a % b returns the remainder from the division of a by b. Here, item % 2 is 0 for even numbers, and 1 for odd numbers.
The equality is represented by a double equal sign == to avoid confusion with the assignment operator = that we use when we create variables.
Like with the for loop, the if statement ends with a colon :.
The part of the code that is executed when the condition is satisfied follows the if statement. It is indented. Indentation is cumulative: since this if is inside a for loop, there are eight spaces before the print(item) statement.

Python supports a concise syntax to select all elements in a list that satisfy certain properties. Here is how to create a sublist with only even numbers:

In [22]: even = [item for item in items if item % 2 == 0]
even
Out[22]: [0, 4]

This is also a form of list comprehension.

Functions
Code is typically organized into functions. A function encapsulates part of your code. Functions allow you to reuse bits of functionality without copy-pasting the code. Here is a function that tells whether an integer number is even or not:

In [23]: def is_even(number):
    """Return whether an integer is even or not."""
    return number % 2 == 0

There are several things to note here:

A function is defined with the def keyword.
After def comes the function name. A general convention in Python is to only use lowercase characters, and separate words with an underscore _. A function name generally starts with a verb.
The function name is followed by parentheses, with one or several variable names called the arguments.
These are the inputs of the function. There is a single argument here, named number. No type is specified for the argument. This is because Python is dynamically typed; you could pass a variable of any type. This function would work fine with floating point numbers, for example (the modulo operation works with floating point numbers in addition to integers). The body of the function is indented (and note the colon : at the end of the def statement). There is a docstring wrapped by triple quotes """. This is a particular form of comment that explains what the function does. It is not mandatory, but it is strongly recommended to write docstrings for the functions exposed to the user. The return keyword in the body of the function specifies the output of the function. Here, the output is a Boolean, obtained from the expression number % 2 == 0. It is possible to return several values; just use a comma to separate them (in this case, a tuple of Booleans would be returned). Once a function is defined, it can be called like this: In [24]: is_even(3) Out[24]: False In [25]: is_even(4) Out[25]: True Here, 3 and 4 are successively passed as arguments to the function. Positional and keyword arguments A Python function can accept an arbitrary number of arguments, called positional arguments. It can also accept optional named arguments, called keyword arguments. Here is an example: In [26]: def remainder(number, divisor=2): return number % divisor The second argument of this function, divisor, is optional. If it is not provided by the caller, it will default to the number 2, as shown here: In [27]: remainder(5) Out[27]: 1 There are two equivalent ways of specifying a keyword argument when calling a function. They are as follows: In [28]: remainder(5, 3) Out[28]: 2 In [29]: remainder(5, divisor=3) Out[29]: 2 In the first case, 3 is understood as the second argument, divisor. In the second case, the name of the argument is given explicitly by the caller. This second syntax is clearer and less error-prone than the first one. Functions can also accept arbitrary sets of positional and keyword arguments, using the following syntax: In [30]: def f(*args, **kwargs): print("Positional arguments:", args) print("Keyword arguments:", kwargs) In [31]: f(1, 2, c=3, d=4) Out[31]: Positional arguments: (1, 2) Keyword arguments: {'c': 3, 'd': 4} Inside the function, args is a tuple containing positional arguments, and kwargs is a dictionary containing keyword arguments. Passage by assignment When passing a parameter to a Python function, a reference to the object is actually passed (passage by assignment): If the passed object is mutable, it can be modified by the function If the passed object is immutable, it cannot be modified by the function Here is an example: In [32]: my_list = [1, 2] def add(some_list, value): some_list.append(value) add(my_list, 3) my_list Out[32]: [1, 2, 3] The add() function modifies an object defined outside it (in this case, the object my_list); we say this function has side-effects. A function with no side-effects is called a pure function: it doesn't modify anything in the outer context, and it deterministically returns the same result for any given set of inputs. Pure functions are to be preferred over functions with side-effects. Knowing this can help you spot out subtle bugs. There are further related concepts that are useful to know, including function scopes, naming, binding, and more. 
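Before the reference links below, here is a small sketch that makes the side-effect discussion above concrete. It is illustrative only (the function names are not from the text); it contrasts a function that mutates its argument with a pure function that returns a new list:

def add_with_side_effect(some_list, value):
    some_list.append(value)  # mutates the caller's list in place

def add_pure(some_list, value):
    return some_list + [value]  # builds and returns a new list

my_list = [1, 2]
add_with_side_effect(my_list, 3)
print(my_list)               # [1, 2, 3]: the original was modified
print(add_pure(my_list, 4))  # [1, 2, 3, 4]: a new list is returned
print(my_list)               # still [1, 2, 3]: the original is untouched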
Here are a couple of links: Passage by reference at https://docs.python.org/3/faq/programming.html#how-do-i-write-a-function-with-output-parameters-call-by-reference Naming, binding, and scope at https://docs.python.org/3.4/reference/executionmodel.html Errors Let's discuss errors in Python. As you learn, you will inevitably come across errors and exceptions. The Python interpreter will most of the time tell you what the problem is, and where it occurred. It is important to understand the vocabulary used by Python so that you can more quickly find and correct your errors. Let's see the following example: In [33]: def divide(a, b): return a / b In [34]: divide(1, 0) Out[34]: --------------------------------------------------------- ZeroDivisionError Traceback (most recent call last) <ipython-input-2-b77ebb6ac6f6> in <module>() ----> 1 divide(1, 0) <ipython-input-1-5c74f9fd7706> in divide(a, b) 1 def divide(a, b): ----> 2 return a / b ZeroDivisionError: division by zero Here, we defined a divide() function, and called it to divide 1 by 0. Dividing a number by 0 is an error in Python. Here, a ZeroDivisionError exception was raised. An exception is a particular type of error that can be raised at any point in a program. It is propagated from the innards of the code up to the command that launched the code. It can be caught and processed at any point. You will find more details about exceptions at https://docs.python.org/3/tutorial/errors.html, and common exception types at https://docs.python.org/3/library/exceptions.html#bltin-exceptions. The error message you see contains the stack trace, the exception type, and the exception message. The stack trace shows all function calls between the raised exception and the script calling point. The top frame, indicated by the first arrow ---->, shows the entry point of the code execution. Here, it is divide(1, 0), which was called directly in the Notebook. The error occurred while this function was called. The next and last frame is indicated by the second arrow. It corresponds to line 2 in our function divide(a, b). It is the last frame in the stack trace: this means that the error occurred there. Object-oriented programming Object-oriented programming (OOP) is a relatively advanced topic. Although we won't use it much in this book, it is useful to know the basics. Also, mastering OOP is often essential when you start to have a large code base. In Python, everything is an object. A number, a string, or a function is an object. An object is an instance of a type (also known as class). An object has attributes and methods, as specified by its type. An attribute is a variable bound to an object, giving some information about it. A method is a function that applies to the object. For example, the object 'hello' is an instance of the built-in str type (string). The type() function returns the type of an object, as shown here: In [35]: type('hello') Out[35]: str There are native types, like str or int (integer), and custom types, also called classes, that can be created by the user. In IPython, you can discover the attributes and methods of any object with the dot syntax and tab completion. For example, typing 'hello'.u and pressing Tab automatically shows us the existence of the upper() method: In [36]: 'hello'.upper() Out[36]: 'HELLO' Here, upper() is a method available to all str objects; it returns an uppercase copy of a string. A useful string method is format(). 
This simple and convenient templating system lets you generate strings dynamically, as shown in the following example: In [37]: 'Hello {0:s}!'.format('Python') Out[37]: Hello Python The {0:s} syntax means "replace this with the first argument of format(), which should be a string". The variable type after the colon is especially useful for numbers, where you can specify how to display the number (for example, .3f to display three decimals). The 0 makes it possible to replace a given value several times in a given string. You can also use a name instead of a position—for example 'Hello {name}!'.format(name='Python'). Some methods are prefixed with an underscore _; they are private and are generally not meant to be used directly. IPython's tab completion won't show you these private attributes and methods unless you explicitly type _ before pressing Tab. In practice, the most important thing to remember is that appending a dot . to any Python object and pressing Tab in IPython will show you a lot of functionality pertaining to that object. Functional programming Python is a multi-paradigm language; it notably supports imperative, object-oriented, and functional programming models. Python functions are objects and can be handled like other objects. In particular, they can be passed as arguments to other functions (also called higher-order functions). This is the essence of functional programming. Decorators provide a convenient syntax construct to define higher-order functions. Here is an example using the is_even() function from the previous Functions section: In [38]: def show_output(func): def wrapped(*args, **kwargs): output = func(*args, **kwargs) print("The result is:", output) return wrapped The show_output() function transforms an arbitrary function func() to a new function, named wrapped(), that displays the result of the function, as follows: In [39]: f = show_output(is_even) f(3) Out[39]: The result is: False Equivalently, this higher-order function can also be used with a decorator, as follows: In [40]: @show_output def square(x): return x * x In [41]: square(3) Out[41]: The result is: 9 You can find more information about Python decorators at https://en.wikipedia.org/wiki/Python_syntax_and_semantics#Decorators and at http://www.thecodeship.com/patterns/guide-to-python-function-decorators/. Python 2 and 3 Let's finish this section with a few notes about Python 2 and Python 3 compatibility issues. There are still some Python 2 code and libraries that are not compatible with Python 3. Therefore, it is sometimes useful to be aware of the differences between the two versions. One of the most obvious differences is that print is a statement in Python 2, whereas it is a function in Python 3. Therefore, print "Hello" (without parentheses) works in Python 2 but not in Python 3, while print("Hello") works in both Python 2 and Python 3. 
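One standard way to smooth over these differences is to enable the Python 3 behavior explicitly in Python 2 modules through the __future__ mechanism mentioned below. Here is a minimal sketch; whether you need it depends on which versions you target:

from __future__ import print_function, division

print("Hello")  # print() is a function in both versions with this import
print(3 / 2)    # 1.5 in both Python 2 and Python 3
print(3 // 2)   # 1 in both versions (integer division)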
There are several non-mutually exclusive options to write portable code that works with both versions:

__future__: A built-in module supporting backward-incompatible Python syntax
2to3: A built-in Python module to port Python 2 code to Python 3
six: An external lightweight library for writing compatible code

Here are a few references:

Official Python 2/3 wiki page at https://wiki.python.org/moin/Python2orPython3
The Porting to Python 3 book, by CreateSpace Independent Publishing Platform at http://www.python3porting.com/bookindex.html
2to3 at https://docs.python.org/3.4/library/2to3.html
six at https://pythonhosted.org/six/
__future__ at https://docs.python.org/3.4/library/__future__.html

The IPython Cookbook contains an in-depth recipe about choosing between Python 2 and 3, and how to support both.

Going beyond the basics
You now know the fundamentals of Python, the bare minimum that you will need in this book. As you can imagine, there is much more to say about Python. Following are a few further basic concepts that are often useful and that we cannot cover here, unfortunately. You are highly encouraged to have a look at them in the references given at the end of this section:

range and enumerate
pass, break, and continue, to be used in loops
Working with files
Creating and importing modules
The Python standard library provides a wide range of functionality (OS, network, file systems, compression, mathematics, and more)

Here are some slightly more advanced concepts that you might find useful if you want to strengthen your Python skills:

Regular expressions for advanced string processing
Lambda functions for defining small anonymous functions
Generators for controlling custom loops
Exceptions for handling errors
with statements for safely handling contexts
Advanced object-oriented programming
Metaprogramming for modifying Python code dynamically
The pickle module for persisting Python objects on disk and exchanging them across a network

Finally, here are a few references:

Getting started with Python: https://www.python.org/about/gettingstarted/
A Python tutorial: https://docs.python.org/3/tutorial/index.html
The Python Standard Library: https://docs.python.org/3/library/index.html
Interactive tutorial: http://www.learnpython.org/
Codecademy Python course: http://www.codecademy.com/tracks/python
Language reference (expert level): https://docs.python.org/3/reference/index.html
Python Cookbook, by David Beazley and Brian K. Jones, O'Reilly Media (advanced level, highly recommended if you want to become a Python expert)

Summary
In this article, we have seen how to launch the IPython console and Jupyter Notebook, the different aspects of the Notebook and its user interface, the structure of the notebook cell, keyboard shortcuts that are available in the Notebook interface, and the basics of Python.

First Principle and a Useful Way to Think

Packt
08 Oct 2015
8 min read
In this article, by Timothy Washington, author of the book Clojure for Finance, we will cover the following topics: Modeling the stock price activity Function evaluation First-Class functions Lazy evaluation Basic Clojure functions and immutability Namespace modifications and creating our first function (For more resources related to this topic, see here.) Modeling the stock price activity There are many types of banks. Commercial entities (large stores, parking areas, hotels, and so on) that collect and retain credit card information, are either quasi banks, or farm out credit operations to bank-like processing companies. There are more well-known consumer banks, which accept demand deposits from the public. There are also a range of other banks such as commercial banks, insurance companies and trusts, credit unions, and in our case, investment banks. As promised, this article will slowly build up a set of lagging price indicators that follow a moving stock price time series. In order to do that, I think it's useful to touch on stock markets, and to crudely model stock price activity. A stock (or equity) market, is a collection of buyers and sellers trading economic assets (usually companies). The stock (or shares) of those companies can be equities listed on an exchange (New York Stock Exchange, London Stock Exchange, and others), or may be those traded privately. In this exercise, we will do the following: Crudely model the stock price movement, which will give us a test bed for writing our lagging price indicators Introduce some basic features of the Clojure language Function evaluation The Clojure website has a cheatsheet (http://clojure.org/cheatsheet) with all of the language's core functions. The first function we'll look at is rand, a function that randomly gives you a number within a given range. So in your edgar/ project, launch a repl with the lein repl shell command. After a few seconds, you will enter repl (Read-Eval-Print-Loop). Again, Clojure functions are executed by being placed in the first position of a list. The function's arguments are placed directly afterwards. In your repl, evaluate (rand 35) or (rand 99) or (rand 43.21) or any number you fancy Run it many times to see that you can get any different floating point number, within 0 and the upper bound of the number you provided First-Class functions The next functions we'll look at are repeatedly and fn. repeatedly is a function that takes another function and returns an infinite (or length n if supplied) lazy sequence of calls to the latter function. This is our first encounter of a function that can take another function. We'll also encounter functions that return other functions. Described as First-Class functions, this falls out of lambda calculus and is one of the central features of functional programming. As such, we need to wrap our previous (rand 35) call in another function. fn is one of Clojure's core functions, and produces an anonymous, unnamed function. We can now supply this function to repeatedly. In your repl, if you evaluate (take 25 (repeatedly (fn [] (rand 35)))), you should see a long list of floating point numbers with the list's tail elided. Lazy evaluation We only took the first 25 of the (repeatedly (fn [] (rand 35))) result list, because the list (actually a lazy sequence) is infinite. Lazy evaluation (or laziness) is a common feature in functional programming languages. 
Being infinite, Clojure chooses to delay evaluating most of the list until it's needed by some other function that pulls out some values. Laziness benefits us by increasing performance and letting us more easily construct control flow. We can avoid needless calculation, repeated evaluations, and potential error conditions in compound expressions. Let's try to pull out some values with the take function. take itself returns another lazy sequence of the first n items of a collection. Evaluating (take 25 (repeatedly (fn [] (rand 35)))) will pull out the first 25 repeatedly calls to rand, which generates a float between 0 and 35.

Basic Clojure functions and immutability
There are many operations we can perform over our result list (or lazy sequence). One of the main approaches of functional programming is to take a data structure and perform operations on top of it to produce a new data structure, or some atomic result (a string, number, and so on). This may sound inefficient at first. But most FP languages employ something called immutability to make these operations efficient. Immutable data structures are the ones that cannot change once they've been created. This is feasible as most immutable FP languages use some kind of structural data sharing between an original and a modified version. The idea is that if we evaluate (conj [1 2 3] 4), the resulting [1 2 3 4] vector shares the original vector of [1 2 3]. The only additional resource that's assigned is for any novelty that's been introduced to the data structure (the 4). There's a more detailed explanation of (for example) Clojure's persistent vectors available online. Some of the core functions we'll use are:

conj: This conjoins an element to a collection—the collection decides where. So conjoining an element to a vector (conj [1 2 3] 4) versus conjoining an element to a list (conj '(1 2 3) 4) yield different results. Try it in your repl.
map: This passes a function over one or many lists, yielding another list. (map inc [1 2 3]) increments each element by 1.
reduce (or left fold): This passes a function over each element, accumulating one result. (reduce + (take 100 (repeatedly (fn [] (rand 35))))) sums the list.
filter: This constrains the input by some condition.
>=: This is a conditional function, which tests whether the first argument is greater than or equal to the second argument. Try (>= 4 9) and (>= 9 1).
fn: This is a function that creates a function. This unnamed or anonymous function can have any instructions you choose to put in there. So if we only want numbers above 12, we can put that assertion in a predicate function. Try entering the below expression into your repl:

(take 25 (filter (fn [x] (>= x 12)) (repeatedly (fn [] (rand 35)))))

Modifying the namespaces and creating our first function
We now have the basis for creating a function. It will return a lazy infinite sequence of floating point numbers, within an upper and lower bound. defn is a Clojure function, which takes an anonymous function, and binds a name to it in a given namespace. A Clojure namespace is an organizational tool for mapping human-readable names to things like functions, named data structures and such. Here, we're going to bind our function to the name generate-prices in our current namespace. You'll notice that our function is starting to span multiple lines. This will be a good time to author the code in your text editor of choice. I'll be using Emacs: Open your text editor, and add this code to the file called src/edgar/core.clj. Make sure that (ns edgar.core) is at the top of that file.
After adding the following code, you can then restart the repl. (load "edgar/core") uses the load function to load the Clojure code in your src/edgar/core.clj file:

(defn generate-prices [lower-bound upper-bound]
  (filter (fn [x] (>= x lower-bound))
          (repeatedly (fn [] (rand upper-bound)))))

The Read-Eval-Print-Loop
In our repl, we can pull in code in various namespaces, with the require function. This applies to the src/edgar/core.clj file we've just edited. That code will be in the edgar.core namespace: In your repl, evaluate (require '[edgar.core :as c]). c is just a handy alias we can use instead of the long name. You can then generate random prices within an upper and lower bound. Take the first 10 of them like this (take 10 (c/generate-prices 12 35)). You should see results akin to the following output. All elements should be within the range of 12 to 35:

(29.60706184716407 12.507593971664075 19.79939384292759 31.322074615579716 19.737852534147326 25.134649707849572 19.952195022152488 12.94569843904663 23.618693004455086 14.695872710062428)

There's a subtle abstraction in the preceding code that deserves attention. (require '[edgar.core :as c]) introduces the quote symbol. ' is the reader shorthand for the quote function. So the equivalent invocation would be (require (quote [edgar.core :as c])). Quoting a form tells the Clojure reader not to evaluate the subsequent expression (or form). So evaluating '(a b c) returns a list of three symbols, without trying to evaluate a. Even though those symbols haven't yet been assigned, that's okay, because that expression (or form) has not yet been evaluated. But that begs a larger question. What is the reader? Clojure (and all Lisps) are what's known as homoiconic. This means that Clojure code is also data. And data can be directly output and evaluated as code. The reader is the thing that parses our src/edgar/core.clj file (or (+ 1 1) input from the repl prompt), and produces the data structures that are evaluated. read and eval are the two essential processes by which Clojure code runs. The evaluation result is printed (or output), usually to the standard output device. Then we loop the process back to the read function. So, when the repl reads your src/edgar/core.clj file, it's directly transforming that text representation into data and evaluating it. A few things fall out of that. For example, it becomes simpler for Clojure programs to directly read, transform and write out other Clojure programs. The implications of that will become clearer when we look at macros. But for now, know that there are ways to modify or delay the evaluation process, in this case by quoting a form.

Summary
In this article, we learned about basic features of the Clojure language and how to model the stock price activity. Besides these, we also learned about function evaluation, First-Class functions, the lazy evaluation method, namespace modifications, and creating our first function.

Resources for Article:
Further resources on this subject:
Performance by Design [article]
Big Data [article]
The Observer Pattern [article]

Integrating Elasticsearch with the Hadoop ecosystem

Packt
07 Oct 2015
14 min read
In this article by Vishal Shukla, author of the book Elasticsearch for Hadoop, we will take a look at how ES-Hadoop can integrate with Pig and Spark with ease. Elasticsearch is great at getting insights into the indexed data. The Hadoop ecosystem does a great job in making Hadoop easily usable for different users by providing a comfortable interface. Some of the examples are Hive and Pig. Apart from these, Hadoop integrates well with other computing engines and platforms, such as Spark and Cascading. (For more resources related to this topic, see here.)

Pigging out Elasticsearch
For many use cases, Pig is one of the easiest ways to fiddle around with the data in the Hadoop ecosystem. Pig wins when it comes to ease of use and simple syntax for designing data flow pipelines without getting into complex programming. Assuming that you know Pig, we will cover how to move the data to and from Elasticsearch. If you don't know Pig yet, never mind. You can still carry on with the steps, and by the end of the article, you will at least know how to use Pig to perform data ingestion and reading with Elasticsearch.

Setting up Apache Pig for Elasticsearch
Let's start by setting up Apache Pig. At the time of writing this article, the latest Pig version available is 0.15.0. You can use the following steps to set up the same version:

First, download the Pig distribution using the following command:
$ sudo wget -O /usr/local/pig.tar.gz http://mirrors.sonic.net/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz

Then, extract Pig to the desired location and rename it to a convenient name:
$ cd /usr/local
$ sudo tar -xvf pig.tar.gz
$ sudo mv pig-0.15.0 pig

Now, export the required environment variables by appending the following two lines in the /home/eshadoop/.bashrc file:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin

You can either log out and relogin to see the newly set environment variables or source the environment configuration with the following command:
$ source ~/.bashrc

Now, start the job history server daemon with the following command:
$ mr-jobhistory-daemon.sh start historyserver

You should see the Pig console with the following command:
$ pig
grunt>

It's easy to forget to start the job history daemon once you restart your machine or VM. You may make this daemon run on start up, or you need to ensure this manually.

Now, we have Pig up and running. In order to use Pig with Elasticsearch, we must ensure that the ES-Hadoop JAR file is available in the Pig classpath. Let's take the ES-Hadoop JAR file and import it to HDFS using the following steps:

First, download the ES-Hadoop JAR used to develop the examples in this article, as shown in the following command:
$ wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/2.1.1/elasticsearch-hadoop-2.1.1.jar

Then, create a convenient local directory for the JAR as follows:
$ sudo mkdir /opt/lib

Now, import the JAR to HDFS:
$ hadoop fs -mkdir /lib
$ hadoop fs -put elasticsearch-hadoop-2.1.1.jar /lib/elasticsearch-hadoop-2.1.1.jar

Throughout this article, we will use a crime dataset that is tailored from the open dataset provided at https://data.cityofchicago.org/. This tailored dataset can be downloaded from http://www.packtpub.com/support, where all the code files required for this article are available. Once you have downloaded the dataset, import it to HDFS at /ch07/crime_data.csv.

Importing data to Elasticsearch
Let's import the crime dataset to Elasticsearch using Pig with ES-Hadoop.
This provides the EsStorage class as Pig Storage. In order to use the EsStorage class, you need to have a registered ES-Hadoop JAR with Pig. You can register the JAR located in the local filesystem, HDFS, or other shared filesystems. The REGISTER command registers a JAR file that contains UDFs (User-defined functions) with Pig, as shown in the following code: grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar; Then, load the CSV data file as a relation with the following code: grunt> SOURCE = load '/ch07/crimes_dataset.csv' using PigStorage(',') as (id:chararray, caseNumber:chararray, date:datetime, block:chararray, iucr:chararray, primaryType:chararray, description:chararray, location:chararray, arrest:boolean, domestic:boolean, lat:double,lon:double); This command reads the CSV fields and maps each token in the data to the respective field in the preceding command. The resulting relation, SOURCE, represents a relation with the Bag data structure that contains multiple Tuples. Now, generate the target Pig relation that has the structure that matches closely to the target Elasticsearch index mapping, as shown in the following code: grunt> TARGET = foreach SOURCE generate id, caseNumber, date, block, iucr, primaryType, description, location, arrest, domestic, TOTUPLE(lon, lat) AS geoLocation; Here, we need the nested object with the geoLocation name in the target Elasticsearch document. We can achieve this with a Tuple to represent the lat and lon fields. TOTUPLE() helps us to create this tuple. We then assigned the geoLocation alias for this tuple. Let's store the TARGET relationto the Elasticsearch index with the following code: grunt> STORE TARGET INTO 'esh_pig/crimes' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.timeout = 5m', 'es.index.auto.create = true', 'es.mapping.names=arrest:isArrest, domestic:isDomestic', 'es.mapping.id=id'); We can specify the target index and type to store indexed documents. The EsStorage class can accept multiple Elasticsearch configurations.es.mapping.names maps the Pig field name to Elasticsearch document's field name. You can use Pig's field id to assign a custom _id value for the Elasticsearch document using the es.mapping.id option. Similarly, you can set the _ttl and _timestamp metadata fields as well. Pig uses just one reducer in the default configuration. It is recommended to change this behavior to have a parallelism that matches the number of shards available, as shown in the following command: grunt> SET default_parallel 5; Pig also combines the input splits, irrespective of its size. This makes it efficient for small files by reducing the number of mappers. However, this will give performance issues for large files. You can disable this behavior in the Pig script, as shown in the following command: grunt> SET pig.splitCombination FALSE; Executing the preceding commands will create the Elasticsearch index and import crime data documents. If you observe the created documents in Elasticsearch, you can see the geoLocation value isan array in the [-87.74274476, 41.87404405]format. This is because by default, ES-Hadoop ignores the tuple field names and simply converts them as an ordered array. 
If you wish to make your geoLocation field look similar to the key/value-based object with the lat/lon keys, you can do so by including the following configuration in EsStorage: es.mapping.pig.tuple.use.field.names=true

Writing from the JSON source
If you have inputs as a well-formed JSON file, you can avoid conversion and transformations and directly pass the JSON document to Elasticsearch for indexing purposes. You may have the JSON data in Pig as chararray, bytearray, or in any other form that translates to well-formed JSON by calling the toString() method, as shown in the following code:
grunt> JSON_DATA = LOAD '/ch07/crimes.json' USING PigStorage() AS (json:chararray);
grunt> STORE JSON_DATA INTO 'esh_pig/crimes_json' USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');

Type conversions
Take a look at the type mapping of the esh_pig index in Elasticsearch. It maps the geoLocation type to double. This is done because Elasticsearch inferred the double type based on the field type we specified in Pig. To map geoLocation to geo_point, you must create the Elasticsearch mapping for it manually before executing the script. Although Elasticsearch provides a data type detection based on the type of field in the incoming document, it is always good to create the type mapping beforehand in Elasticsearch. This is a one-time activity that you should do. Then, you can run the MapReduce, Pig, Hive, Cascading, or Spark jobs multiple times. This will avoid any surprises in the type detection. For your reference, here is a list of some of the field types of Pig and Elasticsearch that map to each other. The table doesn't list no-brainer and absolutely intuitive type mappings:

Pig type -> Elasticsearch type
chararray -> string
bytearray -> binary
tuple -> array (default) or object
bag -> array
map -> object
bigdecimal -> not supported
biginteger -> not supported

Reading data from Elasticsearch
Reading data from Elasticsearch using Pig is as simple as writing a single command with the Elasticsearch query. Here is a snippet of how to print tuples for crimes related to theft:
grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar
grunt> ES = LOAD 'esh_pig/crimes' using org.elasticsearch.hadoop.pig.EsStorage('{"query" : { "term" : { "primaryType" : "theft" } } }');
grunt> dump ES;

Executing the preceding commands will print the tuples to the Pig console.

Giving Spark to Elasticsearch
Spark is a distributed computing system that provides a huge performance boost compared to Hadoop MapReduce. It works on an abstraction called RDD (Resilient Distributed Datasets), which can be created for any data residing in Hadoop. Without any surprises, ES-Hadoop provides easy integration with Spark by enabling the creation of RDDs from the data in Elasticsearch. Spark's increasing support for integrating with various data sources, such as HDFS, Parquet, Avro, S3, Cassandra, relational databases, and streaming data makes it special when it comes to data integration. This means that when you use ES-Hadoop with Spark, you can make all these sources integrate with Elasticsearch easily.
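Before walking through the Java-based Spark setup and jobs in the next sections, here is a compact sketch of what reading the index created in the Pig section can look like from PySpark, using the same org.elasticsearch.spark.sql data source covered later in this article. This is an illustrative sketch rather than code from the book: the file name, master setting, and paths are assumptions, and it presumes a Spark 1.4 installation with the ES-Hadoop JAR passed via --jars and the esh_pig/crimes index already populated.

# Submit with something like (paths are examples):
#   bin/spark-submit --jars /opt/lib/elasticsearch-hadoop-2.1.1.jar es_read_sketch.py
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("esh-pyspark-sketch").setMaster("local[4]")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Load the esh_pig/crimes index through the ES-Hadoop Spark SQL data source.
crimes = (sqlContext.read
          .format("org.elasticsearch.spark.sql")
          .option("es.nodes", "localhost")
          .load("esh_pig/crimes"))

# Show a few theft-related records.
crimes.filter(crimes.primaryType == "THEFT").show(5)

sc.stop()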
Setting up Spark
In order to set up Apache Spark to execute a job, you can perform the following steps:

First, download the Apache Spark distribution with the following command:
$ sudo wget -O /usr/local/spark.tgz http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz

Then, extract Spark to the desired location and rename it to a convenient name, as shown in the following command:
$ cd /usr/local
$ sudo tar -xvf spark.tgz
$ sudo mv spark-1.4.1-bin-hadoop2.4 spark

Importing data to Elasticsearch
To import the crime dataset to Elasticsearch with Spark, let's see how we can write a Spark job. We will continue using Java to write Spark jobs for consistency. Here are the driver program's snippets:

SparkConf conf = new SparkConf().setAppName("esh-spark").setMaster("local[4]");
conf.set("es.index.auto.create", "true");
JavaSparkContext context = new JavaSparkContext(conf);

Set up the SparkConf object to configure the Spark job. As always, you can also set most options (such as es.index.auto.create) and other configurations that we have seen throughout the article. Using this configuration, we created the JavaSparkContext object as follows:

JavaRDD<String> textFile = context.textFile("hdfs://localhost:9000/ch07/crimes_dataset.csv");

Read the crime data CSV file as a JavaRDD. Here, the RDD is still of the type String, representing each line:

JavaRDD<Crime> dataSplits = textFile.map(new Function<String, Crime>() {
    @Override
    public Crime call(String line) throws Exception {
        CSVParser parser = CSVParser.parse(line, CSVFormat.RFC4180);
        Crime c = new Crime();
        CSVRecord record = parser.getRecords().get(0);
        c.setId(record.get(0));
        ..
        ..
        String lat = record.get(10);
        String lon = record.get(11);
        Map<String, Double> geoLocation = new HashMap<>();
        geoLocation.put("lat", StringUtils.isEmpty(lat) ? null : Double.parseDouble(lat));
        geoLocation.put("lon", StringUtils.isEmpty(lon) ? null : Double.parseDouble(lon));
        c.setGeoLocation(geoLocation);
        return c;
    }
});

In the preceding snippet, we called the map() method on the JavaRDD to map each input line to a Crime object. Note that we created a simple JavaBean class called Crime that implements the Serializable interface and maps to the Elasticsearch document structure. Using CSVParser, we parsed each field into the Crime object. We mapped the nested geoLocation object by embedding a Map in the Crime object. This map is populated with the lat and lon fields. This map() method returns another JavaRDD that contains the Crime objects, as shown in the following code:

JavaEsSpark.saveToEs(dataSplits, "esh_spark/crimes");

Save the JavaRDD<Crime> to Elasticsearch with the JavaEsSpark class provided by Elasticsearch. For all the ES-Hadoop integrations, such as Pig, Hive, Cascading, Apache Storm, and Spark, you can use all the standard ES-Hadoop configurations and techniques. This includes dynamic/multiresource writes with a pattern similar to esh_spark/{primaryType}, and using JSON strings to directly import the data to Elasticsearch as well. To control the Elasticsearch document metadata that gets indexed, you can use the saveToEsWithMeta() method of JavaEsSpark. You can pass an instance of JavaPairRDD that contains Tuple2<Metadata, Object>, where Metadata represents a map that has the key/value pairs of the document metadata fields, such as id, ttl, timestamp, and version.

Using SparkSQL
ES-Hadoop also bridges Elasticsearch with the SparkSQL module. SparkSQL 1.3+ versions provide the DataFrame abstraction that represents a collection of Row objects.
We will not discuss the details of DataFrame here. ES-Hadoop lets you persist your DataFrame instance to Elasticsearch transparently. Let's see how we can do this with the following code:

SQLContext sqlContext = new SQLContext(context);
DataFrame df = sqlContext.createDataFrame(dataSplits, Crime.class);

Create an SQLContext instance using the JavaSparkContext instance. Using the SQLContext instance, you can create a DataFrame by calling the createDataFrame() method and passing the existing JavaRDD<T> and Class<T>, where T is a JavaBean class that implements the Serializable interface. Note that the class passed is required to infer a schema for the DataFrame. If you wish to use a non-JavaBean-based RDD, you can create the schema manually. The article source code contains the implementations of both approaches for your reference. Take a look at the following code:

JavaEsSparkSQL.saveToEs(df, "esh_sparksql/crimes_reflection");

Once you have the DataFrame instance, you can save it to Elasticsearch with the JavaEsSparkSQL class, as shown in the preceding code.

Reading data from Elasticsearch
Here is the snippet of SparkEsReader that finds crimes related to theft:

JavaRDD<Map<String, Object>> esRDD = JavaEsSpark.esRDD(context, "esh_spark/crimes",
        "{\"query\" : { \"term\" : { \"primaryType\" : \"theft\" } } }").values();
for (Map<String, Object> item : esRDD.collect()) {
    System.out.println(item);
}

We used the same JavaEsSpark class to create an RDD with documents that match the Elasticsearch query.

Using SparkSQL
ES-Hadoop provides an org.elasticsearch.spark.sql data source provider to read the data from Elasticsearch using SparkSQL, as shown in the following code:

Map<String, String> options = new HashMap<>();
options.put("pushdown", "true");
options.put("es.nodes", "localhost");

DataFrame df = sqlContext.read()
    .options(options)
    .format("org.elasticsearch.spark.sql")
    .load("esh_sparksql/crimes_reflection");

The preceding code snippet uses the org.elasticsearch.spark.sql data source to load data from Elasticsearch. You can set the pushdown option to true to push the query execution down to Elasticsearch. This greatly increases its efficiency as the query execution is collocated where the data resides, as shown in the following code:

df.registerTempTable("crimes");
DataFrame theftCrimes = sqlContext.sql("SELECT * FROM crimes WHERE primaryType='THEFT'");
for (Row row : theftCrimes.javaRDD().collect()) {
    System.out.println(row);
}

We registered a temporary table with the DataFrame and executed the SQL query on the SQLContext. Note that we need to collect the final results locally to print them in the driver class.

Summary
In this article, we looked at the various Hadoop ecosystem technologies. We set up Pig with ES-Hadoop and developed the script to interact with Elasticsearch. You also learned how to use ES-Hadoop to integrate Elasticsearch with Spark and empower it with the powerful SQL engine SparkSQL.

Resources for Article:
Further resources on this subject:
Extending ElasticSearch with Scripting [Article]
Elasticsearch Administration [Article]
Downloading and Setting Up ElasticSearch [Article]

Introduction to Data Analysis and Libraries

Packt
07 Oct 2015
13 min read
In this article by Martin Czygan and Phuong Vothihong, the authors of the book Getting Started with Python Data Analysis, Data is raw information that can exist in any form, which is either usable or not. We can easily get data everywhere in our life; for example, the price of gold today is $ 1.158 per ounce. It does not have any meaning, except describing the gold price. This also shows that data is useful based on context. (For more resources related to this topic, see here.) With relational data connection, information appears and allows us to expand our knowledge beyond the range of our senses. When we possess gold price data gathered overtime, one information we might have is that the price has continuously risen from $1.152 to $1.158 for three days. It is used by someone who tracks gold prices. Knowledge helps people to create value in their lives and work. It is based on information that is organized, synthesized, or summarized to enhance comprehension, awareness, or understanding. It represents a state or potential for action and decisions. When the gold price continuously increases for three days, it will lightly decrease on the next day; this is useful knowledge. The following figure illustrates the steps from data to knowledge; we call this process the data analysis process and we will introduce it in the next section: In this article, we will cover the following topics: Data analysis and process Overview of libraries in data analysis using different programming languages Common Python data analysis libraries Data analysis and process Data is getting bigger and more diversified every day. Therefore, analyzing and processing data to advance human knowledge or to create value are big challenges. To tackle these challenges, you will need domain knowledge and a variety of skills, drawing from areas such as computer science, artificial intelligence (AI) and machine learning (ML), statistics and mathematics, and knowledge domain, as shown in the following figure: Let's us go through the Data analysis and it's domain knowledge: Computer science: We need this knowledge to provide abstractions for efficient data processing. A basic Python programming experience is required. We will introduce Python libraries used in data analysis. Artificial intelligence and machine learning: If computer science knowledge helps us to program data analysis tools, artificial intelligence and machine learning help us to model the data and learn from it in order to build smart products. Statistics and mathematics: We cannot extract useful information from raw data if we do not use statistical techniques or mathematical functions. Knowledge domain: Besides technology and general techniques, it is important to have an insight into the specific domain. What do the data fields mean? What data do we need to collect? Based on the expertise, we explore and analyze raw data by applying the above techniques, step by step. Data analysis is a process composed of the following steps: Data requirements: We have to define what kind of data will be collected based on the requirements or problem analysis. For example, if we want to detect a user's behavior while reading news on the internet, we should be aware of visited article links, date and time, article categories, and the user's time spent on different pages. Data collection: Data can be collected from a variety of sources: mobile, personal computer, camera, or recording device. 
It may also be obtained through different ways: communication, event, and interaction between person and person, person and device, or device and device. Data appears whenever and wherever in the world. The problem is, how can we find and gather it to solve our problem? This is the mission of this step. Data processing: Data that is initially obtained must be processed or organized for analysis. This process is performance-sensitive: How fast can we create, insert, update, or query data? For building a real product that has to process big data, we should consider this step carefully. What kind of database should we use to store data? What kind of data structure, such as analysis, statistics, or visualization, is suitable for our purposes? Data cleaning: After being processed and organized, the data may still contain duplicates or errors. Therefore, we need a cleaning step to reduce those situations and increase the quality of the results in the following steps. Common tasks include record matching, deduplication, or column segmentation. Depending on the type of data, we can apply several types of data cleaning. For example, a user's history of a visited news website might contain a lot of duplicate rows, because the user might have refreshed certain pages many times. For our specific issue, these rows might not carry any meaning when we explore the user's behavior. So, we should remove them before saving it to our database. Another situation we may encounter is click fraud on news—someone just wants to improve their website ranking or sabotage the website. In this case, the data will not help us to explore a user's behavior. We can use thresholds to check whether a visit page event comes from a real person or from malicious software. Exploratory data analysis: Now, we can start to analyze data through a variety of techniques referred to as exploratory data analysis. We may detect additional problems in data cleaning or discover requests for further data. Therefore, these steps may be iterative and repeated throughout the whole data analysis process. Data visualization techniques are also used to examine the data in graphs or charts. Visualization often facilitates the understanding of data sets, especially, if they are large or high-dimensional. Modelling and algorithms: A lot of mathematical formulas and algorithms may be applied to detect or predict useful knowledge from the raw data. For example, we can use similarity measures to cluster users who have exhibited similar news reading behavior and recommend articles of interest to them next time. Alternatively, we can detect users' gender based on their news reading behavior by applying classification models such as Support Vector Machine (SVM) or linear regression. Depending on the problem, we may use different algorithms to get an acceptable result. It can take a lot of time to evaluate the accuracy of the algorithms and to choose the best one to implement for a certain product. Data product: The goal of this step is to build data products that receive data input and generate output according to the problem requirements. We will apply computer science knowledge to implement our selected algorithms as well as manage the data storage. Overview of libraries in data analysis There are numerous data analysis libraries that help us to process and analyze data. They use different programming languages and have different advantages as well as disadvantages of solving various data analysis problems. 
Now, we introduce some common libraries that may be useful for you. They should give you an overview of libraries in the field. However, the rest of this focuses on Python-based libraries. Some of the libraries that use the Java language for data analysis are as follows: Weka: This is the library that I got familiar with, the first time I learned about data analysis. It has a graphical user interface that allows you to run experiments on a small dataset. This is great if you want to get a feel for what is possible in the data processing space. However, if you build a complex product, I think it is not the best choice because of its performance: sketchy API design, non-optimal algorithms, and little documentation (http://www.cs.waikato.ac.nz/ml/weka/). Mallet: This is another Java library that is used for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications on text. There is an add-on package to Mallet, called GRMM, that contains support for inference in general, graphical models, and training of Conditional random fields (CRF) with arbitrary graphical structure. In my experience, the library performance as well as the algorithms are better than Weka. However, its only focus is on text processing problems. The reference page is at http://mallet.cs.umass.edu/. Mahout: This is Apache's machine learning framework built on top of Hadoop; its goal is to build a scalable machine learning library. It looks promising, but comes with all the baggage and overhead of Hadoop. The Homepage is at http://mahout.apache.org/. Spark: This is a relatively new Apache project; supposedly up to a hundred times faster than Hadoop. It is also a scalable library that consists of common machine learning algorithms and utilities. Development can be done in Python as well as in any JVM language. The reference page is at https://spark.apache.org/docs/1.5.0/mllib-guide.html. Here are a few libraries that are implemented in C++: Vowpal Wabbit: This library is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. It has been used to learn a tera-feature (1012) dataset on 1000 nodes in one hour. More information can be found in the publication at http://arxiv.org/abs/1110.4198. MultiBoost: This package is a multiclass, multilabel, and multitask classification boosting software implemented in C++. If you use this software, you should refer to the paper published in 2012, in the Journal of Machine Learning Research, MultiBoost: A Multi-purpose Boosting Package, D.Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl. MLpack: This is also a C++ machine learning library, developed by the Fundamental Algorithmic and Statistical Tools Laboratory (FASTLab) at Georgia Tech. It focusses on scalability, speed, and ease-of-use and was presented at the BigLearning workshop of NIPS 2011. Its homepage is at http://www.mlpack.org/about.html. Caffe: The last C++ library we want to mention is Caffe. This is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. You can find more information about it at http://caffe.berkeleyvision.org/. Other libraries for data processing and analysis are as follows: Statsmodels: This is a great Python library for statistical modelling and is mainly used for predictive and exploratory analysis. 
Modular toolkit for data processing (MDP):This is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures (http://mdp-toolkit.sourceforge.net/index.html). Orange: This is an open source data visualization and analysis for novices and experts. It is packed with features for data analysis and has add-ons for bioinformatics and text mining. It contains an implementation of self-organizing maps, which sets it apart from the other projects as well (http://orange.biolab.si/). Mirador: This is a tool for the visual exploration of complex datasets supporting Mac and Windows. It enables users to discover correlation patterns and derive new hypotheses from data (http://orange.biolab.si/). RapidMiner: This is another GUI-based tool for data mining, machine learning, and predictive analysis (https://rapidminer.com/). Theano: This bridges the gap between Python and lower-level languages. Theano gives very significant performance gains, particularly for large matrix operations and is, therefore, a good choice for deep learning models. However, it is not easy to debug because of the additional compilation layer. Natural language processing toolkit (NLTK): This is written in Python with very unique and salient features. Here, I could not list all libraries for data analysis. However, I think the above libraries are enough to take a lot of your time to learn and build data analysis applications. Python libraries in data analysis Python is a multi-platform, general purpose programming language that can run on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and the .NET virtual machines as well. It has a powerful standard library. In addition, it has many libraries for data analysis: Pylearn2, Hebel, Pybrain, Pattern, MontePython, and MILK. We will cover some common Python data analysis libraries such as Numpy, Pandas, Matplotlib, PyMongo, and scikit-learn. Now, to help you getting started, I will briefly present an overview of each library for those who are less familiar with the scientific Python stack. NumPy One of the fundamental packages used for scientific computing with Python is Numpy. Among other things, it contains the following: A powerful N-dimensional array object Sophisticated (broadcasting) functions for performing array computations Tools for integrating C/C++ and Fortran code Useful linear algebra operations, Fourier transforms, and random number capabilities. Besides this, it can also be used as an efficient multidimensional container of generic data. Arbitrary data types can be defined and integrated with a wide variety of databases. Pandas Pandas is a Python package that supports rich data structures and functions for analyzing data and is developed by the PyData Development Team. It is focused on the improvement of Python's data libraries. 
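Before listing the main pieces of Pandas below, here is a minimal sketch that places the NumPy array object described above next to the labelled structures Pandas adds on top of it. The values are illustrative only and reuse the gold-price figures from the beginning of this article:

import numpy as np
import pandas as pd

# NumPy: an N-dimensional array object with broadcasting
prices = np.array([1.152, 1.155, 1.158])
print(prices * 1000)  # broadcasting multiplies every element at once

# Pandas: labelled data structures built on top of NumPy arrays
series = pd.Series(prices, index=["day1", "day2", "day3"], name="gold_price")
frame = pd.DataFrame({"gold_price": series})
print(frame.describe())  # summary statistics per column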
Pandas consists of the following things:

A set of labelled array data structures, the primary of which are Series, DataFrame, and Panel
Index objects enabling both simple axis indexing and multilevel/hierarchical axis indexing
An integrated group by engine for aggregating and transforming datasets
Date range generation and custom date offsets
Input/output tools that load and save data from flat files or the PyTables/HDF5 format
Memory-optimized versions of the standard data structures
Moving window statistics and static and moving window linear/panel regression

Because of these features, Pandas is an ideal tool for systems that need complex data structures or high-performance time series functions, such as financial data analysis applications.

Matplotlib

Matplotlib is the single most used Python package for 2D graphics. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats: line plots, contour plots, scatter plots, or Basemap plots. It comes with a set of default settings but allows customizing all kinds of properties; at the same time, we can easily create a chart using the defaults for almost every property in Matplotlib.

PyMongo

MongoDB is a type of NoSQL database. It is highly scalable, robust, and perfect to work with JavaScript-based web applications, because we can store data as JSON documents and use flexible schemas. PyMongo is a Python distribution containing tools for working with MongoDB. Many tools have also been written on top of PyMongo to add more features, such as MongoKit, Humongolus, MongoAlchemy, and Ming.

scikit-learn

scikit-learn is an open source machine learning library for the Python programming language. It supports various machine learning models, such as classification, regression, and clustering algorithms, and it interoperates with the Python numerical and scientific libraries NumPy and SciPy. The latest scikit-learn version is 0.16.1, published in April 2015.

Summary

In this article, we presented three main points. First, we figured out the relationship between raw data, information, and knowledge. Second, because of the contribution data analysis makes to our lives, we continued with an overview of data analysis and its processing steps. Finally, we introduced a few common libraries that are useful for practical data analysis applications; among those, we will focus on the Python libraries for data analysis.

Resources for Article:

Further resources on this subject:

Exploiting Services with Python [Article]
Basics of Jupyter Notebook and Python [Article]
How to do Machine Learning with Python [Article]
Fingerprint detection using OpenCV 3

Packt
07 Oct 2015
11 min read
In this article by Joseph Howse, Quan Hua, Steven Puttemans, and Utkarsh Sinha, the authors of OpenCV Blueprints, we delve into the aspect of fingerprint detection using OpenCV. (For more resources related to this topic, see here.)

Fingerprint identification, how is it done?

We have already discussed the use of the first biometric, which is the face of the person trying to log in to the system. However, since we mentioned that using a single biometric can be quite risky, we suggest adding secondary biometric checks to the system, like the fingerprint of a person. There are many off-the-shelf fingerprint scanners that are quite cheap and return the scanned image. However, you will still have to write your own registration software for these scanners, which can be done using OpenCV. Examples of such fingerprint images can be found below.

Examples of a single individual's thumb fingerprint in different scanning positions

This dataset can be downloaded from the FVC2002 competition website released by the University of Bologna. The website (http://bias.csr.unibo.it/fvc2002/databases.asp) contains 4 databases of fingerprints available for public download, with the following structure:

Four fingerprint capturing devices, DB1 - DB4
For each device, the prints of 10 individuals are available
For each person, 8 different positions of prints were recorded

We will use this publicly available dataset to build our system upon. We will focus on the first capturing device, using up to 4 fingerprints of each individual for training the system and making an average descriptor of the fingerprint. Then we will use the other 4 fingerprints to evaluate our system and make sure that the person is still recognized by our system. You could apply exactly the same approach on the data grabbed from the other devices if you would like to investigate the difference between a system that captures almost binary images and one that captures grayscale images. However, we will provide techniques for doing the binarization yourself.

Implement the approach in OpenCV 3

The complete fingerprint software for processing fingerprints derived from a fingerprint scanner can be found at https://github.com/OpenCVBlueprints/OpenCVBlueprints/tree/master/chapter_6/source_code/fingerprint/fingerprint_process/. In this subsection, we will describe how you can implement this approach in the OpenCV interface. We will start by grabbing the image from the fingerprint system and applying binarization. This will enable us to remove any undesired noise from the image, as well as help us make the contrast better between the skin and the wrinkled surface of the finger.

// Start by reading in an image
Mat input = imread("/data/fingerprints/image1.png", CV_LOAD_GRAYSCALE);
// Binarize the image, through local thresholding
Mat input_binary;
threshold(input, input_binary, 0, 255, CV_THRESH_BINARY | CV_THRESH_OTSU);

The Otsu thresholding will automatically choose the best generic threshold for the image to obtain a good contrast between foreground and background information. This is because the image contains a bimodal distribution of pixel values (which means that the image has a two-peak histogram). For such an image, we can approximately take a value in the middle of those peaks as the threshold value (for images that are not bimodal, binarization won't be accurate). Otsu allows us to avoid using a fixed threshold value, making the system more general for any capturing device.
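For reference, and as standard background rather than something taken from the original article, Otsu's method can be described as choosing the threshold t that maximizes the between-class variance of the resulting foreground/background split:

\sigma_b^2(t) = \omega_0(t)\,\omega_1(t)\,[\mu_0(t) - \mu_1(t)]^2

where \omega_0(t) and \omega_1(t) are the fractions of pixels below and above t, and \mu_0(t) and \mu_1(t) are their mean intensities. For a bimodal histogram, this maximum lies between the two peaks, which is exactly the behavior we rely on here.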
However, we do acknowledge that if you have only a single capturing device, then playing around with a fixed threshold value could result in a better image for that specific setup. The result of the thresholding can be seen below. In order to make the thinning from the next skeletization step as effective as possible, we need the inverse binary image.

Comparison between grayscale and binarized fingerprint image

Once we have a binary image, we are actually already set to calculate our feature points and feature point descriptors. However, in order to improve the process a bit more, we suggest skeletizing the image. This will create more unique and stronger interest points. The following piece of code applies skeletization on top of the binary image. The skeletization is based on the Zhang-Suen line thinning approach. Special thanks to @bsdNoobz of the OpenCV Q&A forum, who supplied this iteration approach.

#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>

using namespace std;
using namespace cv;

// Perform a single thinning iteration, which is repeated until the skeletization is finalized
void thinningIteration(Mat& im, int iter)
{
    Mat marker = Mat::zeros(im.size(), CV_8UC1);
    for (int i = 1; i < im.rows-1; i++)
    {
        for (int j = 1; j < im.cols-1; j++)
        {
            uchar p2 = im.at<uchar>(i-1, j);
            uchar p3 = im.at<uchar>(i-1, j+1);
            uchar p4 = im.at<uchar>(i, j+1);
            uchar p5 = im.at<uchar>(i+1, j+1);
            uchar p6 = im.at<uchar>(i+1, j);
            uchar p7 = im.at<uchar>(i+1, j-1);
            uchar p8 = im.at<uchar>(i, j-1);
            uchar p9 = im.at<uchar>(i-1, j-1);

            int A = (p2 == 0 && p3 == 1) + (p3 == 0 && p4 == 1) +
                    (p4 == 0 && p5 == 1) + (p5 == 0 && p6 == 1) +
                    (p6 == 0 && p7 == 1) + (p7 == 0 && p8 == 1) +
                    (p8 == 0 && p9 == 1) + (p9 == 0 && p2 == 1);
            int B = p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9;
            int m1 = iter == 0 ? (p2 * p4 * p6) : (p2 * p4 * p8);
            int m2 = iter == 0 ? (p4 * p6 * p8) : (p2 * p6 * p8);

            if (A == 1 && (B >= 2 && B <= 6) && m1 == 0 && m2 == 0)
                marker.at<uchar>(i,j) = 1;
        }
    }
    im &= ~marker;
}

// Function for thinning any given binary image within the range of 0-255. If not, you should first make sure that your image has this range preset and configured!
void thinning(Mat& im)
{
    // Enforce the range to be in between 0 - 255
    im /= 255;

    Mat prev = Mat::zeros(im.size(), CV_8UC1);
    Mat diff;
    do {
        thinningIteration(im, 0);
        thinningIteration(im, 1);
        absdiff(im, prev, diff);
        im.copyTo(prev);
    }
    while (countNonZero(diff) > 0);

    im *= 255;
}

The code above can then simply be applied to our previous steps by calling the thinning function on top of the binary image we generated earlier. The code for this is:

// Apply thinning algorithm
Mat input_thinned = input_binary.clone();
thinning(input_thinned);

This will result in the following output:

Comparison between binarized and thinned fingerprint image using skeletization techniques

Once we have this skeleton image, the next step is to look for crossing points on the ridges of the fingerprint, which are called minutiae points. We can do this with a keypoint detector that looks at a large change in local contrast, like the Harris corner detector. Since the Harris corner detector is able to detect both strong corners and edges, it is ideal for the fingerprint problem, where the most important minutiae are short edges and bifurcations, the positions where edges come together. More information about minutiae points and Harris corner detection can be found in the following publications:

Ross, Arun A., Jidnya Shah, and Anil K. Jain.
"Toward reconstructing fingerprints from minutiae points." Defense and Security. International Society for Optics and Photonics, 2005. Harris, Chris, and Mike Stephens. "A combined corner and edge detector." Alvey vision conference. Vol. 15. 1988. Calling the Harris Corner operation on a skeletonized and binarized image in OpenCV is quite straightforward. The Harris corners are stored as positions corresponding in the image with their cornerness response value. If we want to detect points with a certain cornerness, than we should simply threshold the image. Mat harris_corners, harris_normalised; harris_corners = Mat::zeros(input_thinned.size(), CV_32FC1); cornerHarris(input_thinned, harris_corners, 2, 3, 0.04, BORDER_DEFAULT); normalize(harris_corners, harris_normalised, 0, 255, NORM_MINMAX, CV_32FC1, Mat()); We now have a map with all the available corner responses rescaled to the range of [0 255] and stored as float values. We can now manually define a threshold, that will generate a good amount of keypoints for our application. Playing around with this parameter could improve performance in other cases. This can be done by using the following code snippet: float threshold = 125.0; vector<KeyPoint> keypoints; Mat rescaled; convertScaleAbs(harris_normalised, rescaled); Mat harris_c(rescaled.rows, rescaled.cols, CV_8UC3); Mat in[] = { rescaled, rescaled, rescaled }; int from_to[] = { 0,0, 1,1, 2,2 }; mixChannels( in, 3, &harris_c, 1, from_to, 3 ); for(int x=0; x<harris_normalised.cols; x++){ for(int y=0; y<harris_normalised.rows; y++){ if ( (int)harris_normalised.at<float>(y, x) > threshold ){ // Draw or store the keypoint location here, just like you decide. In our case we will store the location of the keypoint circle(harris_c, Point(x, y), 5, Scalar(0,255,0), 1); circle(harris_c, Point(x, y), 1, Scalar(0,0,255), 1); keypoints.push_back( KeyPoint (x, y, 1) ); } } } Comparison between thinned fingerprint and Harris corner response, as well as the selected Harris corners. Now that we have a list of keypoints we will need to create some of formal descriptor of the local region around that keypoint to be able to uniquely identify it among other keypoints. Chapter 3, Recognizing facial expressions with machine learning, discusses in more detail the wide range of keypoints out there. In this article, we will mainly focus on the process. Feel free to adapt the interface with other keypoint detectors and descriptors out there, for better or for worse performance. Since we have an application where the orientation of the thumb can differ (since it is not a fixed position), we want a keypoint descriptor that is robust at handling these slight differences. One of the mostly used descriptors for that is the SIFT descriptor, which stands for scale invariant feature transform. However SIFT is not under a BSD license and can thus pose problems to use in commercial software. A good alternative in OpenCV is the ORB descriptor. In OpenCV you can implement it in the following way. Ptr<Feature2D> orb_descriptor = ORB::create(); Mat descriptors; orb_descriptor->compute(input_thinned, keypoints, descriptors); This will enable us to calculate only the descriptors using the ORB approach, since we already retrieved the location of the keypoints using the Harris corner approach. At this point we can retrieve a descriptor for each detected keypoint of any given fingerprint. The descriptors matrix will contain a row for each keypoint containing the representation. 
Let us now start from the case where we have only a single reference image for each fingerprint. In that case, we will have a database containing a set of feature descriptors for the training persons in the database. We then have a single new entry, consisting of multiple descriptors for the keypoints found at registration time. We now have to match these descriptors to the descriptors stored in the database to see which one has the best match. The simplest way is to perform brute force matching using the Hamming distance criterion between descriptors of different keypoints.

// Imagine we have a vector of single entry descriptors as a database
// We will still need to fill those once we compare everything, by using the code snippets above
vector<Mat> database_descriptors;
Mat current_descriptors;

// Create the matcher interface
Ptr<DescriptorMatcher> matcher = DescriptorMatcher::create("BruteForce-Hamming");

// Now loop over the database and start the matching
vector< vector< DMatch > > all_matches;
for(int entry=0; entry<database_descriptors.size(); entry++){
    vector< DMatch > matches;
    matcher->match(database_descriptors[entry], current_descriptors, matches);
    all_matches.push_back(matches);
}

We now have all the matches stored as DMatch objects. This means that for each matching couple, we will have the original keypoint, the matched keypoint, and a floating point score between both matches, representing the distance between the matched points. The idea of finding the best match seems pretty straightforward. We take a look at the number of matches that have been returned by the matching process and weigh them by their Euclidean distance in order to add some certainty. We then look for the matching process that yielded the biggest score. This will be our best match, and the match we want to return as the selected one from the database. If you want to avoid an imposter being assigned the best matching score, you can again add a manual threshold on top of the scoring, so that matches that are not good enough are ignored. However, it is possible, and should be taken into consideration, that if you set this threshold too high, people whose fingerprint has changed slightly will be rejected by the system, for example when someone cuts their finger and thus changes their pattern too drastically.

Fingerprint matching process visualized

To summarize, we saw how to detect fingerprints and implement the approach using OpenCV 3.

Resources for Article:

Further resources on this subject:

Making subtle color shifts with curves [article]
Tracking Objects in Videos [article]
Hand Gesture Recognition Using a Kinect Depth Sensor [article]
Creating Interactive Spreadsheets using Tables and Slicers

Packt
06 Oct 2015
10 min read
In this article by Hernan D Rojas, author of the book Data Analysis and Business Modeling with Excel 2013, we introduce additional material to the advanced Excel developer for the presentation stage of the data analysis life cycle. What you will learn in this article is how to leverage Excel's new features to add interactivity to your spreadsheets. (For more resources related to this topic, see here.)

What are slicers?

Slicers are essentially buttons that automatically filter your data. Excel has always been able to filter data, but slicers are more practical and visually appealing. Let's compare the two in the following steps:

First, fire up Excel 2013, and create a new spreadsheet. Manually enter the data, as shown in the following figure:
Highlight cells A1 through E11, and press Ctrl + T to convert the data to an Excel table. Converting your data to a table is the first step that you need to take in order to introduce slicers in your spreadsheet.
Let's filter our data using the default filtering capabilities that we are already familiar with. Filter the Type column and only select the rows that have the value equal to SUV, as shown in the following figure.
Click on the OK button to apply the filter to the table. You will now be left with four rows that have the Type column equal to SUV.

Using a typical Excel filter, we were able to filter our data and show only the SUV cars. We can then continue to filter by other columns, such as MPG (miles per gallon) and Price. How can we accomplish the same results using slicers? Continue reading this article to find out.

How to create slicers?

In this section, we will go through the simple but powerful steps that are required to build slicers. After we create our first slicer, make sure that you compare and contrast the old way of filtering versus the new way of filtering data.

Remove the filter that we just applied to our table by clicking on the option named Clear Filter From "Type", as shown in the following figure:
With your Excel table selected, click on the TABLE TOOLS tab.
Click on the Insert Slicer button.
In the Insert Slicers dialog box, select the Type checkbox, and click on the OK button, as shown in the following screenshot:
You should now have a slicer that looks similar to the one in the following figure. Notice that you can resize and move the slicer anywhere you want in the spreadsheet.
Click on the Sedan filter in the slicer that we built in the previous step. Wow! The data is filtered, and only the rows where the Type column is equal to Sedan are shown in the results.
Click on the Sport filter and see what happens. The data is now filtered where the Type column is equal to Sport. Notice that the previous filter of Sedan was removed as soon as we clicked on the Sport filter.
What if we want to filter the data by both Sport and Sedan? We can just highlight both filters with our mouse, or click on Sedan, press Ctrl, and then click on the Sport filter. The end result will look like this:
To clear the filter, just click on the Clear Filter button.

Do you see the advantage of slicers over filters? Yes, of course, they are simply better. Filtering between Sedan, Sport, or SUV is very easy and convenient. It certainly takes fewer keystrokes, and the feedback is instant. Think about the end users interacting with your spreadsheet. At the touch of a button, they can answer questions that arise in their heads. This is what you call an interactive spreadsheet or an interactive dashboard.
Styling slicers

There are not many options to style slicers, but Excel does give you a decent number of color schemes that you can experiment with:

With the Type slicer selected, navigate to the SLICER TOOLS tab, as shown in the following figure:
Click on the various slicer styles available to get a feel for what Excel offers.

Adding multiple slicers

You are able to add multiple slicers and multiple charts in one spreadsheet. Why would we do this? Well, this is the beginning of dashboard creation. Let's expand on the example we have just been working on, and see how we can turn raw data into an interactive dashboard:

Let's start by creating slicers for # of Passengers and MPG, as shown in the following figure:
Rename Sheet1 as Data, and create a new sheet called Dashboard, as shown here:
Move the three slicers by cutting and pasting them from the Data sheet to the Dashboard sheet.
Create a line chart using the columns Company and MPG, as shown in the following figure:
Create a bar chart using the columns Type and MPG.
Create another bar chart with the columns Company and # of Passengers, as shown in the following figure. These types of charts are technically called column charts, but you can get away with calling them bar charts.
Now, move the three charts from the Data tab to the Dashboard tab. Right-click on the bar chart, and select the Move Chart… option.
In the Move Chart dialog box, change the Object in: parameter from Data to Dashboard, and then click on the OK button.
Move the other two charts to the Dashboard tab so that there are no more charts in the Data tab.
Rearrange the charts and slicers so that they look as close as possible to the ones in the following figure. As you can see, this tab is starting to look like a dashboard.
The Type slicer will look better if Sedan, Sport, and SUV are laid out horizontally. Select the Type slicer, and click on the SLICER TOOLS menu option.
Change the Columns parameter from 1 to 3, as shown in the following figure. This is how we are able to change the layout or shape of the slicer.
Resize the Type slicer so that it looks like the one in the following figure:

Clearing filters

You can click on one or more filters in the dashboard that we just created. Very cool! Every time we select a filter, all three charts that we created get updated. This again is called adding interactivity to your spreadsheets. It allows the end users of your dashboard to interact with your data and perform their own analysis. If you notice, there is really no good way of removing multiple filters at once. For example, if you select Sedans that have an MPG greater than or equal to 30, how would you remove all of the filters? You would have to clear the filters from the Type slicer and then from the MPG slicer. This can be a little tedious for your end user, and you will want to avoid it at any cost. The next steps will show you how to create a button using VBA that will clear all of the filters in a flash:

Press Alt + F11, and create a sub procedure called Clear_Slicer, as shown in the following figure. This code will basically find all of the filters that you have selected and then clear them for you one at a time. The next step is to bind this code to a button:

Sub Clear_Slicer()
    ' Declare Variables
    Dim cache As SlicerCache
    ' Loop through each filter
    For Each cache In ActiveWorkbook.SlicerCaches
        ' clear filter
        cache.ClearManualFilter
    Next cache
End Sub

Select the DEVELOPER tab, and click on the Insert button.
In the pop-up menu called Form Controls, select the Button option. Now, click anywhere on the sheet, and you will get a dialog box that looks like the following figure. This is where we are going to assign a macro to the button. This means that whenever you click on the button we are creating, Excel will run the macro of our choice. Since we have already created a macro called Clear_Slicer, it makes sense to select this macro and then click on the OK button.
Change the text of the button to Clear All Filters, and resize it so that it looks like this:
Adjust the properties of the button by right-clicking on the button and selecting the Format Control… option. Here, you can change the font size and the color of your button label.
Now, select a bunch of filters, and click on our new shiny button.

Yes, that was pretty cool. The most important part is that it is now even easier to "reset" your dashboard and start a brand new analysis. What do I mean by starting a brand new analysis? In general, when a user initially starts using your dashboard, he or she will click on the filters aimlessly. The users do this just to figure out how to mechanically use the dashboard. Then, after they get the hang of it, they want to start with a clean slate and perform some data analysis. If we did not have the Clear All Filters button, the users would have to figure out how to clear every slicer one at a time to start over. The worst case scenario is when the user does not realize which filters are turned on and which are turned off. Now, do not laugh at this situation, or assume that your end user is not as smart as you are. This just means that you need to lower the learning curve of your dashboard and make it easy to use. With the addition of the clear button, the end user can think of a question, use the slicers to answer it, click on the clear button, and start the process all over again. These little details are what will separate you from the average Excel developer.

Summary

The aim of this article was to give you ideas and tools to present your data artistically. Whether you like it or not, sometimes a better looking analysis will trump a better but less attractive one. Excel gives you the tools to never be on the short end of the stick and to always be able to present a visually stunning analysis. You now know how to create Excel slicers, and you learned how to bind them to your data. Users of your spreadsheet can now slice and dice your data to answer multiple questions. Executives like flashy visualizations, and when you combine them with a strong analysis, you have a very powerful combination. In this article, we also went through a variety of strategies to customize slicers and chart elements. These little changes made to your dashboard will make it stand out and help you get your message across. Excel, as always, is an invaluable tool that gives you everything necessary to overcome any data challenge you might come across. As I tell all my students, the key to becoming better is simply to practice, practice, and practice.

Resources for Article:

Further resources on this subject:

Introduction to Stata and Data Analytics [article]
Getting Started with Apache Spark DataFrames [article]
Big Data Analysis (R and Hadoop) [article]
Hand Gesture Recognition Using a Kinect Depth Sensor

Packt
06 Oct 2015
26 min read
In this article by Michael Beyeler, author of the book OpenCV with Python Blueprints, the goal is to develop an app that detects and tracks simple hand gestures in real time using the output of a depth sensor, such as that of a Microsoft Kinect 3D sensor or an Asus Xtion. The app will analyze each captured frame to perform the following tasks:

Hand region segmentation: The user's hand region will be extracted in each frame by analyzing the depth map output of the Kinect sensor, which is done by thresholding, applying some morphological operations, and finding connected components
Hand shape analysis: The shape of the segmented hand region will be analyzed by determining contours, convex hull, and convexity defects
Hand gesture recognition: The number of extended fingers will be determined based on the hand contour's convexity defects, and the gesture will be classified accordingly (with no extended finger corresponding to a fist, and five extended fingers corresponding to an open hand)

Gesture recognition is an ever popular topic in computer science. This is because it not only enables humans to communicate with machines (human-machine interaction or HMI), but also constitutes the first step for machines to begin understanding human body language. With affordable sensors, such as Microsoft Kinect or Asus Xtion, and open source software such as OpenKinect and OpenNI, it has never been easier to get started in the field yourself. So what shall we do with all this technology?

The beauty of the algorithm that we are going to implement in this article is that it works well for a number of hand gestures, yet is simple enough to run in real time on a generic laptop. And if we want, we can easily extend it to incorporate more complicated hand pose estimations. The end product looks like this: no matter how many fingers of my left hand I extend, the algorithm correctly segments the hand region (white), draws the corresponding convex hull (the green line surrounding the hand), finds all convexity defects that belong to the spaces between fingers (large green points) while ignoring others (small red points), and infers the correct number of extended fingers (the number in the bottom-right corner), even for a fist.

This article assumes that you have a Microsoft Kinect 3D sensor installed. Alternatively, you may install Asus Xtion or any other depth sensor for which OpenCV has built-in support. First, install OpenKinect and libfreenect from http://www.openkinect.org/wiki/Getting_Started. Then, you need to build (or rebuild) OpenCV with OpenNI support. The GUI used in this article will again be designed with wxPython, which can be obtained from http://www.wxpython.org/download.php.

Planning the app

The final app will consist of the following modules and scripts:

gestures: A module that consists of an algorithm for recognizing hand gestures. We separate this algorithm from the rest of the application so that it can be used as a standalone module without the need for a GUI.
gestures.HandGestureRecognition: A class that implements the entire process flow of hand gesture recognition. It accepts a single-channel depth image (acquired from the Kinect depth sensor) and returns an annotated RGB color image with an estimated number of extended fingers.
gui: A module that provides a wxPython GUI application to access the capture device and display the video feed. In order to have it access the Kinect depth sensor instead of a generic camera, we will have to extend some of the base class functionality.
gui.BaseLayout: A generic layout from which more complicated layouts can be built.

Setting up the app

Before we can get to the nitty-gritty of our gesture recognition algorithm, we need to make sure that we can access the Kinect sensor and display a stream of depth frames in a simple GUI.

Accessing the Kinect 3D sensor

Accessing Microsoft Kinect from within OpenCV is not much different from accessing a computer's webcam or camera device. The easiest way to integrate a Kinect sensor with OpenCV is by using an OpenKinect module called freenect. For installation instructions, see the preceding information box. The following code snippet grants access to the sensor using cv2.VideoCapture:

import cv2
import freenect

device = cv2.cv.CV_CAP_OPENNI
capture = cv2.VideoCapture(device)

On some platforms, the first call to cv2.VideoCapture fails to open a capture channel. In this case, we provide a workaround by opening the channel ourselves:

if not(capture.isOpened(device)):
    capture.open(device)

If you want to connect to your Asus Xtion, the device variable should be assigned the cv2.cv.CV_CAP_OPENNI_ASUS value instead. In order to give our app a fair chance to run in real time, we will limit the frame size to 640 x 480 pixels:

capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640)
capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480)

If you are using OpenCV 3, the constants you are looking for might be called cv2.CAP_PROP_FRAME_WIDTH and cv2.CAP_PROP_FRAME_HEIGHT.

The read() method of cv2.VideoCapture is inappropriate when we need to synchronize a set of cameras or a multihead camera, such as a Kinect. In this case, we should use the grab() and retrieve() methods instead. An even easier way when working with OpenKinect is to use the sync_get_depth() and sync_get_video() methods. For the purpose of this article, we will need only the Kinect's depth map, which is a single-channel (grayscale) image in which each pixel value is the estimated distance from the camera to a particular surface in the visual scene. The latest frame can be grabbed via this code:

depth, timestamp = freenect.sync_get_depth()

The preceding code returns both the depth map and a timestamp. We will ignore the latter for now. By default, the map is in 11-bit format, which is inadequate to be visualized with cv2.imshow right away. Thus, it is a good idea to convert the image to 8-bit precision first. In order to reduce the range of depth values in the frame, we will clip the maximal distance to a value of 1,023 (or 2**10-1).
This will get rid of values that correspond either to noise or distances that are far too large to be of interest to us:

np.clip(depth, 0, 2**10-1, depth)
depth >>= 2

Then, we convert the image into 8-bit format and display it:

depth = depth.astype(np.uint8)
cv2.imshow("depth", depth)

Running the app

In order to run our app, we will need to execute a main function routine that accesses the Kinect, generates the GUI, and executes the main loop of the app:

import numpy as np
import wx
import cv2
import freenect

from gui import BaseLayout
from gestures import HandGestureRecognition

def main():
    device = cv2.cv.CV_CAP_OPENNI
    capture = cv2.VideoCapture()
    if not(capture.isOpened()):
        capture.open(device)

    capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640)
    capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480)

We will design a suitable layout (KinectLayout) for the current project:

    # start graphical user interface
    app = wx.App()
    layout = KinectLayout(None, -1, 'Kinect Hand Gesture Recognition', capture)
    layout.Show(True)
    app.MainLoop()

The Kinect GUI

The layout chosen for the current project (KinectLayout) is as plain as it gets. It should simply display the live stream of the Kinect depth sensor at a comfortable frame rate of 10 frames per second. Therefore, there is no need to further customize BaseLayout:

class KinectLayout(BaseLayout):
    def _create_custom_layout(self):
        pass

The only parameter that needs to be initialized this time is the recognition class. This will be useful in just a moment:

    def _init_custom_layout(self):
        self.hand_gestures = HandGestureRecognition()

Instead of reading a regular camera frame, we need to acquire a depth frame via the freenect method sync_get_depth(). This can be achieved by overriding the following method:

    def _acquire_frame(self):

As mentioned earlier, by default, this function returns a single-channel depth image with 11-bit precision and a timestamp. However, we are not interested in the timestamp, and we simply pass on the frame if the acquisition was successful:

        frame, _ = freenect.sync_get_depth()
        # return success if frame size is valid
        if frame is not None:
            return (True, frame)
        else:
            return (False, frame)

The rest of the visualization pipeline is handled by the BaseLayout class. We only need to make sure that we provide a _process_frame method. This method accepts a depth image with 11-bit precision, processes it, and returns an annotated 8-bit RGB color image. Conversion to a regular grayscale image is the same as mentioned in the previous subsection:

    def _process_frame(self, frame):
        # clip max depth to 1023, convert to 8-bit grayscale
        np.clip(frame, 0, 2**10 - 1, frame)
        frame >>= 2
        frame = frame.astype(np.uint8)

The resulting grayscale image can then be passed to the hand gesture recognizer, which will return the estimated number of extended fingers (num_fingers) and the annotated RGB color image mentioned earlier (img_draw):

        num_fingers, img_draw = self.hand_gestures.recognize(frame)

In order to simplify the segmentation task of the HandGestureRecognition class, we will instruct the user to place their hand in the center of the screen.
To provide a visual aid for this, let's draw a rectangle around the image center and highlight the center pixel of the image in orange:

        height, width = frame.shape[:2]
        cv2.circle(img_draw, (width/2, height/2), 3, [255, 102, 0], 2)
        cv2.rectangle(img_draw, (width/3, height/3), (width*2/3, height*2/3), [255, 102, 0], 2)

In addition, we print num_fingers on the screen:

        cv2.putText(img_draw, str(num_fingers), (30, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255))
        return img_draw

Tracking hand gestures in real time

The bulk of the work is done by the HandGestureRecognition class, especially by its recognize method. This class starts off with a few parameter initializations, which will be explained and used later:

class HandGestureRecognition:
    def __init__(self):
        # maximum depth deviation for a pixel to be considered within range
        self.abs_depth_dev = 14
        # cut-off angle (deg): everything below this is a convexity point that belongs to two extended fingers
        self.thresh_deg = 80.0

The recognize method is where the real magic takes place. This method handles the entire process flow, from the raw grayscale image all the way to a recognized hand gesture. It implements the following procedure:

It extracts the user's hand region by analyzing the depth map (img_gray) and returning a hand region mask (segment):

    def recognize(self, img_gray):
        segment = self._segment_arm(img_gray)

It performs contour analysis on the hand region mask (segment). Then, it returns the largest contour area found in the image (contours) and any convexity defects (defects):

        [contours, defects] = self._find_hull_defects(segment)

Based on the contours found and the convexity defects, it detects the number of extended fingers (num_fingers) in the image. Then, it annotates the output image (img_draw) with contours, defect points, and the number of extended fingers:

        img_draw = cv2.cvtColor(img_gray, cv2.COLOR_GRAY2RGB)
        [num_fingers, img_draw] = self._detect_num_fingers(contours, defects, img_draw)

It returns the estimated number of extended fingers (num_fingers) as well as the annotated output image (img_draw):

        return (num_fingers, img_draw)

Hand region segmentation

The automatic detection of an arm, and later the hand region, could be designed to be arbitrarily complicated, maybe by combining information about the shape and color of an arm or hand. However, using skin color as a determining feature to find hands in visual scenes might fail terribly in poor lighting conditions or when the user is wearing gloves. Instead, we choose to recognize the user's hand by its shape in the depth map. Allowing hands of all sorts to be present in any region of the image would unnecessarily complicate the mission of this article, so we make two simplifying assumptions:

We will instruct the user of our app to place their hand in front of the center of the screen, orienting their palm roughly parallel to the orientation of the Kinect sensor so that it is easier to identify the corresponding depth layer of the hand.
We will also instruct the user to sit roughly 1 to 2 meters away from the Kinect, and to slightly extend their arm in front of their body so that the hand will end up in a slightly different depth layer than the arm. However, the algorithm will still work even if the full arm is visible.

In this way, it will be relatively straightforward to segment the image based on the depth layer alone. Otherwise, we would have to come up with a hand detection algorithm first, which would unnecessarily complicate our mission.
If you feel adventurous, feel free to do this on your own.

Finding the most prominent depth of the image center region

Once the hand is placed roughly in the center of the screen, we can start finding all image pixels that lie on the same depth plane as the hand. For this, we simply need to determine the most prominent depth value of the center region of the image. The simplest approach would be as follows: look only at the depth value of the center pixel:

width, height = depth.shape
center_pixel_depth = depth[width/2, height/2]

Then, create a mask in which all pixels at a depth of center_pixel_depth are white and all others are black:

import numpy as np
depth_mask = np.where(depth == center_pixel_depth, 255, 0).astype(np.uint8)

However, this approach will not be very robust, because chances are that:

Your hand is not placed perfectly parallel to the Kinect sensor
Your hand is not perfectly flat
The Kinect sensor values are noisy

Therefore, different regions of your hand will have slightly different depth values. The _segment_arm method takes a slightly better approach, that is, looking at a small neighborhood in the center of the image and determining the median (meaning the most prominent) depth value. First, we find the center (for example, 21 x 21 pixels) region of the image frame:

def _segment_arm(self, frame):
    """ segments the arm region based on depth """
    center_half = 10  # half-width of 21 is 21/2-1
    lowerHeight = self.height/2 - center_half
    upperHeight = self.height/2 + center_half
    lowerWidth = self.width/2 - center_half
    upperWidth = self.width/2 + center_half
    center = frame[lowerHeight:upperHeight, lowerWidth:upperWidth]

We can then reshape the depth values of this center region into a one-dimensional vector and determine the median depth value, med_val:

    med_val = np.median(center)

We can now compare med_val with the depth value of all pixels in the image and create a mask in which all pixels whose depth values are within a particular range [med_val-self.abs_depth_dev, med_val+self.abs_depth_dev] are white and all other pixels are black. However, for reasons that will be clear in a moment, let's paint the pixels gray instead of white:

    frame = np.where(abs(frame - med_val) <= self.abs_depth_dev, 128, 0).astype(np.uint8)

The result will look like this:

Applying morphological closing to smoothen the segmentation mask

A common problem with segmentation is that a hard threshold typically results in small imperfections (that is, holes, as in the preceding image) in the segmented region. These holes can be alleviated using morphological opening and closing. Opening removes small objects from the foreground (assuming that the objects are bright on a dark background), whereas closing removes small holes (dark regions). This means that we can get rid of the small black regions in our mask by applying morphological closing (dilation followed by erosion) with a small 3 x 3 pixel kernel:

    kernel = np.ones((3, 3), np.uint8)
    frame = cv2.morphologyEx(frame, cv2.MORPH_CLOSE, kernel)

The result looks a lot smoother, as follows: Notice, however, that the mask still contains regions that do not belong to the hand or arm, such as what appears to be one of my knees on the left and some furniture on the right. These objects just happen to be on the same depth layer as my arm and hand. If possible, we could now combine the depth information with another descriptor, maybe a texture-based or skeleton-based hand classifier, that would weed out all non-skin regions.
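As a quick aside on the closing operation used above, and purely as an illustrative sketch rather than part of the app's code, the difference between opening and closing is easy to see on a tiny synthetic mask:

import numpy as np
import cv2

# A synthetic 9x9 mask: a white blob with a one-pixel hole, plus an isolated white speck
mask = np.zeros((9, 9), np.uint8)
mask[2:7, 2:7] = 255      # foreground blob
mask[4, 4] = 0            # small hole inside the blob
mask[0, 8] = 255          # small isolated object

kernel = np.ones((3, 3), np.uint8)

# Closing (dilation followed by erosion) fills the small hole
closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

# Opening (erosion followed by dilation) removes the isolated speck
opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

print(closed[4, 4])  # 255: the hole is gone
print(opened[0, 8])  # 0: the speck is gone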
Finding connected components in a segmentation mask

An easier approach is to realize that, most of the time, hands are not connected to knees or furniture. We already know that the center region belongs to the hand, so we can simply apply cv2.floodFill to find all the connected image regions. Before we do this, we want to be absolutely certain that the seed point for the flood fill belongs to the right mask region. This can be achieved by assigning a grayscale value of 128 to the seed point. But we also want to make sure that the center pixel does not, by any coincidence, lie within a cavity that the morphological operation failed to close. So, let's set a small 7 x 7 pixel region to a grayscale value of 128 instead:

    small_kernel = 3
    frame[self.height/2-small_kernel : self.height/2+small_kernel, self.width/2-small_kernel : self.width/2+small_kernel] = 128

Because flood filling (as well as morphological operations) is potentially dangerous, the Python interface of later OpenCV versions requires specifying a mask that avoids flooding the entire image. This mask has to be 2 pixels wider and taller than the original image and has to be used in combination with the cv2.FLOODFILL_MASK_ONLY flag. It can be very helpful in constraining the flood filling to a small region of the image or a specific contour so that we need not connect two neighboring regions that should have never been connected in the first place. It's better to be safe than sorry, right? Ah, screw it! Today, we feel courageous! Let's make the mask entirely black:

    mask = np.zeros((self.height+2, self.width+2), np.uint8)

Then we can apply the flood fill to the center pixel (seed point) and paint all the connected regions white:

    flood = frame.copy()
    cv2.floodFill(flood, mask, (self.width/2, self.height/2), 255, flags=4 | (255 << 8))

At this point, it should be clear why we decided to start with a gray mask earlier. We now have a mask that contains white regions (arm and hand), gray regions (neither arm nor hand but other things in the same depth plane), and black regions (all others). With this setup, it is easy to apply a simple binary threshold to highlight only the relevant regions of the pre-segmented depth plane:

    ret, flooded = cv2.threshold(flood, 129, 255, cv2.THRESH_BINARY)

This is what the resulting mask looks like:

The resulting segmentation mask can now be returned to the recognize method, where it will be used as an input to _find_hull_defects as well as a canvas for drawing the final output image (img_draw).

Hand shape analysis

Now that we (roughly) know where the hand is located, we aim to learn something about its shape.

Determining the contour of the segmented hand region

The first step involves determining the contour of the segmented hand region. Luckily, OpenCV comes with a pre-canned version of such an algorithm, cv2.findContours. This function acts on a binary image and returns a set of points that are believed to be part of the contour. Because there might be multiple contours present in the image, it is possible to retrieve an entire hierarchy of contours:

def _find_hull_defects(self, segment):
    contours, hierarchy = cv2.findContours(segment, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

Furthermore, because we do not know which contour we are looking for, we have to make an assumption to clean up the contour result.
Since it is possible that some small cavities are left over even after the morphological closing—but we are fairly certain that our mask contains only the segmented area of interest—we will assume that the largest contour found is the one that we are looking for. Thus, we simply traverse the list of contours, calculate the contour area (cv2.contourArea), and store only the largest one (max_contour):

    max_contour = max(contours, key=cv2.contourArea)

Finding the convex hull of a contour area

Once we have identified the largest contour in our mask, it is straightforward to compute the convex hull of the contour area. The convex hull is basically the envelope of the contour area. If you think of all the pixels that belong to the contour area as a set of nails sticking out of a board, then the convex hull is the shape formed by a tight rubber band that surrounds all the nails. We can get the convex hull directly from our largest contour (max_contour):

    hull = cv2.convexHull(max_contour, returnPoints=False)

Because we now want to look at convexity defects in this hull, we are instructed by the OpenCV documentation to set the returnPoints optional flag to False. The convex hull drawn in green around a segmented hand region looks like this:

Finding convexity defects of a convex hull

As is evident from the preceding screenshot, not all points on the convex hull belong to the segmented hand region. In fact, all the fingers and the wrist cause severe convexity defects, that is, points of the contour that are far away from the hull. We can find these defects by looking at both the largest contour (max_contour) and the corresponding convex hull (hull):

    defects = cv2.convexityDefects(max_contour, hull)

The output of this function (defects) is a 4-tuple that contains start_index (the point of the contour where the defect begins), end_index (the point of the contour where the defect ends), farthest_pt_index (the point within the defect farthest from the convex hull), and fixpt_depth (the distance between the farthest point and the convex hull). We will make use of this information in just a moment when we reason about fingers. But for now, our job is done. The extracted contour (max_contour) and convexity defects (defects) can be passed to recognize, where they will be used as inputs to _detect_num_fingers:

    return (max_contour, defects)

Hand gesture recognition

What remains to be done is classifying the hand gesture based on the number of extended fingers. For example, if we find five extended fingers, we assume the hand to be open, whereas no extended fingers imply a fist. All that we are trying to do is count from zero to five and make the app recognize the corresponding number of fingers. This is actually trickier than it might seem at first. For example, people in Europe might count to three by extending their thumb, index finger, and middle finger. If you do that in the US, people there might get horrendously confused, because people do not tend to use their thumbs when signaling the number two. This might lead to frustration, especially in restaurants (trust me). If we could find a way to generalize these two scenarios—maybe by appropriately counting the number of extended fingers—we would have an algorithm that could teach simple hand gesture recognition not only to a machine but also (maybe) to an average waitress. As you might have guessed, the answer has to do with convexity defects. As mentioned earlier, extended fingers cause defects in the convex hull.
However, the inverse is not true; that is, not all convexity defects are caused by fingers! There might be additional defects caused by the wrist, as well as by the overall orientation of the hand or the arm. How can we distinguish between these different causes for defects?

Distinguishing between different causes for convexity defects

The trick is to look at the angle between the point within the defect that is farthest from the convex hull (farthest_pt_index) and the start and end points of the defect (start_index and end_index, respectively), as illustrated in the following screenshot: In this screenshot, the orange markers serve as a visual aid to center the hand in the middle of the screen, and the convex hull is outlined in green. Each red dot corresponds to the point farthest from the convex hull (farthest_pt_index) for every convexity defect detected. If we compare a typical angle that belongs to two extended fingers (such as θj) to an angle that is caused by general hand geometry (such as θi), we notice that the former is much smaller than the latter. This is obviously because humans can spread their fingers only a little, thus creating a narrow angle made by the farthest defect point and the neighboring fingertips. Therefore, we can iterate over all convexity defects and compute the angle between the said points. For this, we will need a utility function that calculates the angle (in radians) between two arbitrary, list-like vectors, v1 and v2:

def angle_rad(v1, v2):
    return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))

This method uses the cross product to compute the angle, rather than the standard way. The standard way of calculating the angle between two vectors v1 and v2 is by calculating their dot product and dividing it by the norm of v1 and the norm of v2. However, this method has two imperfections:

You have to manually avoid division by zero if either the norm of v1 or the norm of v2 is zero
The method returns relatively inaccurate results for small angles

Similarly, we provide a simple function to convert an angle from degrees to radians:

def deg2rad(angle_deg):
    return angle_deg/180.0*np.pi

Classifying hand gestures based on the number of extended fingers

What remains to be done is actually classifying the hand gesture based on the number of extended fingers. The _detect_num_fingers method will take as input the detected contour (contours), the convexity defects (defects), and a canvas to draw on (img_draw):

    def _detect_num_fingers(self, contours, defects, img_draw):

Based on these parameters, it will then determine the number of extended fingers. However, we first need to define a cut-off angle that can be used as a threshold to classify convexity defects as being caused by extended fingers or not. Except for the angle between the thumb and the index finger, it is rather hard to get anything close to 90 degrees, so anything close to that number should work. We do not want the cut-off angle to be too high, because that might lead to misclassifications:

        self.thresh_deg = 80.0

For simplicity, let's focus on the special cases first. If we do not find any convexity defects, it means that we possibly made a mistake during the convex hull calculation, or there are simply no extended fingers in the frame, so we return 0 as the number of detected fingers:

        if defects is None:
            return [0, img_draw]

But we can take this idea even further.
Due to the fact that arms are usually slimmer than hands or fists, we can assume that the hand geometry will always generate at least two convexity defects (which usually belong to the wrist). So if there are no additional defects, it implies that there are no extended fingers:

        if len(defects) <= 2:
            return [0, img_draw]

Now that we have ruled out all special cases, we can begin counting real fingers. If there are a sufficient number of defects, we will find a defect between every pair of fingers. Thus, in order to get the number right (num_fingers), we should start counting at 1:

        num_fingers = 1

Then we can start iterating over all convexity defects. For each defect, we will extract the four elements and draw its hull for visualization purposes:

        for i in range(defects.shape[0]):
            # each defect point is a 4-tuple
            start_idx, end_idx, farthest_idx, _ = defects[i, 0]
            start = tuple(contours[start_idx][0])
            end = tuple(contours[end_idx][0])
            far = tuple(contours[farthest_idx][0])

            # draw the hull
            cv2.line(img_draw, start, end, [0, 255, 0], 2)

Then we will compute the angle between the two edges from far to start and from far to end. If the angle is smaller than self.thresh_deg degrees, it means that we are dealing with a defect that is most likely caused by two extended fingers. In this case, we want to increment the number of detected fingers (num_fingers), and we draw the point in green. Otherwise, we draw the point in red:

            # if angle is below a threshold, defect point belongs
            # to two extended fingers
            if angle_rad(np.subtract(start, far), np.subtract(end, far)) < deg2rad(self.thresh_deg):
                # increment number of fingers
                num_fingers = num_fingers + 1
                # draw point as green
                cv2.circle(img_draw, far, 5, [0, 255, 0], -1)
            else:
                # draw point as red
                cv2.circle(img_draw, far, 5, [255, 0, 0], -1)

After iterating over all convexity defects, we pass the number of detected fingers and the assembled output image to the recognize method:

        return (min(5, num_fingers), img_draw)

This will make sure that we do not exceed the common number of fingers per hand. The result can be seen in the following screenshots: Interestingly, our app is able to detect the correct number of extended fingers in a variety of hand configurations. Defect points between extended fingers are easily classified as such by the algorithm, and others are successfully ignored.
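To tie the pieces together, here is a condensed, standalone sketch of the defect-angle counting logic described above. It is not the book's HandGestureRecognition class, just a minimal function assuming you already have a binary hand mask; the 80-degree cut-off mirrors the threshold used in this article:

import cv2
import numpy as np

def count_extended_fingers(mask, thresh_deg=80.0):
    """Estimate the number of extended fingers from a binary hand mask."""
    # findContours returns 2 or 3 values depending on the OpenCV version
    contours, _ = cv2.findContours(mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2:]
    if not contours:
        return 0
    contour = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull)
    if defects is None or len(defects) <= 2:
        return 0

    fingers = 1
    for i in range(defects.shape[0]):
        start_idx, end_idx, far_idx, _ = defects[i, 0]
        start = contour[start_idx][0]
        end = contour[end_idx][0]
        far = contour[far_idx][0]
        # angle at the defect point between the two edges toward the fingertips
        v1, v2 = start - far, end - far
        angle = np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))
        if angle < np.deg2rad(thresh_deg):
            fingers += 1
    return min(5, fingers)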
Summary

This article showed a relatively simple and yet surprisingly robust way of recognizing a variety of hand gestures by counting the number of extended fingers. The algorithm first shows how a task-relevant region of the image can be segmented using depth information acquired from a Microsoft Kinect 3D sensor, and how morphological operations can be used to clean up the segmentation result. By analyzing the shape of the segmented hand region, the algorithm comes up with a way to classify hand gestures based on the types of convexity defects found in the image. Once again, mastering our use of OpenCV to perform a desired task did not require us to produce a large amount of code. Instead, we were challenged to gain an important insight that made us use the built-in functionality of OpenCV in the most effective way possible. Gesture recognition is a popular but challenging field in computer science, with applications in a large number of areas, such as human-computer interaction, video surveillance, and even the video game industry. You can now use your advanced understanding of segmentation and structure analysis to build your own state-of-the-art gesture recognition system.

Resources for Article:

Further resources on this subject:

Tracking Faces with Haar Cascades [article]
Our First Machine Learning Method - Linear Classification [article]
Solving problems with Python: Closest good restaurant [article]