
Getting Started with Predictive Analytics

Packt
04 Jul 2017
32 min read
In this article by Ralph Winters, the author of the book Practical Predictive Analytics, we will explore how to get started with predictive analytics.

"In God we trust, all others must bring data." – Deming

I enjoy explaining predictive analytics to people because it is based upon a simple concept: predicting the probability of future events based upon historical data. Its history may date back to at least 650 BC. Some early examples include the Babylonians, who tried to predict short-term weather changes based on cloud appearances and haloes.

Medicine also has a long history of needing to classify diseases. The Babylonian king Adad-apla-iddina decreed that medical records be collected to form the "Diagnostic Handbook". Some "predictions" in this corpus list treatments based on the number of days the patient had been sick and their pulse rate – one of the first instances of bioinformatics!

In later times, specialized predictive analytics were developed at the onset of the insurance underwriting industry, as a way to predict the risk associated with insuring marine vessels. At about the same time, life insurance companies began predicting the age to which a person would live in order to set the most appropriate premium rates.

Although the idea of prediction was always rooted in humans' desire to understand and classify, it was not until the 20th century, and the advent of modern computing, that it really took hold. In addition to aiding the British government with codebreaking in the 1940s, Alan Turing also worked on the initial computer chess algorithms which pitted man against machine. Monte Carlo simulation methods originated as part of the Manhattan Project, where mainframe computers crunched numbers for days in order to model the behavior of nuclear chain reactions. In the 1950s, operations research theory developed, in which one could optimize the shortest distance between two points. To this day, these techniques are used in logistics by companies such as UPS and Amazon.

Non-mathematicians have also gotten into the act. In the 1970s, cardiologist Lee Goldman (who worked aboard a submarine) spent years developing a decision tree for assessing chest pain efficiently. This helped the staff determine whether or not the submarine needed to resurface in order to help the chest pain sufferer!

What many of these examples had in common was that history was used to predict the future. Along with prediction came an understanding of cause and effect and of how the various parts of the problem were interrelated. Discovery and insight came about through methodology and adherence to the scientific method. Most importantly, the solutions came about in order to address important, and often practical, problems of the times. That is what made them unique.

Predictive analytics adopted by many different industries

We have come a long way since then, and predictive analytics solutions have furthered growth in many different industries. The internet has had a profound effect on this; it has enabled every click to be stored and analyzed. More data is being collected and stored, some with very little effort. That in itself has enabled more industries to enter predictive analytics.

Marketing has always been concerned with customer acquisition and retention, and has developed predictive models involving various promotional offers and customer touch points, all geared to keeping customers and acquiring new ones.
This is very pronounced in certain industries, such as wireless and online shopping carts, in which customers are always searching for the best deal. Specifically, advanced analytics can help answer questions like "If I offer a customer 10% off with free shipping, will that yield more revenue than 15% off with no free shipping?". The 360-degree view of the customer has expanded the number of ways one can engage with the customer, therefore enabling marketing mix and attribution modeling to become increasingly important. Location-based devices have enabled marketing predictive applications to incorporate real-time data and issue recommendations to the customer while in the store.

Predictive analytics in healthcare has its roots in clinical trials, which use carefully selected samples to test the efficacy of drugs and treatments. However, healthcare has been going beyond this. With the advent of sensors, data can be incorporated into predictive analytics to monitor patients with critical illness and to send alerts to patients when they are at risk. Healthcare companies can now predict which individual patients will comply with courses of treatment advocated by health providers. This sends early warning signs to all parties, which helps prevent future complications as well as lower the total costs of treatment.

Other examples can be found in just about every other industry. Here are just a few:

Finance: Fraud detection is a huge area. Financial institutions can monitor clients' internal and external transactions for fraud through pattern recognition, and then alert a customer concerning suspicious activity. There is also Wall Street program trading, where trading algorithms predict intraday highs and lows and decide when to buy and sell securities.

Sports management: Teams are able to predict which sports events will yield the greatest attendance and institute variable ticket pricing based upon audience interest. In baseball, a pitcher's entire game can be recorded and then digitally analyzed. Sensors can also be attached to the pitcher's arm to warn when a future injury might occur.

Higher education: Colleges can predict how many, and which kind of, students are likely to attend the next semester, and can plan resources accordingly. Time-based assessments of online modules enable professors to identify students' potential problem areas and tailor individual instruction.

Government: Federal and state governments have embraced the open data concept and have made more data available to the public. This has empowered "citizen data scientists" to help solve critical social and government problems. The potential of using this data for emergency services, traffic safety, and healthcare is overwhelmingly positive.

Although these industries can be quite different, the goals of predictive analytics are typically to increase revenue, decrease costs, or alter outcomes for the better.

Skills and roles which are important in predictive analytics

So what skills do you need to be successful in predictive analytics? I believe that there are three basic skills that are needed:

Algorithmic/statistical/programming skills – These are the actual technical skills needed to implement a technical solution to a problem. I bundle these together since they are typically used in tandem. Will it be a purely statistical solution, or will there need to be a bit of programming thrown in to customize an algorithm and clean the data?
There are always multiple ways of doing the same task, and it will be up to you, the predictive modeler, to determine how it is to be done.

Business skills – These are the skills needed for communicating thoughts and ideas among all of the interested parties. Business and data analysts who have worked in certain industries for long periods of time, and know their business very well, are increasingly being called upon to participate in predictive analytics projects. Data science is becoming a team sport, and most projects include working with others in the organization. Summarizing findings, and having good presentation and documentation skills, are important. You will often hear the term 'domain knowledge' associated with this, since it is always valuable to know the inner workings of the industry you are working in. If you do not have the time or inclination to learn all about the inner workings of the problem at hand yourself, partner with someone who does.

Data storage / ETL skills – This can refer to specialized knowledge regarding extracting data and storing it in a relational, or non-relational NoSQL, data store. Historically, these tasks were handled exclusively within a data warehouse, but now that the age of big data is upon us, specialists have emerged who understand the intricacies of data storage and the best way to organize it.

Related job skills and terms

Along with the term predictive analytics, here are some terms which are very much related:

Predictive modeling: This specifically means using a mathematical/statistical model to predict the likelihood of a dependent or target variable.

Artificial intelligence: A broader term for how machines are able to rationalize and solve problems. AI's early days were rooted in neural networks.

Machine learning: A subset of artificial intelligence. It specifically deals with how a machine learns automatically from data, usually to try to replicate human decision making or to best it. At this point, everyone knows about Watson, which beat two human opponents in "Jeopardy!".

Data science: Data science encompasses predictive analytics, but also adds algorithmic development via coding, and good presentation skills via visualization.

Data engineering: Data engineering concentrates on data extraction and data preparation processes, which allow raw data to be transformed into a form suitable for analytics. A knowledge of system architecture is important. The data engineer will typically produce the data to be used by the predictive analysts (or data scientists).

Data analyst/business analyst/domain expert: An umbrella term for someone who is well versed in the way the business at hand works, and who is an invaluable person to learn from in terms of what may have meaning and what may not.

Statistics: The classical form of inference, typically done via hypothesis testing.

Predictive analytics software

Originally, predictive analytics was performed by hand by statisticians on mainframe computers, using a progression of languages such as FORTRAN. Some of these languages are still very much in use today. FORTRAN, for example, is still one of the fastest performing languages around, and operates with very little memory. Nowadays there are so many choices of software, and many loyalists remain true to their chosen software. The reality is that, for solving a specific type of predictive analytics problem, there exists a certain amount of overlap, and certainly the goal is the same.
Once you get the hang of the methodologies used for predictive analytics in one software package, it should be fairly easy to translate your skills to another package.

Open source software

Open source emphasizes agile development and community sharing. Of course, open source software is free, but "free" must also be balanced in the context of TCO (Total Cost of Ownership).

R

The R language is derived from the "S" language, which was developed in the 1970s. However, the R language has grown beyond the original core packages to become an extremely viable environment for predictive analytics. Although R was developed by statisticians for statisticians, it has come a long way from its early days. The strength of R comes from its 'package' system, which allows specialized or enhanced functionality to be developed and 'linked' to the core system.

Although the original R system was sufficient for statistics and data mining, an important goal of R was to have its system enhanced via user-written contributed packages. As of this writing, the R system contains more than 8,000 packages. Some are of excellent quality, and some are of dubious quality. Therefore, the goal is to find the truly useful packages that add the most value. Most, if not all, of the R packages in use address the common predictive analytics tasks that you will encounter. If you come across a task that does not fit into any category, chances are good that someone in the R community has done something similar. And of course, there is always a chance that someone is developing a package to do exactly what you want it to do. That person could eventually be you!

Closed source software

Closed source software such as SAS and SPSS was at the forefront of predictive analytics, and has continued to this day to extend its reach beyond the traditional realm of statistics and machine learning. Closed source software emphasizes stability, better support, and security, with better memory management, which are important factors for some companies. There is much debate nowadays regarding which one is 'better'. My prediction is that they will both coexist peacefully, with one not replacing the other. Data sharing and common APIs will become more common. Each has its place within whatever data architecture and ecosystem is deemed correct for a company. Each company will emphasize certain factors, and both open and closed source software systems are constantly improving.

Other helpful tools

Man does not live by bread alone, so it would behoove you to learn additional tools beyond R, so as to advance your analytic skills.

SQL – SQL is a valuable tool to know, regardless of which language/package/environment you choose to work in. Virtually every analytics tool will have a SQL interface, and a knowledge of how to optimize SQL queries will definitely speed up your productivity, especially if you are doing a lot of data extraction directly from a SQL database. Today's common thought is to do as much preprocessing as possible within the database, so if you will be doing a lot of extracting from databases like MySQL, PostgreSQL, Oracle, or Teradata, it will be a good thing to learn how queries are optimized within their native framework. In the R language, there are several SQL packages that are useful for interfacing with various external databases. We will be using sqldf, which is a popular R package for running SQL queries against R data frames; a brief illustration follows.
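Here is a minimal sketch (my illustration, not the book's) of what an sqldf query against a data frame looks like. It uses the built-in women dataset, the same data frame that appears later in this article:

# install.packages("sqldf")    # uncomment if sqldf is not yet installed
library(sqldf)

data(women)                    # built-in data frame with columns height and weight
sqldf("SELECT height, weight
       FROM   women
       WHERE  weight > 140")   # ordinary SQL, run directly against the data frame

The result comes back as a regular R data frame, so SQL steps and R steps can be mixed freely in the same script.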
There are other packages which are specifically tailored to the particular database you will be working with.

Web extraction tools – Not every data source will originate from a data warehouse. Knowledge of APIs which extract data from the internet will be valuable. Some popular tools include curl and jsonlite.

Spreadsheets – Despite their problems, spreadsheets are often the fastest way to do quick data analysis and, more importantly, they enable you to share your results with others! R offers several interfaces to spreadsheets but, again, learning standalone spreadsheet skills like PivotTables and VBA will give you an advantage if you work for corporations in which these skills are heavily used.

Data visualization tools – Data visualization tools are great for adding impact to an analysis and for concisely encapsulating complex information. Native R visualization tools are great, but not every company will be using R. Learn some third-party visualization tools such as D3.js, Google Charts, Qlikview, or Tableau.

Big data (Spark, Hadoop, NoSQL databases) – It is becoming increasingly important to know a little bit about these technologies, at least from the viewpoint of having to extract and analyze data which resides within these frameworks. Many software packages have APIs which talk directly to Hadoop and can run predictive analytics within the native environment, or extract data and perform the analytics locally.

After you are past the basics

Given that the predictive analytics space is so huge, once you are past the basics, ask yourself what area of predictive analytics really interests you and what you would like to specialize in. Learning all you can about everything concerning predictive analytics is good at the beginning, but ultimately you will be called upon because you are an expert in certain industries or techniques. This could be research, algorithmic development, or even managing analytics teams.

As general guidance, if you are involved in, or oriented towards, the analytics or research portion of data science, I would suggest that you concentrate on data mining methodologies and the specific data modeling techniques which are heavily prevalent in the industries that interest you. For example, logistic regression is heavily used in the insurance industry, but social network analysis is not. Economic research is geared towards time-series analysis, but not so much towards cluster analysis.

If you are involved more on the data engineering side, concentrate on data cleaning, being able to integrate various data sources, and the tools needed to accomplish this. If you are a manager, concentrate on model development, testing and control, metadata, and presenting results to upper management in order to demonstrate value.

Of course, predictive analytics is becoming more of a team sport than a solo endeavor, and the data science team is very much alive. A lot has been written about the components of a data science team, much of which can be reduced to the three basic skills that I outlined earlier.

Two ways to look at predictive analytics

Depending upon how you intend to approach a particular problem, consider how two different analytical mindsets can affect the predictive analytics process.

Minimize prediction error goal: This is a very common use case within machine learning. The initial goal is to predict, using the appropriate algorithms, in order to minimize the prediction error.
If done incorrectly, an algorithm will ultimately fail and will need to be continually optimized to come up with the "new" best algorithm. If this is performed mechanically, without regard to understanding the model, it will certainly result in failed outcomes. Certain models, especially over-optimized ones with many variables, can have a very high prediction rate but be unstable in a variety of ways. If one does not have an understanding of the model, it can be difficult to react to changes in the data inputs.

Understanding model goal: This came out of the scientific method and is tied closely to the concept of hypothesis testing. It can be done in certain kinds of models, such as regression and decision trees, and is more difficult in other kinds of models, such as SVMs and neural networks. In the understanding model paradigm, understanding causation or impact becomes more important than optimizing correlations. Typically, "understanding" models have a lower prediction rate, but have the advantage of knowing more about the causations of the individual parts of the model and how they are related. For example, industries which rely on understanding human behavior emphasize model understanding goals. A limitation of this orientation is that we might tend to discard results that are not immediately understood.

Of course, the above examples illustrate two disparate approaches. Combination models, which use the best of both worlds, should be the ones we strive for: a model which has an acceptable prediction error, is stable over time, and is simple enough to understand. You will learn later that this is related to the bias/variance tradeoff.

R installation

R installation is typically done by downloading the software directly from the CRAN site:

1. Navigate to https://cran.r-project.org/
2. Install the version of R appropriate for your operating system.

Alternate ways of exploring R

Although installing R directly from the CRAN site is the way most people will proceed, I wanted to mention some alternative R installation methods. These methods are often good in instances when you are not always at your own computer.

Virtual environments – Here are a few ways to run R in a virtual environment:

VirtualBox or VMware – Virtual environments are good for setting up protected environments and loading preinstalled operating systems and packages. Some advantages are that they are good for isolating testing areas, and for when you do not wish to take up additional space on your own machine.

Docker – Docker resembles a virtual machine, but is a bit more lightweight since it does not emulate an entire operating system; it emulates only the needed processes (see Rocker, an R Docker container).

Cloud based – Here are a few ways to run R in the cloud:

AWS/Azure – These are cloud-based environments. The advantages are similar to those for VirtualBox, with the additional capability to run with very large datasets and with more memory. They are not free, but both AWS and Azure offer free tiers.

Web based – Interested in running R on the web? These sites are good for trying out quick analyses. R-Fiddle is a good choice; however, there are others, including R-Web, ideaone.com, Jupyter, DataJoy, tutorialspoint, and Anaconda Cloud, to name just a few.

Command line – If you spend most of your time in a text editor, try ESS (Emacs Speaks Statistics).

How is a predictive analytics project organized?
After you install R on your own machine, give some thought to how you want to organize your data, code, documentation, and so on. There will probably be many different kinds of projects that you will need to set up, ranging from exploratory analyses to full production-grade implementations. However, most projects will be somewhere in the middle, i.e. projects which ask a specific question or a series of related questions. Whatever their purpose, each project you work on deserves its own project folder or directory.

Set up your project and subfolders

We will start by creating folders for our environment:

1. Create a subdirectory named "PracticalPredictiveAnalytics" somewhere on your computer. We will be referring to it by this name throughout this book.
2. Projects often start with three subfolders, which roughly correspond to (1) source data, (2) generated outputs, and (3) the code itself (in this case R). Create three subdirectories under this project: Data, Outputs, and R.

The R directory will hold all of our data prep code, algorithms, and so on. The Data directory will contain our raw data sources, and the Outputs directory will contain anything generated by the code. This can be done natively within your own environment; e.g. you can use Windows Explorer to create these folders.

Some important points to remember about constructing projects

It is never a good idea to 'boil the ocean', or try to answer too many questions at once. Remember, predictive analytics is an iterative process.

Another trap that people fall into is not making their project reproducible. Nothing is worse than developing some analytics on a set of data, then backtracking and – oops! – getting different results.

When organizing code, try to write it as building blocks which can be reused. For R, write code liberally as functions.

Assume that anything concerning requirements, data, and outputs will change, and be prepared.

Consider the dynamic nature of the R language. Changes in versions and packages could all change your analysis in various ways, so it is important to keep code and data in sync, either by using separate folders for the different levels of code and data, or by using a version management system such as Subversion, Git, or CVS.

GUIs

R, like many languages and knowledge discovery systems, started from the command line (one reason to learn Linux), which is still used by many. However, predictive analysts tend to prefer graphical user interfaces, and there are many choices available for each of the three major operating systems. Each of them has its strengths and weaknesses, and of course there is always a matter of preference. Memory is always a consideration with R, and if that is of critical concern to you, you might want to go with a simpler GUI, like the one built into R. If you want full control, and you want to add some productivity tools, you could choose RStudio, which is a full-blown GUI that allows you to work with version control repositories and has nice features like code completion. RCmdr's and Rattle's unique feature is that they offer menus which allow guided point-and-click commands for common statistical and data mining tasks. They are also both code generators. This is good for learning, since you can learn by looking at the way code is generated. Both RCmdr and RStudio offer GUIs which are compatible with Windows, Apple, and Linux operating systems, so those are the ones I will use to demonstrate examples in this book.
But bear in mind that they are only user interfaces, and not R proper, so it should be easy enough to paste code examples into other GUIs and decide for yourself which ones you like.

Getting started with RStudio

After the R installation has completed, download and install the RStudio executable appropriate for your operating system. Click the RStudio icon to bring up the program. The program initially starts with three tiled window panes.

Before we begin to do any actual coding, we will want to set up a new project. Create a new project by following these steps:

1. Identify the menu bar, above the icons at the top left of the screen.
2. Click "File" and then "New Project".
3. Select "Create project from Directory".
4. Select "Empty Project".
5. Name the directory "PracticalPredictiveAnalytics", then click the Browse button to select your preferred location. This will be the directory that you created earlier.
6. Click "Create Project" to complete.

The R Console

Now that we have created a project, let's take a look at the R Console window. Click on the window marked "Console" and perform the following steps:

Enter getwd() and press Enter – that should echo back the current working directory.
Enter dir() – that will give you a list of everything in the current working directory.

The getwd() command is very important, since it will always tell you which directory you are in. Sometimes you will need to switch directories within the same project, or even to another project. The command you will use is setwd(); you supply the directory that you want to switch to within the parentheses. This is a situation we will come across later. We will not change anything right now. The point of this is that you should always be aware of what your current working directory is.

The Script Window

The script window is where all of the R code is written. You can have several script windows open at once. Press Ctrl + Shift + N to create a new R script. Alternatively, you can go through the menu system by selecting File/New File/R Script. A new blank script window will appear with the name "Untitled1".

Our First Predictive Model

Now that all of the preliminary things are out of the way, we will code our first, extremely simple, predictive model. Our first R script is a simple two-variable regression model which predicts women's height based upon weight. The dataset we will use is already built into the R package system, and it is not necessary to load it externally. For quick illustration of techniques, I will sometimes use sample data contained within specific R packages to demonstrate.

Paste the following code into the "Untitled1" script that was just created:

require(graphics)
data(women)
head(women)
utils::View(women)
plot(women$height,women$weight)

Press Ctrl+Shift+Enter to run the entire code. The display should change to something similar to what is described below.

Code description

What you have actually done is:

Load the women data object. The data() function will load the specified data object into memory. In our case, the data(women) statement says load the women data frame into memory.

Display the raw data in three different ways:

utils::View(women) – This will visually display the data frame. Although this is part of the actual R script, viewing a data frame is a very common task and is often issued directly as a command via the R Console. The women data frame has 15 rows and 2 columns, named height and weight.
plot(women$height,women$weight) – This uses the native R plot function, which plots the values of the two variables against each other. It is usually the first step one takes to begin to understand the relationship between two variables. As you can see, the relationship is very linear.

head(women) – This displays the first N rows of the women data frame in the console. If you want no more than a certain number of rows, add that as a second argument of the function; e.g. head(women, 99) will display up to 99 rows in the console. The tail() function works similarly, but displays the last rows of data.

The very first statement in the code, require(), is just a way of saying that R needs a specific package to run. In this case require(graphics) specifies that the graphics package is needed for the analysis, and it will load it into memory. If it is not available, you will get an error message. However, graphics is a base package and should be available.

To save this script, press Ctrl+S (File Save), navigate to the PracticalPredictiveAnalytics/R folder that was created, and name it Chapter1_DataSource.

Your second script

Create another R script by pressing Ctrl + Shift + N. A new blank script window will appear with the name "Untitled2". Paste the following into the new script window:

lm_output <- lm(women$height ~ women$weight)
summary(lm_output)
prediction <- predict(lm_output)
error <- women$height - prediction
plot(women$height,error)

Press Ctrl+Shift+Enter to run the entire code. The display should change to something similar to what is described below.

Code description

Here are some notes and explanations for the script code that you have just run:

lm() function: This runs a simple linear regression which predicts women's height based upon the value of their weight. In statistical parlance, you will be 'regressing' height on weight. The line of code which accomplishes this is:

lm_output <- lm(women$height ~ women$weight)

There are two operators that you will become very familiar with when running predictive models in R:

The ~ operator (also called the tilde) is a shorthand way of separating what you want to predict from what you are using to predict it. This expression is written in formula syntax. What you are predicting (the dependent or target variable) is usually on the left side of the formula, and the predictors (independent variables, or features) are on the right side. Here the dependent variable is height and the independent variable is weight; to improve readability, I have specified them explicitly by using the data frame name together with the column name, i.e. women$height and women$weight.

The <- operator (also called assignment) says assign whatever is on the right side to the object named on the left side. This will always create or replace an object that you can further display or manipulate. In this case we are creating a new object called lm_output, which is created using the function lm(), which builds a linear model based on the formula contained within the parentheses.

Note that the execution of this line does not produce any displayed output. You can see whether the line was executed by checking the console. If there is any problem with running the line (or any line, for that matter), you will see an error message in the console.
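As a side note of mine (not from the book): lm() also accepts a data argument, so the same model can be written without repeating the data frame name:

# Equivalent to lm(women$height ~ women$weight); the formula now refers to
# columns of the data frame named in the data argument
lm_output <- lm(height ~ weight, data = women)

Either form produces an identical fit; many R users find the data argument easier to read, and it carries over naturally when you later call predict() on new data.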
summary(lm_output): This statement displays some important summary information about the object lm_output and writes the output to the R Console:

summary(lm_output)

The results will appear in the Console window. Look at the lines marked (Intercept) and women$weight, which appear under the Coefficients line in the console. The Estimate column shows the formula needed to derive height from weight. Like any linear regression formula, it includes coefficients for each independent variable (in our case, only one variable), as well as an intercept. For our example, the English rule would be "multiply weight by 0.2872 and add 25.7235 to obtain height".

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.723456   1.043746   24.64 2.68e-12 ***
women$weight  0.287249   0.007588   37.85 1.09e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.44 on 13 degrees of freedom
Multiple R-squared:  0.991,    Adjusted R-squared:  0.9903
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

We have already assigned the output of the lm() function to the lm_output object. Let's apply another function to lm_output as well. The predict() function "reads" the output of the lm() function and predicts (or scores) the value based upon the linear regression equation. In the code we have assigned the output of this function to a new object named prediction.

Switch over to the console area and type prediction to see the predicted values for the 15 women. The following should appear in the console:

> prediction
       1        2        3        4        5        6        7
58.75712 59.33162 60.19336 61.05511 61.91686 62.77861 63.64035
       8        9       10       11       12       13       14
64.50210 65.65110 66.51285 67.66184 68.81084 69.95984 71.39608
      15
72.83233

There are 15 predictions. Just to verify that we have one for each of our original observations, we will use the nrow() function to count the number of rows. At the command prompt in the console area, enter the command:

nrow(women)

The following should appear:

> nrow(women)
[1] 15

The error object is a vector that was computed by taking the difference between the actual heights and the predicted heights. These values are also known as the residual errors, or just residuals. Since the error object is a vector, you cannot use the nrow() function to get its size, but you can use the length() function:

> length(error)
[1] 15

In all of the above cases, the counts come out as 15, so all is good.

plot(women$height,error): This plots the heights against the residuals. It shows how much the prediction was 'off' from the original value. You can see that the errors show a non-random pattern, which is not good. In an ideal regression model, you expect to see the prediction errors randomly scattered around the 0 point on the y axis.

Some important points to be made regarding this first example:

The R-squared for this model is artificially high. Regression is often used in an exploratory fashion to explore the relationship between height and weight. This does not mean a causal one. As we all know, weight is influenced by many other factors, and it is expected that taller people will be heavier. A predictive modeler who is examining the relationship between height and weight would probably want to introduce additional variables into the model, even at the expense of a lower R-squared.
R-squared values can be deceiving, especially when they are artificially high.

After you are done, press Ctrl+S (File Save), navigate to the PracticalPredictiveAnalytics/R folder that was created, and name the script Chapter1_LinearRegression.

Installing a package

Sometimes the amount of information output by statistical packages can be overwhelming, and we want to reduce the amount of output and reformat it so that it is easier on the eyes. Fortunately, there is an R package which reformats and simplifies some of the more important statistics. One package I will be using is named stargazer.

Create another R script by pressing Ctrl + Shift + N. Enter the following lines and then press Ctrl+Shift+Enter to run the entire script:

install.packages("stargazer")
library(stargazer)
stargazer(lm_output, title="Lm Regression on Height", type="text")

After the script has been run, the output should appear in the Console.

Code description

install.packages("stargazer") – This line installs the package to the default package directory on your machine. Make sure you choose a CRAN mirror before you download.

library(stargazer) – This line loads the stargazer package.

stargazer(lm_output, title="Lm Regression on Height", type="text") – The reformatted results will appear in the R Console. As you can see, the output written to the console is much cleaner and easier to read.

After you are done, press Ctrl+S (File Save), navigate to the PracticalPredictiveAnalytics/Outputs folder that was created, and name it Chapter1_LinearRegressionOutput.

Installing other packages

The rest of the book concentrates on what I think are the core packages used for predictive modeling. There are always new packages coming out. I tend to favor packages which have been on CRAN for a long time and have a large user base. When installing something new, I will try to reference the results against other packages which do similar things. Speed is another reason to consider adopting a new package.

Summary

In this article we have learned a little about what predictive analytics is and how it can be used in various industries. We learned some things about data and how it can be organized into projects. Finally, we installed RStudio, ran a simple linear regression, and installed and used our first package. We learned that it is always good practice to examine data after it has been brought into memory, and that a lot can be learned from simply displaying and plotting the data.


Getting Started with Python and Machine Learning

Packt
23 Jun 2017
34 min read
In this article by Ivan Idris, Yuxi (Hayden) Liu, and Shoahoa Zhang, authors of the book Python Machine Learning By Example, we cover basic machine learning concepts. If you need more information, then your local library, the Internet, and Wikipedia in particular should help you further. The topics that will be covered in this article are as follows:

- What is machine learning and why do we need it?
- A very high level overview of machine learning
- Generalizing with data
- Overfitting and the bias variance trade off
- Dimensions and features
- Preprocessing, exploration, and feature engineering
- Combining models
- Installing software and setting up
- Troubleshooting and asking for help

What is machine learning and why do we need it?

Machine learning is a term coined around 1960, composed of two words – machine, corresponding to a computer, robot, or other device, and learning, an activity which most humans are good at. So we want machines to learn – but why not leave learning to humans? It turns out that there are many problems, for instance those involving huge datasets, where it makes sense to let computers do all the work. In general, of course, computers and robots don't get tired, don't have to sleep, and may be cheaper. There is also an emerging school of thought called active learning or human-in-the-loop, which advocates combining the efforts of machine learners and humans. The idea is that there are routine, boring tasks more suitable for computers, and creative tasks more suitable for humans. According to this philosophy, machines are able to learn, but not everything.

Machine learning doesn't involve the traditional type of programming that uses business rules. A popular myth says that the majority of the code in the world has to do with simple rules, possibly programmed in COBOL, which cover the bulk of all the possible scenarios of client interactions. So why can't we just hire many coders and continue programming new rules? One reason is that the cost of defining and maintaining rules becomes very expensive over time. A lot of the rules involve matching strings or numbers, and change frequently. It's much easier and more efficient to let computers figure out everything themselves from data. Also, the amount of data is increasing, and the rate of growth is itself accelerating. Nowadays the floods of textual, audio, image, and video data are hard to fathom. The Internet of Things is a recent development of a new kind of Internet, which interconnects everyday devices. The Internet of Things will bring data from household appliances and autonomous cars to the forefront. The average company these days has mostly human clients, but, for instance, social media companies tend to have many bot accounts. This trend is likely to continue, and we will have more machines talking to each other.

An application of machine learning that you may be familiar with is the spam filter, which filters e-mails considered to be spam. Another is online advertising, where ads are served automatically based on information advertisers have collected about us. Yet another application of machine learning is search engines. Search engines extract information about web pages so that you can efficiently search the Web. Many online shops and retail companies have recommender engines, which recommend products and services using machine learning. The list of applications is very long, and also includes fraud detection, medical diagnosis, sentiment analysis, and financial market analysis.
In the 1983 movie War Games, a computer made life-and-death decisions which could have resulted in World War III. As far as I know, technology wasn't able to pull off such feats at the time. However, in 1997 the Deep Blue supercomputer did manage to beat a world chess champion. In 2005, a Stanford self-driving car drove by itself for more than 130 kilometers in a desert. In 2007, the car of another team drove through regular traffic for more than 50 kilometers. In 2011, the Watson computer won a quiz against human opponents. In 2016, the AlphaGo program beat one of the best Go players in the world. If we assume that computer hardware is the limiting factor, then we can try to extrapolate into the future. Ray Kurzweil did just that and, according to him, we can expect human-level intelligence around 2029.

A very high level overview of machine learning

Machine learning is a subfield of artificial intelligence – a field of computer science concerned with creating systems which mimic human intelligence. Software engineering is another field of computer science, and we can label Python programming as a type of software engineering. Machine learning is also closely related to linear algebra, probability theory, statistics, and mathematical optimization. We use optimization and statistics to find the best models which explain our data. If you are not feeling confident about your mathematical knowledge, you are probably wondering how much time you should spend learning or brushing up. Fortunately, to get machine learning to work for you, most of the time you can ignore the mathematical details, as they are implemented by reliable Python libraries. You do need to be able to program.

As far as I know, if you want to study machine learning, you can enroll in computer science, artificial intelligence, and, more recently, data science master's programs. There are various data science bootcamps; however, the selection for those is stricter and the course duration is often just a couple of months. You can also opt for the free massive open online courses available on the Internet. Machine learning is not only a skill, but also a bit of a sport. You can compete in several machine learning competitions, sometimes for decent cash prizes. However, to win these competitions you may need to utilize techniques which are only useful in the context of competitions and not in the context of trying to solve a business problem.

A machine learning system requires inputs – these can be numerical, textual, visual, or audiovisual data. The system usually has outputs – for instance, a floating point number, an integer representing a category (also called a class), or the acceleration of a self-driving car. We need to define (or have defined for us by an algorithm) an evaluation function, called a loss or cost function, which tells us how well we are learning. In this setup, we create an optimization problem with the goal of learning in the most efficient way. For instance, if we fit data to a line (also called linear regression), the goal is to have our line be as close as possible to the data points on average.

We can have unlabeled data which we want to group or explore – this is called unsupervised learning. Unsupervised learning can be used to detect anomalies, such as fraud or defective equipment. We can also have labeled examples to use for training – this is called supervised learning.
The labels are usually provided by human experts, but if the problem is not too hard, they may also be produced by members of the public through crowdsourcing, for instance. Supervised learning is very common, and we can further subdivide it into regression and classification. Regression trains on continuous target variables, while classification attempts to find the appropriate class label. If not all examples are labeled, but some are, we have semi-supervised learning. A chess-playing program usually applies reinforcement learning – this is a type of learning where the program evaluates its own performance by, for instance, playing against itself or earlier versions of itself.

We can roughly categorize machine learning algorithms into logic-based learning, statistical learning, artificial neural networks, and genetic algorithms. In fact, we have a whole zoo of algorithms, with popularity varying over time. The logic-based systems were the first to be dominant. They used basic rules specified by human experts, and with these rules the systems tried to reason using formal logic. In the mid-1980s, artificial neural networks came to the foreground, only to be pushed aside by statistical learning systems in the 1990s. Artificial neural networks imitate animal brains and consist of interconnected neurons that are also an imitation of biological neurons. Genetic algorithms were pretty popular in the 1990s (or at least that was my impression). Genetic algorithms mimic the biological process of evolution. We are currently (2016) seeing a revolution in deep learning, which we may consider to be a rebranding of neural networks. The term deep learning was coined around 2006, and it refers to deep neural networks with many layers. The breakthrough in deep learning is caused, among other things, by the availability of graphics processing units (GPUs), which speed up computation considerably. GPUs were originally intended to render video games, and are very good at parallel matrix and vector algebra. It is believed that deep learning resembles the way humans learn, and therefore it may be able to deliver on the promise of sentient machines.

You may have heard of Moore's law – an empirical law which claims that computer hardware improves exponentially with time. The law was first formulated by Gordon Moore, the co-founder of Intel, in 1965. According to the law, the number of transistors on a chip should double every two years. Plotting transistor counts over time shows that the law holds up nicely (in the book's accompanying graph, the size of the bubbles corresponds to the average transistor count in GPUs). The consensus seems to be that Moore's law should continue to be valid for a couple of decades. This gives some credibility to Ray Kurzweil's prediction of achieving true machine intelligence around 2029.

We will encounter many of the types of machine learning later in this book. Most of the content is about supervised learning, with some examples of unsupervised learning. The popular Python libraries support all these types of learning.

Generalizing with data

The good thing about data is that we have a lot of it in the world. The bad thing is that it is hard to process. The challenges stem from the diversity and noisiness of the data. We as humans usually process data coming into our ears and eyes. These inputs are transformed into electrical or chemical signals. On a very basic level, computers and robots also work with electrical signals. These electrical signals are then translated into ones and zeroes.
However, we program in Python in this book, and on that level we normally represent the data as numbers, images, or text. Actually, images and text are not very convenient, so we need to transform them into numerical values.

Especially in the context of supervised learning, we have a scenario similar to studying for an exam. We have a set of practice questions and the actual exam. We should be able to answer exam questions without knowing the answers to them beforehand. This is called generalization – we learn something from our practice questions and hopefully are able to apply this knowledge to other, similar questions. Finding good representative examples is not always an easy task, depending on the complexity of the problem we are trying to solve and how generic the solution needs to be.

An old-fashioned programmer would talk to a business analyst or other expert, and then implement a rule that, for instance, adds a certain value multiplied by another value, corresponding to tax rules. In a machine learning setup, we can give the computer example input values and example output values. Or, if we are more ambitious, we can feed the program the actual tax texts and let the machine process the data further, just as an autonomous car doesn't need a lot of human input. This means implicitly that there is some function, for instance a tax formula, that we are trying to find. In physics we have almost the same situation: we want to know how the universe works, and formulate laws in a mathematical language. Since we don't know the actual function, all we can do is measure our error and try to minimize it. In supervised learning we compare our results against the expected values. In unsupervised learning we measure our success with related metrics; for instance, we want clusters of data to be well defined. In reinforcement learning, a program evaluates its moves, for example in a chess game, using some predefined function.

Overfitting and the bias variance trade off

Overfitting (one word) is such an important concept that I decided to start discussing it very early in this book. If you go through many practice questions for an exam, you may start to find ways to answer questions which have nothing to do with the subject material. For instance, you may find that if you have the word potato in the question, the answer is A, even if the subject has nothing to do with potatoes. Or, even worse, you may memorize the answers to each question verbatim. We can then score high on the practice questions, which are called the train set in machine learning; however, we will score very low on the exam questions, which are called the test set. This phenomenon of memorization is called bias.

Overfitting almost always means that we are trying too hard to extract information from the data (random quirks), and using more training data will not help. The opposite scenario is called underfitting. When we underfit, we don't have a very good result on the train set, but we also don't have a very good result on the test set. Underfitting may indicate that we are not using enough data, or that we should try harder to do something useful with the data. Obviously, we want to avoid both scenarios. Machine learning errors are the sum of bias and variance. Variance measures how much the error varies. Imagine that we are trying to decide what to wear based on the weather outside. If you have grown up in a country with a tropical climate, you may be biased towards always wearing summer clothes.
If you lived in a cold country, you may decide to go to a beach in Bali wearing winter clothes. Both decisions have high bias. You may also just wear random clothes that are not appropriate for the weather outside – this is an outcome with high variance. It turns out that when we try to minimize bias or variance, we tend to increase the other – a concept known as the bias variance tradeoff. The expected error is given by the following equation:

Expected error = Bias^2 + Variance + Irreducible error

The last term is the irreducible error.

Cross-validation

Cross-validation is a technique which helps to avoid overfitting. Just as we have a separation of practice questions and exam questions, we also have training sets and test sets. It's important to keep the data separated and only use the test set in the final stage. There are many cross-validation schemes in use. The most basic setup splits the data at a specified percentage – for instance, 75% training data and 25% test data. We can also leave out a fixed number of observations in multiple rounds, so that these items are in the test set and the rest of the data is in the train set. For instance, we can apply leave-one-out cross-validation (LOOCV) and let each datum be in the test set once. For a large dataset, LOOCV requires as many rounds as there are data items, and can therefore be too slow.

k-fold cross-validation performs better than LOOCV and randomly splits the data into k (a positive integer) folds. One of the folds becomes the test set, and the rest of the data becomes the train set. We repeat this process k times, with each fold being the designated test set once. Finally, we average the k train and test errors for the purpose of evaluation. Common values for k are five and ten. The following table illustrates the setup for five folds:

Iteration   Fold 1   Fold 2   Fold 3   Fold 4   Fold 5
1           Test     Train    Train    Train    Train
2           Train    Test     Train    Train    Train
3           Train    Train    Test     Train    Train
4           Train    Train    Train    Test     Train
5           Train    Train    Train    Train    Test

We can also randomly split the data into train and test sets multiple times. The problem with this approach is that some items may never end up in the test set, while others may be selected for the test set multiple times. Nested cross-validation is a combination of cross-validations and consists of the following:

- The inner cross-validation does optimization to find the best fit, and can be implemented as k-fold cross-validation.
- The outer cross-validation is used to evaluate performance and do statistical analysis.

Regularization

Regularization, like cross-validation, is a general technique to fight overfitting. Regularization adds extra parameters to the error function we are trying to minimize, in order to penalize complex models. For instance, if we are trying to fit a curve with a high-order polynomial, we may use regularization to reduce the influence of the higher degrees, thereby effectively reducing the order of the polynomial. Simpler methods are to be preferred, at least according to the principle of Occam's razor. William Occam was a monk and philosopher who, around 1320, came up with the idea that the simplest hypothesis that fits the data should be preferred. One justification is that you can invent fewer simple models than complex models. For instance, intuitively we know that there are more high-order polynomial models than linear ones. The reason is that a line (f(x) = ax + b) is governed by only two numbers – the intercept b and the slope a. The possible coefficients for a line span a two-dimensional space.
A quadratic polynomial adds an extra coefficient for the quadratic term, and its coefficients span a three-dimensional space. It is therefore less likely that we find a linear model that fits than a more complex model, because the search space for linear models is much smaller (although it is still infinite). And of course, simpler models are just easier to use and require less computation time.

We can also stop a program early as a form of regularization. If we give a machine learner less time, it is more likely to produce a simpler model, which we hope is less likely to overfit. Of course, we have to do this in an ordered fashion, which means that the algorithm should be aware of the possibility of early termination.

Dimensions and features

We typically represent the data as a grid of numbers (a matrix). Each column represents a variable, which we call a feature in machine learning. In supervised learning, one of the variables is actually not a feature but the label that we are trying to predict, and each row is an example that we can use for training or testing. The number of features corresponds to the dimensionality of the data. Our machine learning approach depends on the number of dimensions versus the number of examples. For instance, text and image data are very high dimensional, while stock market data has relatively few dimensions. Fitting high-dimensional data is computationally expensive, and is also prone to overfitting due to the high complexity. Higher dimensions are also impossible to visualize, and therefore we can't use simple diagnostic methods.

Not all the features are useful, and they may only add randomness to our results. It is therefore often important to do good feature selection. Feature selection is a form of dimensionality reduction. Some machine learning algorithms are actually able to perform feature selection automatically. We should be careful not to omit features that do contain information. Some practitioners evaluate the correlation of a single feature and the target variable. In my opinion you shouldn't do that, because the correlation of one variable with the target doesn't in itself tell you much; instead, use an off-the-shelf algorithm and evaluate the results of different feature subsets.

In principle, feature selection boils down to multiple binary decisions about whether to include a feature or not. For n features we get 2^n feature sets, which can be a very large number for a large number of features. For example, for 10 features we have 1024 possible feature sets (for instance, if we are deciding what clothes to wear, the features can be temperature, rain, the weather forecast, where we are going, and so on). At a certain point, brute force evaluation becomes infeasible. We will discuss better methods in this book. Basically, we have two options: we either start with all the features and remove features iteratively, or we start with a minimum set of features and add features iteratively. We then take the best feature sets for each iteration and compare them.

Another common dimensionality reduction approach is to transform high-dimensional data into a lower-dimensional space. This transformation leads to information loss, but we can keep the loss to a minimum. We will cover this in more detail later on.

Preprocessing, exploration, and feature engineering

Data mining, a buzzword in the 1990s, is the predecessor of data science (the science of data).
One of the methodologies popular in the data mining community is called the Cross Industry Standard Process for Data Mining (CRISP-DM). CRISP-DM was created in 1996 and is still used today. I am not endorsing CRISP-DM, however I like its general framework. CRISP-DM consists of the following phases, which are not mutually exclusive and can occur in parallel:

Business understanding: This phase is often taken care of by specialized domain experts. Usually we have a business person formulate a business problem, such as selling more units of a certain product.
Data understanding: This is also a phase which may require input from domain experts; however, often a technical specialist needs to get involved more than in the business understanding phase. The domain expert may be proficient with spreadsheet programs, but have trouble with complicated data. In this book I usually call this phase exploration.
Data preparation: This is also a phase where a domain expert with only Excel know-how may not be able to help you. This is the phase where we create our training and test datasets. In this book I usually call this phase preprocessing.
Modeling: This is the phase which most people associate with machine learning. In this phase we formulate a model and fit our data.
Evaluation: In this phase, we evaluate our model and our data to check whether we were able to solve our business problem.
Deployment: This phase usually involves setting up the system in a production environment (it is considered good practice to have a separate production system). Typically this is done by a specialized team.

When we learn, we require high quality learning material. We can't learn from gibberish, so we automatically ignore anything that doesn't make sense. A machine learning system isn't able to recognize gibberish, so we need to help it by cleaning the input data. It is often claimed that cleaning the data forms a large part of machine learning. Sometimes cleaning is already done for us, but you shouldn't count on it. To decide how to clean the data, we need to be familiar with the data. There are some projects which try to automatically explore the data and do something intelligent, like producing a report. For now, unfortunately, we don't have a solid solution, so you need to do some manual work. We can do two things, which are not mutually exclusive – first, scan the data, and second, visualize the data. This also depends on the type of data we are dealing with – whether we have a grid of numbers, images, audio, text, or something else. In the end, a grid of numbers is the most convenient form, and we will always work towards having numerical features. We want to know whether features miss values, how the values are distributed, and what type of features we have. Values can approximately follow a normal distribution, a binomial distribution, a Poisson distribution, or another distribution altogether. Features can be binary – either yes or no, positive or negative, and so on. They can also be categorical – pertaining to a category, for instance continents (Africa, Asia, Europe, Latin America, North America, and so on). Categorical variables can also be ordered – for instance high, medium, and low. Features can also be quantitative – for example temperature in degrees or price in dollars.

Feature engineering is the process of creating or improving features. It's more of a dark art than a science. Features are often created based on common sense, domain knowledge, or prior experience.
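As a small illustration of what this can look like in practice (the column names and values below are made up for illustration and are not taken from any particular dataset), a couple of hand-crafted features could be derived with pandas:

import pandas as pd

# made-up raw data: one row per customer order
orders = pd.DataFrame({
    'order_date': pd.to_datetime(['2017-01-03', '2017-02-14', '2017-02-18']),
    'price': [20.0, 35.5, 12.0],
    'quantity': [2, 1, 4],
})

# features created from domain knowledge and common sense
orders['revenue'] = orders['price'] * orders['quantity']       # interaction of two columns
orders['order_month'] = orders['order_date'].dt.month          # possible seasonality signal
orders['is_weekend'] = orders['order_date'].dt.dayofweek >= 5  # binary feature
print(orders)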
There are certain common techniques for feature creation, however there is no guarantee that creating new features will improve your results. We are sometimes able to use the clusters found by unsupervised learning as extra features. Deep neural networks are often able to create features automatically. Missing values Quite often we miss values for certain features. This could happen for various reasons. It can be inconvenient, expensive or even impossible to always have a value. Maybe we were not able to measure a certain quantity in the past, because we didn't have the right equipment, or we just didn't know that the feature is relevant. However, we are stuck with missing values from the past. Sometimes it's easy to figure out that we miss values and we can discover this just by scanning the data, or counting the number of values we have for a feature and comparing to the number of values we expect based on the number of rows. Certain systems encode missing values with for example values as 999999. This makes sense if the valid values are much smaller than 999999. If you are lucky you will have information about the features provided by whoever created the data in the form of a data dictionary or metadata. Once we know that we miss values the question arises of how to deal with them. The simplest answer is to just ignore them. However, some algorithms can't deal with missing values, and the program will just refuse to continue. In other circumstances ignoring missing values will lead to inaccurate results. The second solution is to substitute missing values by a fixed value – this is called imputing. We can impute the arithmetic mean, median or mode of the valid values of a certain feature. Ideally, we will have a relation between features or within a variable that is somewhat reliable. For instance, we may know the seasonal averages of temperature for a certain location and be able to impute guesses for missing temperature values given a date. Label encoding Humans are able to deal with various types of values. Machine learning algorithms with some exceptions need numerical values. If we offer a string such as Ivan unless we are using specialized software the program will not know what to do. In this example we are dealing with a categorical feature – names probably. We can consider each unique value to be a label. (In this particular example we also need to decide what to do with the case – is Ivan the same as ivan). We can then replace each label by an integer – label encoding. This approach can be problematic, because the learner may conclude that there is an ordering. One hot encoding The one-of-K or one-hot-encoding scheme uses dummy variables to encode categorical features. Originally it was applied to digital circuits. The dummy variables have binary values like bits, so they take the values zero or one (equivalent to true or false). For instance, if we want to encode continents we will have dummy variables, such as is_asia, which will be true if the continent is Asia and false otherwise. In general, we need as many dummy variables, as there are unique labels minus one. We can determine one of the labels automatically from the dummy variables, because the dummy variables are exclusive. If the dummy variables all have a false value, then the correct label is the label for which we don't have a dummy variable. 
The following table illustrates the encoding for continents:

Continent     | Is_africa | Is_asia | Is_europe | Is_south_america | Is_north_america
Africa        | True      | False   | False     | False            | False
Asia          | False     | True    | False     | False            | False
Europe        | False     | False   | True      | False            | False
South America | False     | False   | False     | True             | False
North America | False     | False   | False     | False            | True
Other         | False     | False   | False     | False            | False

The encoding produces a matrix (a grid of numbers) with lots of zeroes (false values) and occasional ones (true values). This type of matrix is called a sparse matrix. The sparse matrix representation is handled well by the SciPy package, and shouldn't be an issue. We will discuss the SciPy package later in this article.

Scaling

Values of different features can differ by orders of magnitude. Sometimes this may mean that the larger values dominate the smaller values. This depends on the algorithm we are using. For certain algorithms to work properly we are required to scale the data. There are several common strategies that we can apply:

Standardization removes the mean of a feature and divides by the standard deviation. If the feature values are normally distributed, we will get a Gaussian which is centered around zero with a variance of one. If the feature values are not normally distributed, we can remove the median and divide by the interquartile range. The interquartile range is the range between the first and third quartile (or the 25th and 75th percentile).
Scaling features to a range is another common strategy – a frequent choice is the range between zero and one.

Polynomial features

If we have two features a and b, we can suspect that there is a polynomial relation, such as a² + ab + b². We can consider each term in the sum to be a feature – in this example we have three features. The product ab in the middle is called an interaction. An interaction doesn't have to be a product – although this is the most common choice, it can also be a sum, a difference, or a ratio. If we are using a ratio, then to avoid dividing by zero we should add a small constant to the divisor and dividend. The number of features and the order of the polynomial for a polynomial relation are not limited. However, if we follow Occam's razor, we should avoid higher order polynomials and interactions of many features. In practice, complex polynomial relations tend to be more difficult to compute and do not add much value, but if you really need better results they may be worth considering.

Power transformations

Power transforms are functions that we can use to transform numerical features into a more convenient form, for instance to conform better to a normal distribution. A very common transform for values which vary by orders of magnitude is to take the logarithm. The logarithm of zero and of negative values is not defined, so we may need to add a constant to all the values of the related feature before taking the logarithm. We can also take the square root for positive values, square the values, or compute any other power we like. Another useful transform is the Box-Cox transform, named after its creators. The Box-Cox transform attempts to find the best power needed to transform the original data into data that is closer to the normal distribution. The transform is defined as follows:

y(λ) = (y^λ - 1) / λ   if λ ≠ 0
y(λ) = ln(y)           if λ = 0

Binning

Sometimes it's useful to separate feature values into several bins. For example, we may only be interested in whether it rained on a particular day. Given the precipitation values, we can binarize them, so that we get a true value if the precipitation value is not zero, and a false value otherwise.
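As a quick sketch of this idea (the precipitation numbers below are invented for illustration), binarizing and binning a numerical feature could be done as follows:

import numpy as np
import pandas as pd

# invented daily precipitation values in millimetres
precipitation = pd.Series([0.0, 2.4, 0.0, 11.8, 5.1])

# binarize: did it rain at all on that day?
rained = precipitation > 0
print(rained.tolist())   # [False, True, False, True, True]

# bin into three categories using explicit thresholds
bins = pd.cut(precipitation, bins=[-0.1, 0.0, 5.0, np.inf],
              labels=['none', 'light', 'heavy'])
print(bins.tolist())     # ['none', 'light', 'none', 'heavy', 'heavy']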
We can also use statistics to divide values into high, low, and medium bins. The binning process inevitably leads to loss of information. However, depending on your goals this may not be an issue, and actually reduce the chance of overfitting. Certainly there will be improvements in speed and memory or storage requirements. Combining models In (high) school we sit together with other students, and learn together, but we are not supposed to work together during the exam. The reason is, of course, that teachers want to know what we have learned, and if we just copy exam answers from friends, we may have not learned anything. Later in life we discover that teamwork is important. For example, this book is the product of a whole team, or possibly a group of teams. Clearly a team can produce better results than a single person. However, this goes against Occam's razor, since a single person can come up with simpler theories compared to what a team will produce. In machine learning we nevertheless prefer to have our models cooperate with the following schemes: Bagging Boosting Stacking Blending Voting and averaging Bagging Bootstrap aggregating or bagging is an algorithm introduced by Leo Breiman in 1994, which applies bootstrapping to machine learning problems. Bootstrapping is a statistical procedure, which creates datasets from existing data by sampling with replacement. Bootstrapping can be used to analyze the possible values that arithmetic mean, variance or other quantity can assume. The algorithm aims to reduce the chance of overfitting with the following steps: We generate new training sets from input train data by sampling with replacement. Fit models to each generated training set. Combine the results of the models by averaging or majority voting. Boosting In the context of supervised learning we define weak learners as learners that are just a little better than a baseline such as randomly assigning classes or average values. Although weak learners are weak individually like ants, together they can do amazing things just like ants can. It makes sense to take into account the strength of each individual learner using weights. This general idea is called boosting. There are many boosting algorithms; boosting algorithms differ mostly in their weighting scheme. If you have studied for an exam, you may have applied a similar technique by identifying the type of practice questions you had trouble with and focusing on the hard problems. Face detection in images is based on a specialized framework, which also uses boosting. Detecting faces in images or videos is a supervised learning. We give the learner examples of regions containing faces. There is an imbalance, since we usually have far more regions (about ten thousand times more) that don't have faces. A cascade of classifiers progressively filters out negative image areas stage by stage. In each progressive stage the classifiers use progressively more features on fewer image windows. The idea is to spend the most time on image patches, which contain faces. In this context boosting is used to select features and combine results. Stacking Stacking takes the outputs of machine learning estimators and then uses those as inputs for another algorithm. You can, of course, feed the output of the higher-level algorithm to another predictor. It is possible to use any arbitrary topology, but for practical reasons you should try a simple setup first as also dictated by Occam's razor. 
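To make the stacking idea a bit more concrete, here is a minimal sketch using the StackingClassifier that newer versions of scikit-learn provide (the choice of dataset and base models here is arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# two base learners; their predictions become inputs for a logistic regression
stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier()),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))

Internally, the base estimators are evaluated with cross-validation, so the final estimator does not simply memorize their training predictions.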
Blending Blending was introduced by the winners of the one million dollar Netflix prize. Netflix organized a contest with the challenge of finding the best model to recommend movies to their users. Netflix users can rate a movie with a rating of one to five stars. Obviously each user wasn't able to rate each movie, so the user movie matrix is sparse. Netflix published an anonymized training and test set. Later researchers found a way to correlate the Netflix data to IMDB data. For privacy reasons the Netflix data is no longer available. The competition was won in 2008 by a group of teams combining their models. Blending is a form of stacking. The final estimator in blending however trains only on a small portion of the train data. Voting and averaging We can arrive at our final answer through majority voting or averaging. It's also possible to assign different weights to each model in the ensemble. For averaging we can also use the geometric mean or the harmonic mean instead of the arithmetic mean. Usually combining the results of models, which are highly correlated to each other doesn't lead to spectacular improvements. It's better to somehow diversify the models, by using different features or different algorithms. If we find that two models are strongly correlated, we may for example decide to remove one of them from the ensemble, and increase proportionally the weight of the other model. Installing software and setting up For most projects in this book we need scikit-learn (Refer, http://scikit-learn.org/stable/install.html) and matplotlib (Refer, http://matplotlib.org/users/installing.html). Both packages require NumPy, but we also need SciPy for sparse matrices as mentioned before. The scikit-learn library is a machine learning package, which is optimized for performance as a lot of the code runs almost as fast as equivalent C code. The same statement is true for NumPy and SciPy. There are various ways to speed up the code, however they are out of scope for this book, so if you want to know more, please consult the documentation. Matplotlib is a plotting and visualization package. We can also use the seaborn package for visualization. Seaborn uses matplotlib under the hood. There are several other Python visualization packages that cover different usage scenarios. Matplotlib and seaborn are mostly useful for the visualization for small to medium datasets. The NumPy package offers the ndarray class and various useful array functions. The ndarray class is an array, that can be one or multi-dimensional. This class also has several subclasses representing matrices, masked arrays, and heterogeneous record arrays. In machine learning we mainly use NumPy arrays to store feature vectors or matrices composed of feature vectors. SciPy uses NumPy arrays and offers a variety of scientific and mathematical functions. We also require the pandas library for data wrangling. In this book we will use Python 3. As you may know Python 2 will no longer be supported after 2020, so I strongly recommend switching to Python 3. If you are stuck with Python 2 you still should be able to modify the example code to work for you. In my opinion the Anaconda Python 3 distribution is the best option. Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager conda. The distribution includes more than 200 Python packages, which makes it very convenient. For casual users the Miniconda distribution may be the better choice. 
Miniconda contains the conda package manager and Python. The procedures to install Anaconda and Miniconda are similar. Obviously, Anaconda takes more disk space. Follow the instructions from the Anaconda website at http://conda.pydata.org/docs/install/quick.html. First, you have to download the appropriate installer for your operating system and Python version. Sometimes you can choose between a GUI and a command line installer. I used the Python 3 installer, although my system Python version is 2.7. This is possible since Anaconda comes with its own Python. On my machine the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. The Miniconda installer installs a miniconda directory in your home directory. Installation instructions for NumPy are at http://docs.scipy.org/doc/numpy/user/install.html. Alternatively install NumPy with pip as follows: $ [sudo] pip install numpy The command for Anaconda users is: $ conda install numpy To install the other dependencies substitute NumPy by the appropriate package. Please read the documentation carefully, not all options work equally well for each operating system. The pandas installation documentation is at http://pandas.pydata.org/pandas-docs/dev/install.html. Troubleshooting and asking for help Currently the best forum is at http://stackoverflow.com. You can also reach out on mailing lists or IRC chat. The following is a list of mailing lists: Scikit-learn: https://lists.sourceforge.net/lists/listinfo/scikit-learn-general. NumPy and Scipy mailing list: https://www.scipy.org/scipylib/mailing-lists.html. IRC channels: #scikit-learn @ freenode #scipy @ freenode Summary In this article we covered the basic concepts of machine learning, a high level overview, generalizing with data, overfitting, dimensions and features, preprocessing, combining models, installing the required software and some places where you can ask for help. Resources for Article: Further resources on this subject: Machine learning and Python – the Dream Team Machine Learning in IPython with scikit-learn How to do Machine Learning with Python

Inbuilt Data Types in Python

Packt
22 Jun 2017
4 min read
This article by Benjamin Baka, author of the book Python Data Structures and Algorithms, explains the inbuilt data types in Python. Python data types can be divided into three categories: numeric, sequence, and mapping. There is also the None object that represents a Null, or absence of a value. It should not be forgotten either that other objects such as classes, files, and exceptions can also properly be considered types; however, they will not be considered here. (For more resources related to this topic, see here.)

Every value in Python has a data type. Unlike many programming languages, in Python you do not need to explicitly declare the type of a variable. Python keeps track of object types internally. Python inbuilt data types are outlined in the following table:

Category  | Name      | Description
None      | None      | The null object
Numeric   | int       | Integer
          | float     | Floating point number
          | complex   | Complex number
          | bool      | Boolean (True, False)
Sequences | str       | String of characters
          | list      | List of arbitrary objects
          | tuple     | Group of arbitrary items
          | range     | Creates a range of integers
Mapping   | dict      | Dictionary of key–value pairs
          | set       | Mutable, unordered collection of unique items
          | frozenset | Immutable set

None type

The None type is immutable and has one value, None. It is used to represent the absence of a value. It is returned by objects that do not explicitly return a value and evaluates to False in Boolean expressions. It is often used as the default value in optional arguments to allow the function to detect whether the caller has passed a value.

Numeric Types

All numeric types, apart from bool, are signed and they are all immutable. Booleans have two possible values, True and False. These values are mapped to 1 and 0 respectively. The integer type, int, represents whole numbers of unlimited range. Floating point numbers are represented by the native double precision floating point representation of the machine. Complex numbers are represented by two floating point numbers. They are written with a j suffix to signify the imaginary part of the complex number. For example:

a = 2+3j

We can access the real and imaginary parts with a.real and a.imag respectively.

Representation error

It should be noted that the native double precision representation of floating point numbers leads to some unexpected results. For example, consider the following:

In [14]: 1 - 0.9
Out[14]: 0.09999999999999998
In [15]: 1 - 0.9 == 0.1
Out[15]: False

This is a result of the fact that most decimal fractions are not exactly representable as a binary fraction, which is how most underlying hardware represents floating point numbers. For algorithms or applications where this may be an issue, Python provides a decimal module. This module allows for the exact representation of decimal numbers and facilitates greater control over properties such as rounding behaviour, number of significant digits, and precision. It defines two objects: a Decimal type, representing decimal numbers, and a Context type, representing various computational parameters such as precision, rounding, and error handling. An example of its usage can be seen in the following:

In [1]: import decimal
In [2]: x = decimal.Decimal(3.14); y = decimal.Decimal(2.74)
In [3]: x*y
Out[3]: Decimal('8.60360000000001010036498883')
In [4]: decimal.getcontext().prec = 4
In [5]: x * y
Out[5]: Decimal('8.604')

Here we have created a global context and set the precision to 4.
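Rounding can also be controlled explicitly for a single operation with the quantize() method. Continuing the session above (the values here are arbitrary and only for illustration):

In [6]: decimal.Decimal('7.325').quantize(decimal.Decimal('0.01'), rounding=decimal.ROUND_HALF_UP)
Out[6]: Decimal('7.33')

Here quantize() rounds the value to two decimal places using the ROUND_HALF_UP rule; other rounding modes, such as decimal.ROUND_HALF_EVEN, can be passed in the same way.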
The Decimal object can be treated pretty much as you would treat an int or a float. They are subject to all the same mathematical operations and can be used as dictionary keys, placed in sets, and so on. In addition, Decimal objects also have several methods for mathematical operations, such as natural exponents, x.exp(), natural logarithms, x.ln(), and base 10 logarithms, x.log10().

Python also has a fractions module that implements a rational number type. The following shows several ways to create fractions:

In [62]: import fractions
In [63]: fractions.Fraction(3, 4)    # creates the fraction 3/4
Out[63]: Fraction(3, 4)
In [64]: fractions.Fraction(0.5)     # creates a fraction from a float
Out[64]: Fraction(1, 2)
In [65]: fractions.Fraction(".25")   # creates a fraction from a string
Out[65]: Fraction(1, 4)

It is also worth mentioning here the NumPy extension. This has types for mathematical objects such as arrays, vectors, and matrices, and capabilities for linear algebra, calculation of Fourier transforms, eigenvectors, logical operations, and much more.

Summary

We have looked at the built-in data types and some internal Python modules, most notably the collections module. There are also a number of external libraries, such as the SciPy stack.

Resources for Article:

Further resources on this subject:

Python Data Structures [article]
Getting Started with Python Packages [article]
An Introduction to Python Lists and Dictionaries [article]

Scraping a Web Page

Packt
20 Jun 2017
11 min read
In this article by Katharine Jarmul, author of the book Python Web Scraping - Second Edition, we look at an example: suppose I have a shop selling shoes and want to keep track of my competitor's prices. I could go to my competitor's website each day and compare each shoe's price with my own; however, this will take a lot of time and will not scale well if I sell thousands of shoes or need to check price changes frequently. Or maybe I just want to buy a shoe when it's on sale. I could come back and check the shoe website each day until I get lucky, but the shoe I want might not be on sale for months. These repetitive manual processes could instead be replaced with an automated solution using the web scraping techniques covered in this book. In an ideal world, web scraping wouldn't be necessary and each website would provide an API to share the data in a structured format. Indeed, some websites do provide APIs, but they typically restrict the data that is available and how frequently it can be accessed. Additionally, a website developer might change, remove, or restrict the backend API. In short, we cannot rely on APIs to access the online data we may want, and therefore we need to learn about web scraping techniques. (For more resources related to this topic, see here.)

Three approaches to scrape a web page

Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module.

Regular expressions

If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/3/howto/regex.html. Even if you use regular expressions (or regex) with another programming language, I recommend stepping through it for a refresher on regex with Python. To scrape the country area using regular expressions, we will first try matching the contents of the <td> element, as follows:

>>> import re
>>> from advanced_link_crawler import download
>>> url = 'http://example.webscraping.com/view/UnitedKingdom-239'
>>> html = download(url)
>>> re.findall(r'<td class="w2p_fw">(.*?)</td>', html)
['<img src="/places/static/images/flags/gb.png" />', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', 'EU', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2} [A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2}) |([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', 'IE ']

This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes. If we simply wanted to scrape the country area, we can select the second matching element, as follows:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
'244,820 square kilometres'

This solution works but could easily fail if the web page is updated. Consider if this table is changed and the area is no longer in the second matching element. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data at some point, we want our solution to be as robust against layout changes as possible. To make this regular expression more specific, we can include the parent <tr> element, which has an ID, so it ought to be unique:

>>> re.findall('<tr id="places_area__row"><td class="w2p_fl">Area: </td><td class="w2p_fw">(.*?)</td>', html)
['244,820 square kilometres']

This iteration is better; however, there are many other ways the web page could be updated in a way that still breaks the regular expression.
For example, double quotation marks might be changed to single, extra spaces could be added between the tags, or the area_label could be changed. Here is an improved version to try and support these various possibilities: >>> re.findall('''.*?<tds*class=["']w2p_fw["']>(.*?) ''', html) ['244,820 square kilometres'] This regular expression is more future-proof but is difficult to construct, and quite unreadable. Also, there are still plenty of other minor layout changes that would break it, such as if a title attribute was added to the <td> tag or if the tr or td elements changed their CSS classes or IDs. From this example, it is clear that regular expressions provide a quick way to scrape data but are too brittle and easily break when a web page is updated. Fortunately, there are better data extraction solutions such as. Beautiful Soup Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content. If you do not already have this module, the latest version can be installed using this command: pip install beautifulsoup4 The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Many web pages do not contain perfectly valid HTML and Beautiful Soup needs to correct improper open and close tags. For example, consider this simple web page containing a list with missing attribute quotes and closing tags: <ul class=country> <li>Area <li>Population </ul> If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this: >>> from bs4 import BeautifulSoup >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> # parse the HTML >>> soup = BeautifulSoup(broken_html, 'html.parser') >>> fixed_html = soup.prettify() >>> print(fixed_html) <ul class="country"> <li> Area <li> Population </li> </li> </ul> We can see that using the default html.parser did not result in properly parsed HTML. We can see from the previous snippet that it has used nested li elements, which might make it difficult to navigate. Luckily there are more options for parsers. We can install LXML or we can also use html5lib. To install html5lib, simply use pip: pip install html5lib Now, we can repeat this code, changing only the parser like so: >>> soup = BeautifulSoup(broken_html, 'html5lib') >>> fixed_html = soup.prettify() >>> print(fixed_html) <html> <head> </head> <body> <ul class="country"> <li> Area </li> <li> Population </li> </ul> </body> </html>  Here, BeautifulSoup using html5lib was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. You should see similar results if you used lxml. Now, we can navigate to the elements we want using the find() and find_all() methods: >>> ul = soup.find('ul', attrs={'class':'country'}) >>> ul.find('li') # returns just the first match <li>Area</li> >>> ul.find_all('li') # returns all matches [<li>Area</li>, <li>Population</li>] For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/. 
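Beautiful Soup also supports CSS selectors directly through its select() and select_one() methods, which can be a compact alternative to find() and find_all(). A brief sketch using the same broken list markup as above:

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> soup = BeautifulSoup(broken_html, 'html5lib')
>>> soup.select('ul.country > li')            # CSS selector syntax
[<li>Area</li>, <li>Population</li>]
>>> soup.select_one('ul.country > li').text   # only the first match
'Area'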
Now, using these techniques, here is a full example to extract the country area from our example website: >>> from bs4 import BeautifulSoup >>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239' >>> html = download(url) >>> soup = BeautifulSoup(html) >>> # locate the area row >>> tr = soup.find(attrs={'id':'places_area__row'}) >>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the data element >>> area = td.text # extract the text from the data element >>> print(area) 244,820 square kilometres This code is more verbose than regular expressions but easier to construct and understand. Also, we no longer need to worry about problems in minor layout changes, such as extra whitespace or tag attributes. We also know if the page contains broken HTML that BeautifulSoup can help clean the page and allow us to extract data from very broken website code. Lxml Lxml is a Python library built on top of the libxml2 XML parsing library written in C, which helps make it faster than Beautiful Soup but also harder to install on some computers, specifically Windows. The latest installation instructions are available at http://lxml.de/installation.html. If you run into difficulties installing the library on your own, you can also use Anaconda to do so:  https://anaconda.org/anaconda/lxml. If you are unfamiliar with Anaconda, it is a package and environment manager primarily focused on open data science packages built by the folks at Continuum Analytics. You can download and install Anaconda by following their setup instructions here: https://www.continuum.io/downloads. Note that using the Anaconda quick install will set your PYTHON_PATH to the Conda installation of Python. As with Beautiful Soup, the first step when using lxml is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML: >>> from lxml.html import fromstring, tostring >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> tree = fromstring(broken_html) # parse the HTML >>> fixed_html = tostring(tree, pretty_print=True) >>> print(fixed_html) <ul class="country"> <li>Area</li> <li>Population</li> </ul> As with BeautifulSoup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags. These are not requirements for standard XML and so are unnecessary for lxml to insert. After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup. Instead, we will use CSS selectors here, because they are more compact and can be reused later when parsing dynamic content. Some readers will already be familiar with them from their experience with jQuery selectors or use in front-end web application development. We will compare performance of these selectors with XPath. To use CSS selectors, you might need to install the cssselect library like so: pip install cssselect Now we can use the lxml CSS selectors to extract the area data from the example page: >>> tree = fromstring(html) >>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0] >>> area = td.text_content() >>> print(area) 244,820 square kilometres By using the cssselect method on our tree, we can utilize CSS syntax to select a table row element with the places_area__row ID, and then the child table data tag with the w2p_fw class. 
Since cssselect returns a list, we then index the first result and call the text_content method, which will iterate over all child elements and return concatenated text of each element. In this case, we only have one element, but this functionality is useful to know for more complex extraction examples. Summary We have walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and BeautifulSoup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples. Resources for Article: Further resources on this subject: Web scraping with Python (Part 2) [article] Scraping the Web with Python - Quick Start [article] Scraping the Data [article]

Analyzing Social Networks with Facebook

Packt
20 Jun 2017
15 min read
In this article by Raghav Bali, Dipanjan Sarkar and Tushar Sharma, the authors of the book Learning Social Media Analytics with R, we got a good flavor of the various aspects related to the most popular social micro-blogging platform, Twitter. In this article, we will look more closely at the most popular social networking platform, Facebook. With more than 1.8 billion monthly active users, over 18 billion dollars annual revenue and record breaking acquisitions for popular products including Oculus, WhatsApp and Instagram have truly made Facebook the core of the social media network today. (For more resources related to this topic, see here.) Before we put Facebook data under the microscope, let us briefly look at Facebook’s interesting origins! Like many popular products, businesses and organizations, Facebook too had a humble beginning. Originally starting off as Mark Zuckerberg’s brainchild in 2004, it was initially known as “Thefacebook” located at thefacebook.com, which was branded as an online social network, connecting university and college students. While this social network was only open to Harvard students in the beginning, it soon expanded within a month by including students from other popular universities. In 2005, the domain facebook.com was finally purchased and “Facebook” extended its membership to employees of companies and organizations for the first time. Finally in 2006, Facebook was finally opened to everyone above 13 years of age and having a valid email address. The following snapshot shows us how the look and feel of the Facebook platform has evolved over the years! Facebook’s evolving look over time While Facebook has a primary website, also known as a web application, it has also launched mobile applications for the major operating systems on handheld devices. In short, Facebook is not just a social network website but an entire platform including a huge social network of connected people and organizations through friends, followers and pages. We will leverage Facebook’s social “Graph API” to access actual Facebook data to perform various analyses. Users, brands, business, news channels, media houses, retail stores and many more are using Facebook actively on a daily basis for producing and consuming content. This generates vast amount of data and a substantial amount of this is available to users through its APIs.  From a social media analytics perspective, this is really exciting because this treasure trove of data with easy to access APIs and powerful open source libraries from R, gives us enormous potential and opportunity to get valuable information from analyzing this data in various ways. We will follow a structured path in this article and cover the following major topics sequentially to ensure that you do not get overwhelmed with too much content at once. Accessing Facebook data Analyzing your personal social network Analyzing an English football social network Analyzing English football clubs’ brand page engagements We will use libraries like Rfacebook, igraph and ggplot2 to retrieve, analyze and visualize data from Facebook. All the following sections of the book assume that you have a Facebook account which is necessary to access data from the APIs and analyze it. In case you do not have an account, do not despair. You can use the data and code files for this article to follow along with the hands-on examples to gain a better understanding of the concepts of social network and engagement analysis.    
Accessing Facebook data You will find a lot of content in several books and on the web about various techniques to access and retrieve data from Facebook. There are several official ways of doing this which include using the Facebook Graph API either directly through low level HTTP based calls or indirectly through higher level abstract interfaces belonging to libraries like Rfacebook. Some alternate ways of retrieving Facebook data would be to use registered applications on Facebook like Netvizz or the GetNet application built by Lada Adamic, used in her very popular “Social Network Analysis” course (Unfortunately http://snacourse.com/getnet is not working since Facebook completely changed its API access permissions and privacy settings). Unofficial ways include techniques like web scraping and crawling to extract data. Do note though that Facebook considers this to be a violation of its terms and conditions of accessing data and you should try and avoid crawling Facebook for data especially if you plan to use it for commercial purposes. In this section, we will take a closer look at the Graph API and the Rfacebook package in R. The main focus will be on how you can extract data from Facebook using both of them. Understanding the Graph API To start using the Graph API, you would need to have an account on Facebook to be able to use the API. You can access the API in various ways. You can create an application on Facebook by going to https://developers.facebook.com/apps/ and then create a long-lived OAuth access token using the fbOAuth(…)function from the Rfacebook package. This enables R to make calls to the Graph API and you can also store this token on the disk and load it for future use. An easier way is to create a short-lived token which would let you access the API data for about two hours by going to the Facebook Graph API Explorer page which is available at https://developers.facebook.com/tools/explorer and get a temporary access token from there. The following snapshot depicts how to get an access token for the Graph API from Facebook. Facebook’s Graph API explorer On clicking “Get User Access Token” in the above snapshot, it will present a list of checkboxes with various permissions which you might need for accessing data including user data permissions, events, groups and pages and other miscellaneous permissions. You can select the ones you need and click on the “Get Access Token” button in the prompt. This will generate a new access token the field depicted in the above snapshot and you can directly copy and use it to retrieve data in R. Before going into that, we will take a closer look at the Graph API explorer which directly allows you to access the API from your web browser itself and helps if you want to do some quick exploratory analysis. A part of it is depicted in the above snapshot. The current version of the API when writing this book is v2.8 which you can see in the snapshot beside the GET resource call. Interestingly, the Graph API is so named because Facebook by itself can be considered as a huge social graph where all the information can be classified into the following three categories. Nodes: These are basically users, pages, photos and so on. Nodes indicate a focal point of interest which is connected to other points. Edges: These connect various nodes together forming the core social graph and these connections are based on various relations like friends, followers and so on. 
Fields: These are specific attributes or properties about nodes, an example would be a user’s address, birthday, name and so on. Like we mentioned before, the API is HTTP based and you can make HTTPGET requests to nodes or edges and all requests are passed to graph.facebook.com to get data. Each node usually has a specific identifier and you can use it for querying information about a node as depicted in the following snippet. GET graph.facebook.com /{node-id} You can also use edge names in addition to the identifier to get information about the edges of the node. The following snippet depicts how you can do the same. GET graph.facebook.com /{node-id}/{edge-name} The following snapshot shows us how we can get information about our own profile. Querying your details in the Graph API explorer Now suppose, I wanted to retrieve information about a Facebook page,“Premier League” which represents the top tier competition in English Football using its identifier and also take a look at its liked pages. I can do the same using the following request. Querying information about a Facebook Page using the Graph API explorer Thus from the above figure, you can clearly see the node identifier, page name and likes for the page, “Premier League”. It must be clear by now that all API responses are returned in the very popular JSON format which is easy to parse and format as needed for analysis. Besides this, there also used to be another way of querying the social graph in Facebook, which was known as FQL or Facebook Query Language, an SQL like interface for querying and retrieving data. Unfortunately, Facebook seems to have deprecated its use and hence covering it would be out of our present scope. Now that you have a firm grasp on the syntax of the Graph API and have also seen a few examples of how to retrieve data from Facebook, we will take a closer look at the Rfacebook package. Understanding Rfacebook Since we will be accessing and analyzing data from Facebook using R, it makes sense to have some robust mechanism to directly query Facebook and retrieve data instead of going to the browser every time like we did in the earlier section. Fortunately, there is an excellent package in R called Rfacebook which has been developed by Pablo Barberá. You can either install it from CRAN or get its most updated version from GitHub. The following snippet depicts how you can do the same. Remember you might need to install the devtools package if you don’t have it already, to download and install the latest version of the Rfacebook package from GitHub. install.packages("Rfacebook") # install from CRAN # install from GitHub library(devtools) install_github("pablobarbera/Rfacebook/Rfacebook") Once you install the package, you can load up the package using load(Rfacebook) and start using it to retrieve data from Facebook by using the access token you generated earlier. The following snippet shows us how you can access your own details like we had mentioned in the previous section, but this time by using R. > token = 'XXXXXX' > me <- getUsers("me", token=token) > me$name [1] "Dipanjan Sarkar" > me$id [1] "1026544" The beauty of this package is that you directly get the results in curated and neatly formatted data frames and you do not need to spend extra time trying to parse the raw JSON response objects from the Graph API. The package is well documented and has high level functions for accessing personal profile data on Facebook as well as page and group level data points. 
We will now take a quick look at Netvizz a Facebook application, which can also be used to extract data easily from Facebook. Understanding Netvizz The Netvizz application was developed by Bernhard Rieder and is a tool which can be used to extract data from Facebook pages, groups, get statistics about links and also extract social networks from Facebook pages based on liked pages from each connected page in the network. You can access Netvizz at https://apps.facebook.com/netvizz/ and on registering the application on your profile, you will be able to see the following screen. The Netvizz application interface From the above app snapshot, you can see that there are various links based on the type of operation you want to execute to extract data. Feel free to play around with this tool and we will be using its “page like network” capability later on in one of our analyses in a future section. Data Access Challenges There are several challenges with regards to accessing data from Facebook. Some of the major issues and caveats have been mentioned in the following points: Facebook will keep evolving and updating its data access APIs and this can and will lead to changes and deprecation of older APIs and access patterns just like FQL was deprecated. Scope of data available keeps changing with time and evolving of Facebook’s API and privacy settings. For instance we can no longer get details of all our friends from the API any longer. Libraries and Tools built on top of the API can tend to break with changes to Facebook’s APIs and this has happened before with Rfacebook as well as Netvizz. Besides this, Lada Adamic’s GetNet application has stopped working permanently ever since Facebook changed the way apps are created and the permissions they require. You can get more information about it here http://thepoliticsofsystems.net/2015/01/the-end-of-netvizz/ Thus what was used in the book today for data retrieval might not be working completely tomorrow if there are any changes in the APIs though it is expected it will be working fine for at least the next couple of years. However to prevent any hindrance on analyzing Facebook data, we have provided the datasets we used in most of our analyses except personal networks so that you can still follow along with each example and use-case. Personal names have been anonymized wherever possible to protect their privacy. Now that we have a good idea about Facebook’s Graph API and how to access data, let’s analyze some social networks! Analyzing your personal social network Like we had mentioned before, Facebook by itself is a massive social graph, connecting billions of users, brands and organization. Consider your own Facebook account if you have one. You will have several friends which are your immediate connections, they in turn will be having their own set of friends including you and you might be friends with some of them and so on. Thus you and your friends form the nodes of the network and edges determine the connections. In this section we will analyze a small network of you and your immediate friends and also look at how we can extract and analyze some properties from the network. Before we jump into our analysis, we will start by loading the necessary packages needed which are mentioned in the following snippet and storing the Facebook Graph API access token in a variable. 
library(Rfacebook) library(gridExtra) library(dplyr) # get the Graph API access token token = ‘XXXXXXXXXXX’ You can refer to the file fb_personal_network_analysis.R for code snippets used in the examples depicted in this section. Basic descriptive statistics In this section, we will try to get some basic information and descriptive statistics on the same from our personal social network on Facebook. To start with let us look at some details of our own profile on Facebook using the following code. # get my personal information me <- getUsers("me", token=token, private_info = TRUE) > View(me[c('name', 'id', 'gender', 'birthday')]) This shows us a few fields from the data frame containing our personal details retrieved from Facebook. We use the View function which basically invokes a spreadsheet-style data viewer on R objects like data frames. Now, let us get information about our friends in our personal network. Do note that Facebook currently only lets you access information about those friends who have allowed access to the Graph API and hence you may not be able to get information pertaining to all friends in your friend list. We have anonymized their names below for privacy reasons. anonymous_names <- c('Johnny Juel', 'Houston Tancredi',..., 'Julius Henrichs', 'Yong Sprayberry') # getting friends information friends <- getFriends(token, simplify=TRUE) friends$name <- anonymous_names # view top few rows > View(head(friends)) This gives us a peek at some people from our list of friends which we just retrieved from Facebook. Let’s now analyze some descriptive statistics based on personal information regarding our friends like where they are from, their gender and so on. # get personal information friends_info <- getUsers(friends$id, token, private_info = TRUE) # get the gender of your friends >View(table(friends_info$gender)) This gives us the gender of my friends, looks like more male friends have authorized access to the Graph API in my network! # get the location of your friends >View(table(friends_info$location)) This depicts the location of my friends (wherever available) in the following data frame. # get relationship status of your friends > View(table(friends_info$relationship_status)) From the statistics in the following table I can see that a lot of my friends have gotten married over the past couple of years. Boy that does make me feel old! Suppose I want to look at the relationship status of my friends grouped by gender, we can do the same using the following snippet. # get relationship status of friends grouped by gender View(table(friends_info$relationship_status, friends_info$gender)) The following table gives us the desired results and you can see the distribution of friends by their gender and relationship status. Summary This article has been proven very beneficial to know some basic analytics of social networks with the help of R. Moreover, you will also get to know the information regarding the packages that R use. Resources for Article: Further resources on this subject: How to integrate social media with your WordPress website [article] Social Media Insight Using Naive Bayes [article] Social Media in Magento [article]

Grouping Sets in Advanced SQL

Packt
20 Jun 2017
6 min read
In this article by Hans JurgenSchonig, the author of the book Mastering PostgreSQL 9.6, we will learn about advanced SQL. Introducing grouping sets Every advanced user of SQL should be familiar with GROUP BY and HAVING clauses. But are you also aware of CUBE, ROLLUP, and GROUPING SETS? If not this articlemight be worth reading for you. Loading some sample data To make this article a pleasant experience for you, I have compiled some sample data, which has been taken from the BP energy report at http://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy.html. Here is the data structure, which will be used: test=# CREATE TABLE t_oil ( region text, country text, year int, production int, consumption int ); CREATE TABLE The test data can be downloaded from our website using curl directly: test=# COPY t_oil FROM PROGRAM 'curl www.cybertec.at/secret/oil_ext.txt'; COPY 644 On some operating systems curl is not there by default or has not been installed so downloading the file before might be the easier option for many people. All together there is data for 14 nations between 1965 and 2010, which are in two regions of the world: test=# SELECT region, avg(production) FROM t_oil GROUP BY region; region | avg ---------------+--------------------- Middle East | 1992.6036866359447005 North America | 4541.3623188405797101 (2 rows) Applying grouping sets The GROUP BY clause will turn many rows into one row per group. However, if you do reporting in real life, you might also be interested in the overall average. One additional line might be needed. Here is how this can be achieved: test=# SELECT region, avg(production) FROM t_oil GROUP BY ROLLUP (region); region | avg ---------------+----------------------- Middle East | 1992.6036866359447005 North America | 4541.3623188405797101 | 2607.5139860139860140 (3 rows) The ROLLUP clause will inject an additional line, which will contain the overall average. If you do reporting it is highly likely that a summary line will be needed. Instead of running two queries, PostgreSQL can provide the data running just a single query. Of course this kind of operation can also be used if you are grouping by more than just one column: test=# SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY ROLLUP (region, country); region | country | avg ---------------+---------+----------------------- Middle East | Iran | 3631.6956521739130435 Middle East | Oman | 586.4545454545454545 Middle East | | 2142.9111111111111111 North America | Canada | 2123.2173913043478261 North America | USA | 9141.3478260869565217 North America | | 5632.2826086956521739 | | 3906.7692307692307692 (7 rows) In this example, PostgreSQL will inject three lines into the result set. One line will be injected for Middle East, one for North America. On top of that we will get a line for the overall averages. If you are building a web application the current result is ideal because you can easily build a GUI to drill into the result set by filtering out the NULL values. The ROLLUPclause is nice in case you instantly want to display a result. I always used it to display final results to end users. However, if you are doing reporting, you might want to pre-calculate more data to ensure more flexibility. 
The CUBEkeyword is what you might have been looking for: test=# SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY CUBE (region, country); region | country | avg ---------------+---------+----------------------- Middle East | Iran | 3631.6956521739130435 Middle East | Oman | 586.4545454545454545 Middle East | | 2142.9111111111111111 North America | Canada | 2123.2173913043478261 North America | USA | 9141.3478260869565217 North America | | 5632.2826086956521739 | | 3906.7692307692307692 | Canada | 2123.2173913043478261 | Iran | 3631.6956521739130435 | Oman | 586.4545454545454545 | USA | 9141.3478260869565217 (11 rows) Note that even more rows have been added to the result. The CUBEwill create the same data as: GROUP BY region, country + GROUP BY region + GROUP BY country + the overall average. So the whole idea is to extract many results and various levels of aggregation at once. The resulting cube contains all possible combinations of groups. The ROLLUP and CUBE are really just convenience features on top of GROUP SETS. With the GROUPING SETS clause you can explicitly list the aggregates you want: test=# SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY GROUPING SETS ( (), region, country); region | country | avg ---------------+---------+----------------------- Middle East | | 2142.9111111111111111 North America | | 5632.2826086956521739 | | 3906.7692307692307692 | Canada | 2123.2173913043478261 | Iran | 3631.6956521739130435 | Oman | 586.4545454545454545 | USA | 9141.3478260869565217 (7 rows) In this I went for three grouping sets: The overall average, GROUP BY region and GROUP BY country. In case you want region and country combined, use (region, country). Investigating performance Grouping sets are a powerful feature, which help to reduce the number of expensive queries. Internally,PostgreSQL will basically turn to traditional GroupAggregates to make things work. A GroupAggregate node requires sorted data so be prepared that PostgreSQL might do a lot of temporary sorting: test=# explain SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY GROUPING SETS ( (), region, country); QUERY PLAN --------------------------------------------------------------- GroupAggregate (cost=22.58..32.69 rows=34 width=52) Group Key: region Group Key: () Sort Key: country Group Key: country -> Sort (cost=22.58..23.04 rows=184 width=24) Sort Key: region ->Seq Scan on t_oil (cost=0.00..15.66 rows=184 width=24) Filter: (country = ANY ('{USA,Canada,Iran,Oman}'::text[])) (9 rows) Hash aggregates are only supported for normal GROUP BY clauses involving no grouping sets. According to the developer of grouping sets (AtriShama), adding support for hashes is not worth the effort so it seems PostgreSQL already has an efficient implementation even if the optimizer has fewer choices than it has with normal GROUP BY statements. Combining grouping sets with the FILTER clause In real world applications grouping sets can often be combined with so called FILTER clauses. The idea behind FILTER is to be able to run partial aggregates. 
Here is an example:

test=# SELECT region,
              avg(production) AS all,
              avg(production) FILTER (WHERE year < 1990) AS old,
              avg(production) FILTER (WHERE year >= 1990) AS new
       FROM t_oil
       GROUP BY ROLLUP (region);
    region     |      all       |      old       |      new
---------------+----------------+----------------+----------------
 Middle East   | 1992.603686635 | 1747.325892857 | 2254.233333333
 North America | 4541.362318840 | 4471.653333333 | 4624.349206349
               | 2607.513986013 | 2430.685618729 | 2801.183150183
(3 rows)

The idea here is that not all columns will use the same data for aggregation. The FILTER clauses allow you to selectively pass data to those aggregates. In my example, the second aggregate will only consider data before 1990, while the third aggregate will take care of the more recent data. If it is possible to move a condition to the WHERE clause, that is always preferable, as less data has to be fetched from the table. The FILTER clause is only useful if the data left by the WHERE clause is not needed by every aggregate. FILTER works for all kinds of aggregates and offers a simple way to pivot your data.

Summary

We have learned about the advanced grouping features provided by SQL. On top of simple aggregates, PostgreSQL provides grouping sets to create custom aggregates.

Resources for Article:
Further resources on this subject:
PostgreSQL in Action [article]
PostgreSQL as an Extensible RDBMS [article]
Recovery in PostgreSQL 9 [article]
Lambda Architecture Pattern

Packt
19 Jun 2017
8 min read
In this article by Tomcy John and Pankaj Misra, the authors of the book, Data Lake For Enterprises, we will learn about how the data in landscape of big data solutions can be made in near real time and certain practices that can be adopted for realizing Lambda Architecture in context of data lake. (For more resources related to this topic, see here.) The concept of a data lake in an enterprise was driven by certain challenges that Enterprises were facing with the way the data was handled, processed, and stored. Initially all the individual applications in the Enterprise, via a natural evolution cycle, started maintaining huge amounts of data into themselves with almost no reuse to other applications in the same enterprise. These created information silos across arious applications. As the next step of evolution, these individual applications started exposing this data across the organization as a data mart access layer over central data warehouse. While data mart solved one part of the problem, other problems still persisted. These problems were more about data governance, data ownership, data accessibility which were required to be resolved so as to have better availability of enterprise relevant data. This is where a need was felt to have data lakes, that could not only make such data available but also could store any form of data and process it so that data is analyzed and kept ready for consumption by consumer applications. In this article, we will look at some of the critical aspects of a data lake and understand why does it matter for an enterprise. If we need to define the term Data Lake, it can be defined as a vast repository of variety of enterprise wide raw information that can be acquired, processed, analyzed and delivered. The information thus handled could be any type of information ranging from structured, semi-structured data to completely unstructured data. Data Lake is expected to be able to derive Enterprise relevant meaning and insights from this information using various analysis and machine learning algorithms. Lambda architecture and data lake Lambda architecture as a pattern provides ways and means to perform highly scalable, performant, distributed computing on large sets of data and yet provide consistent (eventually) data with required processing both in batch as well as in near real time. Lambda architecture defines ways and means to enable scale out architecture across various data load profiles in an enterprise, with low latency expectations. The architecture pattern became significant with the emergence of big data and enterprise’s focus on real-time analytics and digital transformation. The pattern named Lambda (symbol λ) is indicative of a way by which data comes from two places (batch and speed - the curved parts of the lambda symbol) which then combines and served through the serving layer (the line merging from the curved part). Figure 01 : Lambda Symbol  The main layers constituting the Lambda layer are shown below: Figyure 02 : Components of Lambda Architecure In the above high level representation, data is fed to both the batch and speed layer. The batch layer keeps producing and re-computing views at every set batch interval. The speed layer also creates the relevant real-time/speed views. The serving layer orchestrates the query by querying both the batch and speed layer, merges it and sends the result back as results. A practical realization of such a data lake can be illustrated as shown below. 
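Before looking at a concrete realization, the serving layer's core job, answering a query by merging a precomputed batch view with a small real-time view, can be sketched in a few lines of plain Python. This toy example was written for this article; the two dictionaries simply stand in for whatever stores the batch and speed layers actually use.

# Toy sketch of the Lambda serving layer: merge the batch view with the speed view.
# The dictionaries are stand-ins for real batch and speed layer storage.
batch_view = {"page_a": 10500, "page_b": 7200}   # recomputed on every batch run
speed_view = {"page_a": 42, "page_c": 7}         # events seen since the last batch run

def serve(key):
    """Answer a query by combining both views (eventually consistent result)."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

for page in ("page_a", "page_b", "page_c"):
    print(page, serve(page))   # page_a 10542, page_b 7200, page_c 7

When the next batch run completes, the batch view absorbs those recent events and the speed view is reset, which is exactly the re-computation cycle described above.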
The figure below shows multiple technologies used for such a realization, however once the data is acquired from multiple sources and queued in messaging layer for ingestion, the Lambda architecture pattern in form of ingestion layer, batch layer.and speed layer springs into action: Figure 03: Layers in Data Lake Data Acquisition Layer:In an organization, data exists in various forms which can be classified as structured data, semi-structured data, or as unstructured data.One of the key roles expected from the acquisition layer is to be able convert the data into messages that can be further processed in a data lake, hence the acquisition layer is expected to be flexible to accommodate variety of schema specifications at the same time must have a fast connect mechanism to seamlessly push all the translated data messages into the data lake. A typical flow can be represented as shown below. Figure 04: Data Acquisition Layer Messaging Layer: The messaging layer would form the Message Oriented Middleware (MOM) for the data lake architecture and hence would be the primary layer for decoupling the various layers with each other, but with guaranteed delivery of messages.The other aspect of a messaging layer is its ability to enqueue and dequeue messages, as in the case with most of the messaging frameworks. Most of the messaging frameworks provide enqueue and dequeue mechanisms to manage publishing and consumption of messages respectively. Every messaging frameworks provides its own set of libraries to connect to its resources(queues/topics). Figure 05: Message Queue Additionally the messaging layer also can perform the role of data stream producer which can converted the queued data into continuous streams of data which can be passed on for data ingestion. Data Ingestion Layer: A fast ingestion layer is one of the key layers in Lambda Architecture pattern. This layer needs to ensure how fast can data be delivered into working models of Lambda architecture.  The data ingestion layer is responsible for consuming the messages from the messaging layer and perform the required transformation for ingesting them into the lambda layer (batch and speed layer) such that the transformed output conforms to the expected storage or processing formats. Figure 06: Data Ingestion Layer Batch Processing: The batch processing layer of lambda architecture is expected to process the ingested data in batches so as to have optimum utilization of system resources, at the same time, long running operations may be applied to the data to ensure high quality of data output, which is also known as Modelled data. The conversion of raw data to a modelled data is the primary responsibility of this layer, wherein the modelled data is the data model which can be served by serving layers of lambda architecture. While Hadoop as a framework has multiple components that can process data as a batch, each data processing in Hadoop is a map reduce process. A map and reduce paradigm of process execution is not a new paradigm, rather it has been used in many application ever since mainframe systems came into existence. It is based on divide and rule and stems from the traditional multi-threading model. The primary mechanism here is to divide the batch across multiple processes and then combine/reduce output of all the processes into a single output. Figure 07: Batch Processing Speed (Near Real Time) Data Processing: This layer is expected to perform near real time processing on data received from ingestion layer. 
Since the processing is expected to be in near real time, such data processing will need to be quick, fast and efficient, with support and design for high concurrency scenarios and eventually consistent outcome. The real-time processing was often dependent on data like the look-up data and reference data, hence there was a need to have a very fast data layer such that any look-up or reference data does not adversely impact the real-time nature of the processing. Near real time data processing pattern is not very different from the way it is done in batch mode, but the primary difference being that the data is processed as soon as it is available for processing and does not have to be batched, as shown below. Figure 08: Speed (Near Real Time) Processing Data Storage Layer: The data storage layer is very eminent in the lambda architecture pattern as this layer defines the reactivity of the overall solution to the incoming event/data streams. The storage, in context of lambda architecture driven data lake can be classified broadly into non-indexed and indexed data storage. Typically, the batch processing is performed on non-indexed data stored as data blocks for faster batch processing, while speed (near real time processing) is performed on indexed data which can be accessed randomly and supports complex search patterns by means of inverted indexes. Both of these models are depicted below. Figure 09: Non-Indexed and Indexed Data Storage Examples Lambda in action Once all the layers in lambda architecture have performed their respective roles, the data can be exported, exposed via services and can be delivered through other protocols from the data lake. This can also include merging the high quality processed output from batch processing with indexed storage, using technologies and frameworks, so as to provide enriched data for near real time requirements as well with interesting visualizations. Figure 10: Lambda in action Summary In this article we have briefly discussed a practical approach towards implementing a data lake for enterprises by leveraging Lambda architecture pattern. Resources for Article: Further resources on this subject: The Microsoft Azure Stack Architecture [article] System Architecture and Design of Ansible [article] Microservices and Service Oriented Architecture [article]
Movie Recommendation

Packt
16 Jun 2017
14 min read
In this article by Robert Layton author of the book Learning Data Mining with Python - Second Edition is the second revision of Learning Data Mining with Python by Robert Layton improves upon the first book with updated examples, more in-depth discussion and exercises for your future development with data analytics. In this snippet from the book, we look at movie recommendation with a technique known as Affinity Analysis. (For more resources related to this topic, see here.) Affinity Analysis Affinity Analysis is the task of determining when objects are used in similar ways. We focused on whether the objects themselves are similar. The data for Affinity Analysis are often described in the form of a transaction. Intuitively, this comes from a transaction at a store—determining when objects are purchased together as a way to recommend products to users that they might purchase. Other use cases for Affinity Analysis include: Fraud detection Customer segmentation Software optimization Product recommendations Affinity Analysis is usually much more exploratory than classification. At the very least, we often simply rank the results and choose the top 5 recommendations (or some other number), rather than expect the algorithm to give us a specific answer. Algorithms for Affinity Analysis A brute force solution, testing all possible combinations, is not efficient enough for real-world use. We could expect even a small store to have hundreds of items for sale, while many online stores would have thousands (or millions!). As we add more items, the time it takes to compute all rules increases significantly faster. Specifically, the total possible number of rules is 2n - 1. Even the drastic increase in computing power couldn't possibly keep up with the increases in the number of items stored online. Therefore, we need algorithms that work smarter, as opposed to computers that work harder. The Apriori algorithm addresses the exponential problem of creating sets of items that occur frequently within a database, called frequent itemsets. Once these frequent itemsets are discovered, creating association rules is straightforward. The intuition behind Apriori is both simple and clever. First, we ensure that a rule has sufficient support within the dataset. Defining a minimum support level is the key parameter for Apriori. To build a frequent itemset, for an itemset (A, B) to have a support of at least 30, both A and B must occur at least 30 times in the database. This property extends to larger sets as well. For an itemset (A, B, C, D) to be considered frequent, the set (A, B, C) must also be frequent (as must D). Apriori discovers larger frequent itemsets by building off smaller frequent itemsets. The picture below outlines the full process: The Movie Recommendation Problem Product recommendation is a big business. Online stores use it to up-sell to customers by recommending other products that they could buy. Making better recommendations leads to better sales. When online shopping is selling to millions of customers every year, there is a lot of potential money to be made by selling more items to these customers. Grouplens, a research group at the University of Minnesota, has released several datasets that are often used for testing algorithms in this area. They have released several versions of a movie rating dataset, which have different sizes. There is a version with 100,000 reviews, one with 1 million reviews and one with 10 million reviews. 
The datasets are available from http://grouplens.org/datasets/movielens/ and the dataset we are going to use in this article is the MovieLens 100K dataset (with 100,000 reviews). Download this dataset and unzip it in your data folder. Start a new Jupyter Notebook and type the following code: import os import pandas as pd data_folder = os.path.join(os.path.expanduser("~"), "Data", "ml-100k") ratings_filename = os.path.join(data_folder, "u.data") Ensure that ratings_filename points to the u.data file in the unzipped folder. Loading with pandas The MovieLens dataset is in a good shape; however, there are some changes from the default options in pandas.read_csv that we need to make. When loading the file, we set the delimiter parameter to the tab character, tell pandas not to read the first row as the header (with header=None) and to set the column names with given values. Let's look at the following code: all_ratings = pd.read_csv(ratings_filename, delimiter="t", header=None, names = ["UserID", "MovieID", "Rating", "Datetime"]) While we won't use it in this article, you can properly parse the date timestamp using the following line. Dates for reviews can be an important feature in recommendation prediction, as movies that are rated together often have more similar rankings than movies ranked separately. Accounting for this can improve models significantly. all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'], unit='s') Understanding the Apriori algorithm and its implementation The goal of this article is to produce rules of the following form: if a person recommends this set of movies, they will also recommend this movie. We will also discuss extensions where a person recommends a set of movies is likely to recommend another particular movie. To do this, we first need to determine if a person recommends a movie. We can do this by creating a new feature Favorable, which is True if the person gave a favorable review to a movie: all_ratings["Favorable"] = all_ratings["Rating"] > 3 We will sample our dataset to form a training data. This also helps reduce the size of the dataset that will be searched, making the Apriori algorithm run faster. We obtain all reviews from the first 200 users: ratings = all_ratings[all_ratings['UserID'].isin(range(200))] Next, we can create a dataset of only the favorable reviews in our sample: favorable_ratings = ratings[ratings["Favorable"]] We will be searching the user's favorable reviews for our itemsets. So, the next thing we need is the movies which each user has given a favorable rating. We can compute this by grouping the dataset by the UserID and iterating over the movies in each group: favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby("UserID")["MovieID"]) In the preceding code, we stored the values as a frozenset, allowing us to quickly check if a movie has been rated by a user. Sets are much faster than lists for this type of operation, and we will use them in a later code. Finally, we can create a DataFrame that tells us how frequently each movie has been given a favorable review: num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum() We can see the top five movies by running the following code: num_favorable_by_movie.sort_values(by="Favorable", ascending=False).head() Implementing the Apriori algorithm On the first iteration of Apriori, the newly discovered itemsets will have a length of 2, as they will be supersets of the initial itemsets created in the first step. 
On the second iteration (after applying the fourth step and going back to step 2), the newly discovered itemsets will have a length of 3. This allows us to quickly identify the newly discovered itemsets, as needed in the second step. We can store our discovered frequent itemsets in a dictionary, where the key is the length of the itemsets. This allows us to quickly access the itemsets of a given length, and therefore the most recently discovered frequent itemsets, with the help of the following code: frequent_itemsets = {} We also need to define the minimum support needed for an itemset to be considered frequent. This value is chosen based on the dataset but try different values to see how that affects the result. I recommend only changing it by 10 percent at a time though, as the time the algorithm takes to run will be significantly different! Let's set a minimum support value: min_support = 50 To implement the first step of the Apriori algorithm, we create an itemset with each movie individually and test if the itemset is frequent. We use frozenset, as they allow us to perform faster set-based operations later on, and they can also be used as keys in our counting dictionary (normal sets cannot). Let's look at the following example of frozenset code: frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"]) for movie_id, row in num_favorable_by_movie.iterrows() if row["Favorable"] > min_support) We implement the second and third steps together for efficiency by creating a function that takes the newly discovered frequent itemsets, creates the supersets, and then tests if they are frequent. First, we set up the function to perform these steps: from collections import defaultdict def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support): counts = defaultdict(int) for user, reviews in favorable_reviews_by_users.items(): for itemset in k_1_itemsets: if itemset.issubset(reviews): for other_reviewed_movie in reviews - itemset: current_superset = itemset | frozenset((other_reviewed_movie,)) counts[current_superset] += 1 return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support]) In keeping with our rule of thumb of reading through the data as little as possible, we iterate over the dataset once per call to this function. While this doesn't matter too much in this implementation (our dataset is relatively small compared to the average computer), single-pass is a good practice to get into for larger applications. Let's have a look at the core of this function in detail. We iterate through each user, and each of the previously discovered itemsets, and then check if it is a subset of the current set of reviews, which are stored in k_1_itemsets (note that here, k_1 means k-1). If it is, this means that the user has reviewed each movie in the itemset. This is done by the itemset.issubset(reviews) line. We can then go through each individual movie that the user has reviewed (that is not already in the itemset), create a superset by combining the itemset with the new movie and record that we saw this superset in our counting dictionary. These are the candidate frequent itemsets for this value of k. We end our function by testing which of the candidate itemsets have enough support to be considered frequent and return only those that have a support more than our min_support value. 
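Before wiring this helper into the full loop, it can be worth sanity-checking it on a tiny, made-up transaction set; the user IDs, movie IDs and counts below are invented purely for illustration and are not part of the MovieLens data.

# Quick sanity check of find_frequent_itemsets on invented data.
toy_reviews = {
    1: frozenset({10, 20, 30}),
    2: frozenset({10, 20}),
    3: frozenset({10, 30}),
    4: frozenset({20, 30}),
}
# Length-1 itemsets with their counts, as the first Apriori step would produce them
toy_k1 = {frozenset({10}): 3, frozenset({20}): 3, frozenset({30}): 3}

print(find_frequent_itemsets(toy_reviews, toy_k1, min_support=2))
# Prints the three pairs {10, 20}, {10, 30} and {20, 30}, each with a count of 4,
# because a pair is counted once per user for every length-1 subset it extends.

On the toy data every pair clears the threshold; on the real sample, the min_support of 50 prunes far more aggressively.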
This function forms the heart of our Apriori implementation and we now create a loop that iterates over the steps of the larger algorithm, storing the new itemsets as we increase k from 1 to a maximum value. In this loop, k represents the length of the soon-to-be discovered frequent itemsets, allowing us to access the previously most discovered ones by looking in our frequent_itemsets dictionary using the key k - 1. We create the frequent itemsets and store them in our dictionary by their length. Let's look at the code: for k in range(2, 20): # Generate candidates of length k, using the frequent itemsets of length k-1 # Only store the frequent itemsets cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support) if len(cur_frequent_itemsets) == 0: print("Did not find any frequent itemsets of length {}".format(k)) sys.stdout.flush() break else: print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k)) sys.stdout.flush() frequent_itemsets[k] = cur_frequent_itemsets Extracting association rules After the Apriori algorithm has completed, we have a list of frequent itemsets. These aren't exactly association rules, but they can easily be converted into these rules. For each itemset, we can generate a number of association rules by setting each movie to be the conclusion and the remaining movies as the premise.  candidate_rules = [] for itemset_length, itemset_counts in frequent_itemsets.items(): for itemset in itemset_counts.keys(): for conclusion in itemset: premise = itemset - set((conclusion,)) candidate_rules.append((premise, conclusion)) In these rules, the first partis the list of movies in the premise, while the number after it is the conclusion. In the first case, if a reviewer recommends movie 79, they are also likely to recommend movie 258. The process of computing confidence starts by creating dictionaries to store how many times we see the premise leading to the conclusion (a correct example of the rule) and how many times it doesn't (an incorrect example). We then iterate over all reviews and rules, working out whether the premise of the rule applies and, if it does, whether the conclusion is accurate. correct_counts = defaultdict(int) incorrect_counts = defaultdict(int) for user, reviews in favorable_reviews_by_users.items(): for candidate_rule in candidate_rules: premise, conclusion = candidate_rule if premise.issubset(reviews): if conclusion in reviews: correct_counts[candidate_rule] += 1 else: incorrect_counts[candidate_rule] += 1 We then compute the confidence for each rule by dividing the correct count by the total number of times the rule was seen: rule_confidence = {candidate_rule: (correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])) for candidate_rule in candidate_rules} Now we can print the top five rules by sorting this confidence dictionary and printing the results: from operator import itemgetter sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True) for index in range(5): print("Rule #{0}".format(index + 1)) premise, conclusion = sorted_confidence[index][0] print("Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion)) print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)])) print("") The resulting printout shows only the movie IDs, which isn't very helpful without the names of the movies also. 
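Confidence is not the only way to rank rules. Before we map the movie IDs back to readable titles, here is an optional extension, not part of the book's walkthrough, that reuses the dictionaries we have already built to compute each rule's support, that is, the fraction of users who exhibit the premise and the conclusion together:

# Optional extension: rank rules by support as well as confidence.
# correct_counts[rule] already counts the users matching premise AND conclusion.
n_users = len(favorable_reviews_by_users)
rule_support = {rule: correct_counts[rule] / n_users for rule in candidate_rules}

min_rule_support = 0.1   # arbitrary illustrative threshold
filtered = [(rule, rule_confidence[rule], rule_support[rule])
            for rule in candidate_rules
            if rule_support[rule] >= min_rule_support]
filtered.sort(key=lambda item: item[1], reverse=True)

for (premise, conclusion), confidence, support in filtered[:5]:
    print("{0} -> {1}: confidence {2:.3f}, support {3:.3f}".format(
        set(premise), conclusion, confidence, support))

Rules that are confident but have tiny support are usually not worth showing to users. With that aside done, let's make the printout readable.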
The dataset came with a file called u.item, which stores the movie names and their corresponding MovieID (as well as other information, such as the genre). We can load the titles from this file using pandas. Additional information about the file and categories is available in the README file that came with the dataset. The data in the file is in CSV format, but with values separated by the | symbol; it has no header and the encoding is important to set. The column names were found in the README file.

movie_name_filename = os.path.join(data_folder, "u.item")
movie_name_data = pd.read_csv(movie_name_filename, delimiter="|", header=None, encoding="mac-roman")
movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release", "IMDB", "<UNK>", "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

Let's also create a helper function for finding the name of a movie by its ID:

def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title

We can now adjust our previous code for printing out the top rules to also include the titles:

for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")

The results give recommendations for movies, based on the previous movies that person liked. Give it a shot and see if it matches your expectations!

Learning Data Mining with Python

In this short section of Learning Data Mining with Python, Second Edition, we performed Affinity Analysis in order to recommend movies based on a large set of reviewers. We did this in two stages. First, we found frequent itemsets in the data using the Apriori algorithm. Then, we created association rules from those itemsets. We performed training on a subset of our data in order to find the association rules, and then tested those rules on the rest of the data, a testing set. We could extend this concept to use cross-fold validation to better evaluate the rules. This would lead to a more robust evaluation of the quality of each rule. The book covers topics such as classification, clustering, text analysis, image recognition, TensorFlow, and Big Data. Each section comes with a practical real-world example, steps through the code in detail, and provides suggestions for you to continue your (machine) learning.

Summary

In this article we looked at movie recommendation with a technique known as Affinity Analysis, finding frequent itemsets with the Apriori algorithm and turning them into association rules.

Resources for Article:
Further resources on this subject:
Expanding Your Data Mining Toolbox [article]
Data mining [article]
Big Data Analysis [article]
Backpropagation Algorithm

Packt
08 Jun 2017
11 min read
In this article by Gianmario Spacagna, Daniel Slater, Phuong Vo.T.H, and Valentino Zocca, the authors of the book Python Deep Learning, we will learnthe Backpropagation algorithmas it is one of the most important topics for multi-layer feed-forward neural networks. (For more resources related to this topic, see here.) Propagating the error back from last to first layer, hence the name Backpropagation. Backpropagation is one of the most difficult algorithms to understand at first, but all is needed is some knowledge of basic differential calculus and the chain rule. For a deep neural network the algorithm to set the weights is called the Backpropagation algorithm. The Backpropagation algorithm We have seen how neural networks can map inputs onto determined outputs, depending on fixed weights. Once the architecture of the neural network has been defined (feed-forward, number of hidden layers, number of neurons per layer), and once the activity function for each neuron has been chosen, we will need to set the weights that in turn will define the internal states for each neuron in the network. We will see how to do that for a 1-layer network and then how to extend it to a deep feed-forward network. For a deep neural network the algorithm to set the weights is called the Backpropagation algorithm, and we will discuss and explain this algorithm for most of this section as it is one of the most important topics for multilayer feed-forward neural networks. First, however, we will quickly discuss this for 1-layer neural networks. The general concept we need to understand is the following: every neural network is an approximation of a function, therefore each neural network will not be equal to the desired function, instead it will differ by some value. This value is called the error and the aim is to minimize this error. Since the error is a function of the weights in the neural network, we want to minimize the error with respect to the weights. The error function is a function of many weights, it is therefore a function of many variables. Mathematically, the set of points where this function is zero represents therefore a hypersurface and to find a minimum on this surface we want to pick a point and then follow a curve in the direction of the minimum. Linear regression To simplify things we are going to introduce matrix notation. Let x be the input, we can think of x as a vector. In the case of linear regression we are going to consider a single output neuron y, the set of weights w is therefore a vector of dimension the same as the dimension of x. The activation value is then defined as the inner product <x, w>. Let's say that for each input value x we want to output a target value t, while for each x the neural network will output a value y defined by the activity function chosen, in this case the absolute value of the difference (y-t) represents the difference between the predicted value and the actual value for the specific input example x. If we have m input values xi, each of them will have a target value ti. In this case we calculate the error using the mean squared error , where each yi is a function of w. The error is therefore a function of w and it is usually denoted with J(w). As mentioned above, this represents a hypersurface of dimension equal to the dimension of w (we are implicitly also considering the bias), and for each wj we need to find a curve that will lead towards the minimum of the surface. 
The direction in which a curve increases in a certain direction is given by its derivative with respect to that direction, in this case by: And in order to move towards the minimum we need to move in the opposite direction set by  for each wj. Let's calculate: If , then  and therefore: The notation can sometimes be confusing, especially the first time one sees it. The input is given by vectors xi, where the superscript indicated the ith example. Since x and w are vectors, the subscript indicates the jth coordinate of the vector. yi then represents the output of the neural network given the input xi, while ti represents the target, that is, the desired value corresponding to the input xi. In order to move towards the minimum, we need to move each weight in the direction of its derivative by a small amount l, called the learning rate, typically much smaller than 1, (say 0.1 or smaller). We can therefore drop the 2 in the derivative and incorporate it in the learning rate, to get the update rule therefore given by: or, more in general, we can write the update rule in matrix form as: where ∇ represents the vector of partial derivatives. This process is what is often called gradient descent. One last note, the update can be done after having calculated all the input vectors, however, in some cases, the weights could be updated after each example or after a defined preset number of examples. Logistic regression In logistic regression, the output is not continuous; rather it is defined as a set of classes. In this case, the activation function is not going to be the identity function like before, rather we are going to use the logistic sigmoid function. The logistic sigmoid function, as we have seen before, outputs a real value in (0,1) and therefore it can be interpreted as a probability function, and that is why it can work so well in a 2-class classification problem. In this case, the target can be one of two classes, and the output represents the probability that it be one of those two classes (say t=1).Let’s denote with σ(a), with a the activation value,the logistic sigmoid function, therefore, for each examplex, the probability that the output be the class y, given the weights w, is: We can write that equation more succinctly as: and, since for each sample xi the probabilities are independent, we have that the global probability is: If we take the natural log of the above equation (to turn products into sums), we get: The object is now to maximize this log to obtain the highest probability of predicting the correct results. Usually, this is obtained, as in the previous case, by using gradient descent to minimize the cost function defined by. As before, we calculate the derivative of the cost function with respect to the weights wj to obtain: In general, in case of a multi-class output t, with t a vector (t1, …, tn), we can generalize this equation using J (w) = −log(P( y x,w))= Ei,j ti j, log ( (di)) that brings to the update equation for the weights: This is similar to the update rule we have seen for linear regression. Backpropagation In the case of 1-layer, weight-adjustment was easy, as we could use linear or logistic regression and adjust the weights simultaneously to get a smaller error (minimizing the cost function). 
For multi-layer neural networks we can use a similar argument for the weights used to connect the last hidden layer to the output layer, as we know what we would like the output layer to be, but we cannot do the same for the hidden layers, as, a priori, we do not know what the values for the neurons in the hidden layers ought to be. What we do, instead, is calculate the error in the last hidden layer and estimate what it would be in the previous layer, propagating the error back from last to first layer, hence the name Backpropagation. Backpropagation is one of the most difficult algorithms to understand at first, but all is needed is some knowledge of basic differential calculus and the chain rule. Let's introduce some notation first. We denote with Jthe cost (error), with y the activity function that is defined on the activation value a (for example y could be the logistic sigmoid), which is a function of the weights w and the input x. Let's also define wi,j the weight between the ith input value and the jth output. Here we define input and output more generically than for 1-layer network, if wi,j connects a pair of successive layers in a feed-forward network, we denote as input the neurons on the first of the two successive layers, and output the neurons on the second of the two successive layers. In order not to make the notation too heavy, and have to denote on which layer each neuron is, we assume that the ith input yi is always in the layer preceding the layer containing the jth output yj The letter y is used to both denote an input and the activity function, and we can easily infer which one we mean by the contest. We also use subscripts i and jwhere we always have ibelonging to the layer preceding the layer containing the element with subscript j. We also use subscripts i and j, where we always have the element with subscript i belonging to the layer preceding the layer containing the element with subscript j. In this example, layer 1 represents the input, and layer 2 the output Using this notation, and the chain-rule for derivatives, for the last layer of our neural network we can write: Since we know that , we have: If y is the logistic sigmoid defined above, we get the same result we have already calculated at the end of the previous section, since we know the cost function and we can calculate all derivatives. For the previous layers the same formula holds: Since we know that and we know that  is the derivative of the activity function that we can calculate, all we need to calculate is the derivative . Let's notice that this is the derivative of the error with respect to the activation function in the second layer, and, if we can calculate this derivative for the last layer, and have a formula that allows us to calculate the derivative for one layer assuming we can calculate the derivative for the next, we can calculate all the derivatives starting from the last layer and move backwards. Let us notice that, as we defined the yj, they are the activation values for the neurons in the second layer, but they are also the activity functions, therefore functions of the activation values in the first layer. Therefore, applying the chain rule, we have: and once again we can calculate both and, so , once we knowwe can calculate, and since we can calculate for the last layer, we can move backward and calculate for any layer and therefore  for any layer. 
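To make this backward recursion tangible before the formal summary, here is a tiny NumPy sketch for a network with a single hidden layer, sigmoid activities, and a squared-error cost. It is an illustrative toy written for this article, not the code example referred to below, and the shapes and learning rate are arbitrary.

# Minimal numeric sketch of the backward recursion (single sample, one hidden layer).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])          # inputs y_i (layer 0)
t = np.array([1.0])                # target
W1 = rng.normal(size=(2, 3))       # weights w_ij between layer 0 and layer 1
W2 = rng.normal(size=(3, 1))       # weights w_jk between layer 1 and layer 2
lr = 0.1                           # learning rate (lambda in the text)

# Forward pass: activation values a and activities y for each layer
a1 = x @ W1;  y1 = sigmoid(a1)
a2 = y1 @ W2; y2 = sigmoid(a2)

# Backward pass: delta = dJ/da, starting from the last layer
delta2 = (y2 - t) * y2 * (1 - y2)           # error at the output neurons
delta1 = (delta2 @ W2.T) * y1 * (1 - y1)    # error propagated back one layer

# Weight updates: w_ij <- w_ij - lr * y_i * delta_j
W2 -= lr * np.outer(y1, delta2)
W1 -= lr * np.outer(x, delta1)

The two delta lines are the whole algorithm in miniature: the output error comes straight from the cost, and every earlier layer's error is the following layer's error pushed back through the weights and scaled by the local derivative of the activity function.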
Summarizing, if we have a sequence of layers where yi feeds into yj, which in turn feeds into yk, we then have these two fundamental equations, where the summation in the second equation should read as the sum over all the outgoing connections from yj to any neuron yk in the successive layer:

∂J/∂wi,j = (∂J/∂yj) (∂yj/∂aj) yi
∂J/∂yj = Σk (∂J/∂yk) (∂yk/∂ak) wj,k

By using these two equations we can calculate the derivatives for the cost with respect to each layer. If we set δj = (∂J/∂yj) (∂yj/∂aj), then δj represents the variation of the cost with respect to the activation value, and we can think of δj as the error at the neuron yj. We can then rewrite the first fundamental equation as:

∂J/∂wi,j = δj yi

which implies that δj = ∂J/∂aj. These two equations give an alternate way of seeing Backpropagation, as the variation of the cost with respect to the activation value, and provide a formula to calculate this variation for any layer once we know the variation for the following layer:

δj = (∂yj/∂aj) Σk δk wj,k

We can also combine these equations and show that:

∂J/∂wi,j = yi (∂yj/∂aj) Σk δk wj,k

The Backpropagation algorithm for updating the weights is then given on each layer by:

wi,j ← wi,j - λ δj yi

In the last section we will provide a code example that will help understand and apply these concepts and formulas.

Summary

At the end of this article we learned what comes after defining the neural network architecture: using the Backpropagation algorithm to set the weights. We saw how we can stack many layers to create and use deep feed-forward neural networks, and why the inner (hidden) layers are important.

Resources for Article:
Further resources on this subject:
Basics of Jupyter Notebook and Python [article]
Jupyter and Python Scripting [article]
Getting Started with Python Packages [article]
Top 10 deep learning frameworks

Amey Varangaonkar
25 May 2017
9 min read
Deep learning frameworks are powering the artificial intelligence revolution. Without them, it would be almost impossible for data scientists to deliver the level of sophistication in their deep learning algorithms that advances in computing and processing power have made possible. Put simply, deep learning frameworks make it easier to build deep learning algorithms of considerable complexity. This follows a wider trend that you can see in other fields of programming and software engineering; open source communities are continually are developing new tools that simplify difficult tasks and minimize arduous ones. The deep learning framework you choose to use is ultimately down to what you're trying to do and how you work already. But to get you started here is a list of 10 of the best and most popular deep learning frameworks being used today. What are the best deep learning frameworks? Tensorflow One of the most popular Deep Learning libraries out there, Tensorflow, was developed by the Google Brain team and open-sourced in 2015. Positioned as a ‘second-generation machine learning system’, Tensorflow is a Python-based library capable of running on multiple CPUs and GPUs. It is available on all platforms, desktop, and mobile. It also has support for other languages such as C++ and R and can be used directly to create deep learning models, or by using wrapper libraries (for e.g. Keras) on top of it. In November 2017, Tensorflow announced a developer preview for Tensorflow Lite, a lightweight machine learning solution for mobile and embedded devices. The machine learning paradigm is continuously evolving - and the focus is now slowly shifting towards developing machine learning models that run on mobile and portable devices in order to make the applications smarter and more intelligent. Learn how to build a neural network with TensorFlow. If you're just starting out with deep learning, TensorFlow is THE go-to framework. It’s Python-based, backed by Google, has a very good documentation, and there are tons of tutorials and videos available on the internet to guide you. You can check out Packt’s TensorFlow catalog here. Keras Although TensorFlow is a very good deep learning library, creating models using only Tensorflow can be a challenge, as it is a pretty low-level library and can be quite complex to use for a beginner. To tackle this challenge, Keras was built as a simplified interface for building efficient neural networks in just a few lines of code and it can be configured to work on top of TensorFlow. Written in Python, Keras is very lightweight, easy to use, and pretty straightforward to learn. Because of these reasons, Tensorflow has incorporated Keras as part of its core API. Despite being a relatively new library, Keras has a very good documentation in place. If you want to know more about how Keras solves your deep learning problems, this interview by our best-selling author Sujit Pal should help you. Read now: Why you should use Keras for deep learning [box type="success" align="" class="" width=""]If you have some knowledge of Python programming and want to get started with deep learning, this is one library you definitely want to check out![/box] Caffe Built with expression, speed, and modularity in mind, Caffe is one of the first deep learning libraries developed mainly by Berkeley Vision and Learning Center (BVLC). It is a C++ library which also has a Python interface and finds its primary application in modeling Convolutional Neural Networks. 
One of the major benefits of using this library is that you can get a number of pre-trained networks directly from the Caffe Model Zoo, available for immediate use. If you’re interested in modeling CNNs or solve your image processing problems, you might want to consider this library. Following the footsteps of Caffe, Facebook also recently open-sourced Caffe2, a new light-weight, modular deep learning framework which offers greater flexibility for building high-performance deep learning models. Torch Torch is a Lua-based deep learning framework and has been used and developed by big players such as Facebook, Twitter and Google. It makes use of the C/C++ libraries as well as CUDA for GPU processing.  Torch was built with an aim to achieve maximum flexibility and make the process of building your models extremely simple. More recently, the Python implementation of Torch, called PyTorch, has found popularity and is gaining rapid adoption. PyTorch PyTorch is a Python package for building deep neural networks and performing complex tensor computations. While Torch uses Lua, PyTorch leverages the rising popularity of Python, to allow anyone with some basic Python programming language to get started with deep learning. PyTorch improves upon Torch’s architectural style and does not have any support for containers - which makes the entire deep modeling process easier and transparent to you. Still wondering how PyTorch and Torch are different from each other? Make sure you check out this interesting post on Quora. Deeplearning4j DeepLearning4j (or DL4J) is a popular deep learning framework developed in Java and supports other JVM languages as well. It is very slick and is very widely used as a commercial, industry-focused distributed deep learning platform. The advantage of using DL4j is that you can bring together the power of the whole Java ecosystem to perform efficient deep learning, as it can be implemented on top of the popular Big Data tools such as Apache Hadoop and Apache Spark. [box type="success" align="" class="" width=""]If Java is your programming language of choice, then you should definitely check out this framework. It is clean, enterprise-ready, and highly effective. If you’re planning to deploy your deep learning models to production, this tool can certainly be of great worth![/box] MXNet MXNet is one of the most languages-supported deep learning frameworks, with support for languages such as R, Python, C++ and Julia. This is helpful because if you know any of these languages, you won’t need to step out of your comfort zone at all, to train your deep learning models. Its backend is written in C++ and cuda, and is able to manage its own memory like Theano. MXNet is also popular because it scales very well and is able to work with multiple GPUs and computers, which makes it very useful for the enterprises. This is also one of the reasons why Amazon made MXNet its reference library for Deep Learning too. In November, AWS announced the availability of ONNX-MXNet, which is an open source Python package to import ONNX (Open Neural Network Exchange) deep learning models into Apache MXNet. Read why MXNet is a versatile deep learning framework here. Microsoft Cognitive Toolkit Microsoft Cognitive Toolkit, previously known by its acronym CNTK, is an open-source deep learning toolkit to train deep learning models. It is highly optimized and has support for languages such as Python and C++. 
Known for its efficient resource utilization, you can easily implement efficient Reinforcement Learning models or Generative Adversarial Networks (GANs) using the Cognitive Toolkit. It is designed to achieve high scalability and performance and is known to provide high-performance gains when compared to other toolkits like Theano and Tensorflow when running on multiple machines. Here is a fun comparison of TensorFlow versus CNTK, if you would like to know more. deeplearn.js Gone are the days when you required serious hardware to run your complex machine learning models. With deeplearn.js, you can now train neural network models right on your browser! Originally developed by the Google Brain team, deeplearn.js is an open-source, JavaScript-based deep learning library which runs on both WebGL 1.0 and WebGL 2.0. deeplearn.js is being used today for a variety of purposes - from education and research to training high-performance deep learning models. You can also run your pre-trained models on the browser using this library. BigDL BigDL is distributed deep learning library for Apache Spark and is designed to scale very well. With the help of BigDL, you can run your deep learning applications directly on Spark or Hadoop clusters, by writing them as Spark programs. It has a rich deep learning support and uses Intel’s Math Kernel Library (MKL) to ensure high performance. Using BigDL, you can also load your pre-trained Torch or Caffe models into Spark. If you want to add deep learning functionalities to a massive set of data stored on your cluster, this is a very good library to use. [box type="shadow" align="" class="" width=""]Editor's Note: We have removed Theano and Lasagne from the original list due to the Theano retirement announcement. RIP Theano! Before Tensorflow, Caffe or PyTorch came to be, Theano was the most widely used library for deep learning. While it was a low-level library supporting CPU as well as GPU computations, you could wrap it with libraries like Keras to simplify the deep learning process. With the release of version 1.0, it was announced that the future development and support for Theano would be stopped. There would be minimal maintenance to keep it working for the next one year, after which even the support activities on the library would be suspended completely. “Supporting Theano is no longer the best way we can enable the emergence and application of novel research ideas”, said Prof. Yoshua Bengio, one of the main developers of Theano. Thank you Theano, you will be missed! Goodbye Lasagne Lasagne is a high-level deep learning library that runs on top of Theano.  It has been around for quite some time now and was developed with the aim of abstracting the complexities of Theano, and provide a more friendly interface to the users to build and train neural networks. It requires Python and finds many similarities to Keras, which we just saw above. However, if we are to find differences between the two, Keras is faster and has a better documentation in place.[/box] There are many other deep learning libraries and frameworks available for use today – DSSTNE, Apache Singa, Veles are just a few worth an honorable mention. Which deep learning frameworks will best suit your needs? Ultimately, it depends on a number of factors. If you want to get started with deep learning, your safest bet would be to use a Python-based framework like Tensorflow, which are quite popular. 
For seasoned professionals, the efficiency of the trained model, ease of use, speed and resource utilization are all important considerations for choosing the best deep learning framework.
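To give a feel for the "few lines of code" claim made for Keras above, here is a minimal sketch of a small binary classifier. The data, layer sizes, and hyperparameters are placeholders invented for illustration, and the exact import path (keras versus tensorflow.keras) can vary with the version you have installed.

# Minimal Keras sketch with placeholder data; adjust imports to your installation.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(1000, 20)              # 1000 dummy samples with 20 features
y = (X.sum(axis=1) > 10).astype(int)      # dummy binary target

model = Sequential([
    Dense(32, activation="relu", input_shape=(20,)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))    # [loss, accuracy] on the training data

A comparable model in a lower-level framework would require writing the training loop and the gradient updates yourself, which is precisely the boilerplate these high-level APIs remove.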
Introduction to Titanic Datasets

Packt
09 May 2017
11 min read
In this article by Alexis Perrier, author of the book Effective Amazon Machine Learning says artificial intelligence and big data have become a ubiquitous part of our everyday lives; cloud-based machine learning services are part of a rising billion-dollar industry. Among the several such services currently available on the market, Amazon Machine Learning stands out for its simplicity. Amazon Machine Learning was launched in April 2015 with a clear goal of lowering the barrier to predictive analytics by offering a service accessible to companies without the need for highly skilled technical resources. (For more resources related to this topic, see here.) Working with datasets You cannot do predictive analytics without a dataset. Although we are surrounded by data, finding datasets that are adapted to predictive analytics is not always straightforward. In this section, we present some resources that are freely available. The Titanic datasetis a classic introductory datasets for predictive analytics. Finding open datasets There is a multitude of dataset repositories available online, from local to global public institutions to non-profit and data-focused start-ups. Here’s a small list of open dataset resources that are well suited forpredictive analytics. This, by far, is not an exhaustive list. This thread on Quora points to many other interesting data sources:https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public.You can also ask for specific datasets on Reddit at https://www.reddit.com/r/datasets/. The UCI Machine Learning Repository is a collection of datasets maintained by UC Irvine since 1987, hosting over 300 datasets related to classification, clustering, regression, and other ML tasks Mldata.org from the University of Berlinor the Stanford Large Network Dataset Collection and other major universities alsooffer great collections of open datasets Kdnuggets.com has an extensive list of open datasets at http://www.kdnuggets.com/datasets Data.gov and other US government agencies;data.UN.org and other UN agencies AWS offers open datasets via partners at https://aws.amazon.com/government-education/open-data/. The following startups are data centered and give open access to rich data repositories: Quandl and quantopian for financial datasets Datahub.io, Enigma.com, and Data.world are dataset-sharing sites Datamarket.com is great for time series datasets Kaggle.com, the data science competition website, hosts over 100 very interesting datasets AWS public datasets:AWS hosts a variety of public datasets,such as the Million Song Dataset, the mapping of the Human Genome, the US Census data as well as many others in Astrology, Biology, Math, Economics, and so on. These datasets are mostly available via EBS snapshots although some are directly accessible on S3. The datasets are large, from a few gigabytes to several terabytes, and are not meant to be downloaded on your local machine; they are only to be accessible via an EC2 instance (take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-public-data-sets.htmlfor further details).AWS public datasets are accessible at https://aws.amazon.com/public-datasets/. Introducing the Titanic dataset We will use the classic Titanic dataset. The dataconsists of demographic and traveling information for1,309 of the Titanic passengers, and the goal isto predict the survival of these passengers. 
The full Titanic dataset is available from the Department of Biostatistics at the Vanderbilt University School of Medicine (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv)in several formats. The Encyclopedia Titanica website (https://www.encyclopedia-titanica.org/) is the website of reference regarding the Titanic. It contains all the facts, history, and data surrounding the Titanic, including a full list of passengers and crew members. The Titanic datasetis also the subject of the introductory competition on Kaggle.com (https://www.kaggle.com/c/titanic, requires opening an account with Kaggle). You can also find a csv version in GitHub repository at https://github.com/alexperrier/packt-aml/blob/master/ch4. The Titanic data containsa mix of textual, Boolean, continuous, and categorical variables. It exhibits interesting characteristics such as missing values, outliers, and text variables ripe for text mining--a rich database that will allow us to demonstrate data transformations. Here’s a brief summary of the 14attributes: pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd) survival: A Boolean indicating whether the passenger survived or not (0 = No; 1 = Yes); this is our target name: A field rich in information as it contains title and family names sex: male/female age: Age, asignificant portion of values aremissing sibsp: Number of siblings/spouses aboard parch: Number of parents/children aboard ticket: Ticket number. fare: Passenger fare (British Pound). cabin: Doesthe location of the cabin influence chances of survival? embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) boat: Lifeboat, many missing values body: Body Identification Number home.dest: Home/destination Take a look at http://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf for more details on these variables. We have 1,309 records and 14 attributes, three of which we will discard. The home.dest attribute hastoo few existing values, the boat attribute is only present for passengers who have survived, and thebody attributeis only for passengers who have not survived. We will discard these three columnslater on while using the data schema. Preparing the data Now that we have the initial raw dataset, we are going to shuffle it, split it into a training and a held-out subset, and load it to an S3 bucket. Splitting the data In order to build and select the best model, we need to split the dataset into three parts: training, validation, and test, with the usual ratios being 60%, 20%, and 20%. The training and validation sets are used to build several models and select the best one while the test or held-out set, is used for the final performance evaluation on previously unseen data. Since Amazon ML does the job of splitting the dataset used for model training and model evaluation into a training and a validation subsets, we only need to split our initial dataset into two parts: the global training/evaluation subset (80%) for model building and selection, and the held-out subset (20%) for predictions and final model performance evaluation. Shuffle before you split:If you download the original data from the Vanderbilt University website,you will notice that it is ordered by pclass, the class of the passenger and by alphabetical order of the name column. The first 323 rows correspond to the 1st class followed by 2nd (277) and 3rd (709) class passengers. 
It is important to shuffle the data before you split it so that all the different variables have similar distributions in the training and held-out subsets. You can shuffle the data directly in the spreadsheet by creating a new column, generating a random number for each row, and then ordering by that column.

On GitHub: You will find an already shuffled titanic.csv file at https://github.com/alexperrier/packt-aml/blob/master/ch4/titanic.csv. In addition to shuffling the data, we have removed punctuation in the name column: commas, quotes, and parentheses, which can add confusion when parsing a csv file.

We end up with two files: titanic_train.csv with 1047 rows and titanic_heldout.csv with 263 rows. These files are also available in the GitHub repo (https://github.com/alexperrier/packt-aml/blob/master/ch4). The next step is to upload these files to S3 so that Amazon ML can access them.

Loading data on S3

AWS S3 is one of the main AWS services dedicated to hosting files and managing their access. Files in S3 can be public and open to the internet, or have access restricted to specific users, roles, or services. S3 is also used extensively by AWS itself for operations such as storing log files or results (predictions, scripts, queries, and so on).

Files in S3 are organized around the notion of buckets. Buckets are placeholders with unique names, similar to domain names for websites. A file in S3 will have a unique locator URI: s3://bucket_name/{path_of_folders}/filename. The bucket name is unique across S3. In this section, we will create a bucket for our data, upload the titanic training file, and open its access to Amazon ML. Go to https://console.aws.amazon.com/s3/home, and open an S3 account if you don't have one yet.

S3 pricing: S3 charges for the total volume of files you host and for the volume of file transfers, and prices depend on the region where the files are hosted. At the time of writing, for less than 1 TB, AWS S3 charges $0.03/GB per month in the US East region. All S3 prices are available at https://aws.amazon.com/s3/pricing/. See also http://calculator.s3.amazonaws.com/index.html for the AWS cost calculator.

Creating a bucket

Once you have created your S3 account, the next step is to create a bucket for your files. Click on the Create bucket button, then:

1. Choose a name and a region. Since bucket names are unique across S3, you must choose a name for your bucket that has not already been taken. We chose the name aml.packt for our bucket, and we will use this bucket throughout. Regarding the region, you should always select a region that is the closest to the person or application accessing the files, in order to reduce latency and prices.
2. Set Versioning, Logging, and Tags. Versioning will keep a copy of every version of your files, which prevents accidental deletions. Since versioning and logging induce extra costs, we chose to disable them.
3. Set permissions.
4. Review and save.

Loading the data

To upload the data, simply click on the Upload button and select the titanic_train.csv file we created earlier. You should, at this point, have the training dataset uploaded to your AWS S3 bucket. We added a /data folder in our aml.packt bucket to compartmentalize our objects. It will be useful later on, when the bucket will also contain folders created by S3.

At this point, only the owner of the bucket (you) is able to access and modify its contents. We need to grant the Amazon ML service permissions to read the data and add other files to the bucket.
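As a side note, the shuffle-and-split step described above does not have to be done in a spreadsheet; it can also be scripted. The following is a minimal, hypothetical Java sketch (not taken from the book) that shuffles titanic.csv with a fixed seed and writes an 80/20 split to the titanic_train.csv and titanic_heldout.csv files mentioned above; it assumes the csv file has a header row:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffleAndSplit {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("titanic.csv"));
        String header = lines.get(0);
        List<String> rows = new ArrayList<>(lines.subList(1, lines.size()));

        // shuffle with a fixed seed so the split is reproducible
        Collections.shuffle(rows, new Random(42));

        // 80% for training/evaluation, 20% held out for the final evaluation
        int trainSize = (int) (rows.size() * 0.8);
        write("titanic_train.csv", header, rows.subList(0, trainSize));
        write("titanic_heldout.csv", header, rows.subList(trainSize, rows.size()));
    }

    private static void write(String file, String header, List<String> rows) throws IOException {
        List<String> out = new ArrayList<>();
        out.add(header);    // keep the column names in both files
        out.addAll(rows);
        Files.write(Paths.get(file), out);
    }
}

Once the two files exist, they can be uploaded to S3 exactly as described in the next steps.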
When creating the Amazon ML datasource, we will be prompted to grant these bucket permissions in the Amazon ML console. We can also modify the bucket's policy upfront.

Granting permissions

We need to edit the policy of the aml.packt bucket. To do so, we have to perform the following steps:

1. Click into your bucket.
2. Select the Permissions tab.
3. In the drop down, select Bucket Policy. This will open an editor.
4. Paste in the following JSON. Make sure to replace {YOUR_BUCKET_NAME} with the name of your bucket and save:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AmazonML_s3:ListBucket",
      "Effect": "Allow",
      "Principal": { "Service": "machinelearning.amazonaws.com" },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}",
      "Condition": { "StringLike": { "s3:prefix": "*" } }
    },
    {
      "Sid": "AmazonML_s3:GetObject",
      "Effect": "Allow",
      "Principal": { "Service": "machinelearning.amazonaws.com" },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}/*"
    },
    {
      "Sid": "AmazonML_s3:PutObject",
      "Effect": "Allow",
      "Principal": { "Service": "machinelearning.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}/*"
    }
  ]
}

Further details on this policy are available at http://docs.aws.amazon.com/machine-learning/latest/dg/granting-amazon-ml-permissions-to-read-your-data-from-amazon-s3.html. Once again, this step is optional, since Amazon ML will prompt you for access to the bucket when you create the datasource.

Formatting the data

Amazon ML works on comma separated values files (.csv)--a very simple format where each row is an observation and each column is a variable or attribute. There are, however, a few conditions that should be met:

- The data must be encoded in plain text using a character set such as ASCII, Unicode, or EBCDIC
- All values must be separated by commas; if a value contains a comma, it should be enclosed in double quotes
- Each observation (row) must be smaller than 100k

There are also conditions regarding the end-of-line characters that separate rows. Special care must be taken when using Excel on OS X (Mac), as explained on this page: http://docs.aws.amazon.com/machine-learning/latest/dg/understanding-the-data-format-for-amazon-ml.html

What about other data file formats? Unfortunately, Amazon ML datasources are only compatible with csv files and Redshift databases, and do not accept formats such as JSON, TSV, or XML. However, other services such as Athena, a serverless database service, do accept a wider range of formats.

Summary

In this article, we learned how to find open datasets and how to work with the Titanic dataset for Amazon Machine Learning. We also learned how to prepare the data and load it to Amazon S3.

Resources for Article:

Further resources on this subject:

Processing Massive Datasets with Parallel Streams – the MapReduce Model [article]
Processing Next-generation Sequencing Datasets Using Python [article]
Combining Vector and Raster Datasets [article]

Using the Raspberry Pi Camera Module

Packt
06 Apr 2017
6 min read
This article by Shervin Emami, co-author of the book, Mastering OpenCV 3 - Second Edition, explains how to use the Raspberry Pi Camera Module for your Cartoonifier and Skin changer applications. While using a USB webcam on Raspberry Pi has the convenience of supporting identical behavior & code on desktop as on embedded device, you might consider using one of the official Raspberry Pi Camera Modules (referred to as the RPi Cams). They have some advantages and disadvantages over USB webcams: (For more resources related to this topic, see here.) RPi Cams uses the special MIPI CSI camera format, designed for smartphone cameras to use less power, smaller physical size, faster bandwidth, higher resolutions, higher framerates, and reduced latency, compared to USB. Most USB2.0 webcams can only deliver 640x480 or 1280x720 30 FPS video, since USB2.0 is too slow for anything higher (except for some expensive USB webcams that perform onboard video compression) and USB3.0 is still too expensive. Whereas smartphone cameras (including the RPi Cams) can often deliver 1920x1080 30 FPS or even Ultra HD / 4K resolutions. The RPi Cam v1 can in fact deliver up to 2592x1944 15 FPS or 1920x1080 30 FPS video even on a $5 Raspberry Pi Zero, thanks to the use of MIPI CSI for the camera and a compatible video processing ISP & GPU hardware inside the Raspberry Pi. The RPi Cam also supports 640x480 in 90 FPS mode (such as for slow-motion capture), and this is quite useful for real-time computer vision so you can see very small movements in each frame, rather than large movements that are harder to analyze. However the RPi Cam is a plain circuit board that is highly sensitive to electrical interference, static electricity or physical damage (simply touching the small orange flat cable with your finger can cause video interference or even permanently damage your camera!). The big flat white cable is far less sensitive but it is still very sensitive to electrical noise or physical damage. The RPi Cam comes with a very short 15cm cable. It's possible to buy third-party cables on eBay with lengths between 5cm to 1m, but cables 50cm or longer are less reliable, whereas USB webcams can use 2m to 5m cables and can be plugged into USB hubs or active extension cables for longer distances. There are currently several different RPi Cam models, notably the NoIR version that doesn't have an internal Infrared filter, therefore a NoIR camera can easily see in the dark (if you have an invisible Infrared light source), or see Infrared lasers or signals far clearer than regular cameras that include an Infrared filter inside them. There are also 2 different versions of RPi Cam (shown above): RPi Cam v1.3 and RPi Cam v2.1, where the v2.1 uses a wider angle lens with a Sony 8 Mega-Pixel sensor instead of a 5 Mega-Pixel OmniVision sensor, and has better support for motion in low lighting conditions, and adds support for 3240x2464 15 FPS video and potentially upto 120 FPS video at 720p. However, USB webcams come in thousands of different shapes & versions, making it easy to find specialized webcams such as waterproof or industrial-grade webcams, rather than requiring you to create your own custom housing for an RPi Cam. IP Cameras are also another option for a camera interface that can allow 1080p or higher resolution videos with Raspberry Pi, and IP cameras support not just very long cables, but potentially even work anywhere in the world using Internet. 
But IP cameras aren't quite as easy to interface with OpenCV as USB webcams or the RPi Cam. In the past, RPi Cams and the official drivers weren't directly compatible with OpenCV, you often used custom drivers and modified your code in order to grab frames from RPi Cams, but it's now possible to access an RPi Cam in OpenCV the exact same way as a USB Webcam! Thanks to recent improvements in the v4l2 drivers, once you load the v4l2 driver the RPi Cam will appear as file /dev/video0 or /dev/video1 like a regular USB webcam. So traditional OpenCV webcam code such as cv::VideoCapture(0) will be able to use it just like a webcam. Installing the Raspberry Pi Camera Module driver First let's temporarily load the v4l2 driver for the RPi Cam to make sure our camera is plugged in correctly: sudo modprobe bcm2835-v4l2 If the command failed (that is, it printed an error message to the console, or it froze, or the command returned a number beside 0), then perhaps your camera is not plugged in correctly. Shutdown then unplug power from your RPi and try attaching the flat white cable again, looking at photos on the Web to make sure it's plugged in the correct way around. If it is the correct way around, it's possible the cable wasn't fully inserted before you closed the locking tab on the RPi. Also check if you forgot to click Enable Camera when configuring your Raspberry Pi earlier, using the sudo raspi-config command. If the command worked (that is, the command returned 0 and no error was printed to the console), then we can make sure the v4l2 driver for the RPi Cam is always loaded on bootup, by adding it to the bottom of the /etc/modules file: sudo nano /etc/modules # Load the Raspberry Pi Camera Module v4l2 driver on bootup: bcm2835-v4l2 After you save the file and reboot your RPi, you should be able to run ls /dev/video* to see a list of cameras available on your RPi. If the RPi Cam is the only camera plugged into your board, you should see it as the default camera (/dev/video0), or if you also have a USB webcam plugged in then it will be either /dev/video0 or /dev/video1. Let's test the RPi Cam using the starter_video sample program: cd ~/opencv-3.*/samples/cpp DISPLAY=:0 ./starter_video 0 If it's showing the wrong camera, try DISPLAY=:0 ./starter_video 1. Now that we know the RPi Cam is working in OpenCV, let's try Cartoonifier! cd ~/Cartoonifier DISPLAY=:0 ./Cartoonifier 0 (or DISPLAY=:0 ./Cartoonifier 1 for the other camera). Resources for Article: Further resources on this subject: Video Surveillance, Background Modeling [article] Performance by Design [article] Getting started with Android Development [article]

Synchronization – An Approach to Delivering Successful Machine Learning Projects

Packt
06 Apr 2017
15 min read
“In the midst of chaos, there is also opportunity”                                                                                                                - Sun Tzu In this article, by Cory Lesmeister, the author of the book Mastering Machine Learning with R - Second Edition, Cory provides insights on ensuring the success and value of your machine learning endeavors. (For more resources related to this topic, see here.) Framing the problem Raise your hand if any of the following has happened or is currently happening to you: You’ve been part of a project team that failed to deliver anything of business value You attend numerous meetings, but they don’t seem productive; maybe they are even complete time wasters Different teams are not sharing information with each other; thus, you are struggling to understand what everyone else is doing, and they have no idea what you are doing or why you are doing it An unknown stakeholder, feeling threatened by your project, comes from out of nowhere and disparages you and/or your work The Executive Committee congratulates your team on their great effort, but decides not to implement it, or even worse, tells you to go back and do it all over again, only this time solve the real problem OK, you can put your hand down now. If you didn’t raise your hand, please send me your contact information because you are about as rare as a unicorn. All organizations, regardless of their size, struggle with integrating different functions, current operations, and other projects. In short, the real-world is filled with chaos. It doesn’t matter how many advanced degrees people have, how experienced they are, how much money is thrown at the problem, what technology is used, how brilliant and powerful the machine learning algorithm is, problems such as those listed above will happen. The bottom line is that implementing machine learning projects in the business world is complicated and prone to failure. However, out of this chaos you have the opportunity to influence your organization by integrating disparate people and teams, fostering a collaborative environment that can adapt to unforeseen changes. But, be warned, this is not easy. If it was easy everyone would be doing it. However, it works and, it works well. By it, I’m talking about the methodology I developed about a dozen years ago, a method I refer to as the “Synchronization Process”. If we ask ourselves, “what are the challenges to implementation”, it seems to me that the following blog post, clearly and succinctly sums it up: https://www.capgemini.com/blog/capping-it-off/2012/04/four-key-challenges-for-business-analytics It enumerates four challenges: Strategic alignment Agility Commitment Information maturity This blog addresses business analytics, but it can be extended to machine learning projects. One could even say machine learning is becoming the analytics tool of choice in many organizations. As such, I will make the case below that the Synchronization Process can effectively deal with the first three challenges. Not only that, the process can provide additional benefits. By overcoming the challenges, you can deliver an effective project, by delivering an effective project you can increase actionable insights and by increasing actionable insights, you will improve decision-making, and that is where the real business value resides. 
Defining the process “In preparing for battle, I have always found that plans are useless, but planning is indispensable.”                                                                                       - Dwight D. Eisenhower I adopted the term synchronization from the US Army’s operations manual, FM 3-0 where it is described as a battlefield tenet and force multiplier. The manual defines synchronization as, “…arranging activities in time, space and purpose to mass maximum relative combat power at a decisive place and time”. If we overlay this military definition onto the context of a competitive marketplace, we come up with a definition I find more relevant. For our purpose, synchronization is defined as, “arranging business functions and/or tasks in time and purpose to produce the proper amount of focus on a critical event or events”. These definitions put synchronization in the context of an “endstate” based on a plan and a vision. However, it is the process of seeking to achieve that endstate that the true benefits come to fruition. So, we can look at synchronization as not only an endstate, but also as a process. The military’s solution to synchronizing operations before implementing a plan is the wargame. Like the military, businesses and corporations have utilized wargaming to facilitate decision-making and create integration of different business functions. Following the synchronization process techniques explained below, you can take the concept of business wargaming to a new level. I will discuss and provide specific ideas, steps, and deliverables that you can implement immediately. Before we begin that discussion, I want to cover the benefits that the process will deliver. Exploring the benefits of the process When I created this methodology about a dozen years ago, I was part of a market research team struggling to commit our limited resources to numerous projects, all of which were someone’s top priority, in a highly uncertain environment. Or, as I like to refer to it, just another day at the office. I knew from my military experience that I had the tools and techniques to successfully tackle these challenges. It worked then and has been working for me ever since. I have found that it delivers the following benefits to an organization: Integration of business partners and stakeholders Timely and accurate measurement of performance and effectiveness Anticipation of and planning for possible events Adaptation to unforeseen threats Exploitation of unforeseen opportunities Improvement in teamwork Fostering a collaborative environment Improving focus and prioritization In market research, and I believe it applies to all analytical endeavors, including machine learning, we talked about focusing on three specific questions about what to measure: What are we measuring? When do we measure it? How will we measure it? We found that successfully answering those questions facilitated improved decision-making by informing leadership what STOP doing, what to START doing and what to KEEP doing. I have found myself in many meetings going nowhere when I would ask a question like, “what are you looking to stop doing?” Ask leadership what they want to stop, start, or continue to do and you will get to the core of the problem. Then, your job will be to configure the business decision as the measurement/analytical problem. The Synchronization Process can bring this all together in a coherent fashion. 
I’ve been asked often about what triggers in my mind that a project requires going through the Synchronization Process. Here are some of the questions you should consider, and if you answer “yes” to any of them, it may be a good idea to implement the process: Are resources constrained to the point that several projects will suffer poor results or not be done at all? Do you face multiple, conflicting priorities? Could the external environment change and dramatically influence project(s)? Are numerous stakeholders involved or influenced by a project’s result? Is the project complex and facing a high-level of uncertainty? Does the project involve new technology? Does the project face the actual or potential for organizational change? You may be thinking, “Hey, we have a project manager for all this?” OK, how is that working out? Let me be crystal clear here, this is not just project management! This is about improving decision-making! A Gannt Chart or task management software won’t do that. You must be the agent of change. With that, let’s turn our attention to the process itself. Exploring the process Any team can take the methods elaborated on below and incorporate them to their specific situation with their specific business partners.  If executed properly, one can expect the initial investment in time and effort to provide substantial payoff within weeks of initiating the process. There are just four steps to incorporate with each having several tasks for you and your team members to complete. The four steps are as follows: Project kick-off Project analysis Synchronization exercise Project execution Let’s cover each of these in detail. I will provide what I like to refer to as a “Quad Chart” for each process step along with appropriate commentary. Project kick-off I recommend you lead the kick-off meeting to ensure all team members understand and agree to the upcoming process steps. You should place emphasis on the importance of completing the pre-work and understanding of key definitions, particularly around facts and critical assumptions. The operational definitions are as follows: Facts: Data or information that will likely have an impact on the project Critical assumptions: Valid and necessary suppositions in the absence of facts that, if proven false, would adversely impact planning or execution It is an excellent practice to link facts and assumptions. Here is an example of how that would work: It is a FACT that the Information Technology is beta-testing cloud-based solutions. We must ASSUME for planning purposes, that we can operate machine learning solutions on the cloud by the fourth quarter of this year. See, we’ve linked a fact and an assumption together and if this cloud-based solution is not available, let’s say it would negatively impact our ability to scale-up our machine learning solutions. If so, then you may want to have a contingency plan of some sort already thought through and prepared for implementation. Don’t worry if you haven’t thought of all possible assumptions or if you end up with a list of dozens. The synchronization exercise will help in identifying and prioritizing them. In my experience, identifying and tracking 10 critical assumptions at the project level is adequate. The following is the quad chart for this process step: Figure 1: Project kick-off quad chart Notice what is likely a new term, “Synchronization Matrix”. That is merely the tool used by the team to capture notes during the Synchronization Exercise. 
What you are doing is capturing time and events on the X-axis, and functions and terms on the Y-axis. Of course, this is highly customizable based on the specific circumstances and we will discuss more about it in process step number 3, that is Synchronization exercise, but here is an abbreviated example: Figure 2: Synchronization matrix example You can see in the matrix that I’ve included a row to capture critical assumptions. I can’t understate how important it is to articulate, capture, and track them. In fact, this is probably my favorite quote on the subject: … flawed assumptions are the most common cause of flawed execution. Harvard Business Review, The High Performance Organization, July-August 2005 OK, I think I’ve made my point, so let’s look at the next process step. Project analysis At this step, the participants prepare by analyzing the situation, collecting data, and making judgements as necessary. The goal is for each participant of the Synchronization Exercise to come to that meeting fully prepared. A good technique is to provide project participants with a worksheet template for them to use to complete the pre-work. A team can complete this step either individually, collectively or both. Here is the quad chart for the process step: Figure 3: Project analysis quad chart Let me expand on a couple of points. The idea of a team member creating information requirements is quite important. These are often tied back to your critical assumptions. Take the example above of the assumption around fielding a cloud-based capability. Can you think of some information requirements that might have as a potential end-user? Furthermore, can you prioritize them? OK, having done that, can you think of a plan to acquire that information and confirm or deny the underlying critical assumption? Notice also how that ties together with decision points you or others may have to make and how they may trigger contingency plans. This may sound rather basic and simplistic, but unless people are asked to think like this, articulate their requirements, share the information don’t expect anything to change anytime soon. It will be business as usual and let me ask again, “how is that working out for you?”. There is opportunity in all that chaos, so embrace it, and in the next step you will see the magic happen. Synchronization exercise The focus and discipline of the participants determine the success of this process step. This is a wargame-type exercise where team members portray their plan over time. Now, everyone gets to see how their plan relates to or even inhibits someone else’s plan and vice versa. I’ve done this step several different ways, including building the matrix on software, but the method that has consistently produced the best results is to build the matrix on large paper and put it along a conference room wall. Then, have the participants, one at a time, use post-it notes to portray their key events.  For example, the marketing manager gets up to the wall and posts “Marketing Campaign One” in the first time phase, “Marketing Campaign Two” in the final time phase, along with “Propensity Models” in the information requirements block. Iterating by participant and by time/event leads to coordination and cooperation like nothing you’ve ever seen. Another method to facilitate the success of the meeting is to have a disinterested and objective third party “referee” the meeting. This will help to ensure that any issues are captured or resolved and the process products updated accordingly. 
After the exercise, team members can incorporate the findings to their individual plans. This is an example quad chart on the process step: Figure 4: Synchronization exercise quad chart I really like the idea of execution and performance metrics. Here is how to think about them: Execution metrics—are we doing things right? Performance metrics—are we doing the right things? As you see, execution is about plan implementation, while performance metrics are about determining if the plan is making a difference (yes, I know that can be quite a dangerous thing to measure). Finally, we come to the fourth step where everything comes together during the execution of the project plan. Project execution This is a continual step in the process where a team can utilize the synchronization products to maintain situational understanding of the itself, key stakeholders, and the competitive environment. It can determine and how plans are progressing and quickly react to opportunities and threats as necessary. I recommend you update and communicate changes to the documentation on a regular basis. When I was in pharmaceutical forecasting, it was imperative that I end the business week by updating the matrices on SharePoint, which were available to all pertinent team members. The following is the quad chart for this process step: Figure 5: Project execution quad chart Keeping up with the documentation is a quick and simple process for the most part, and by doing so you will keep people aligned and cooperating. Be aware that like everything else that is new in the world, initial exuberance and enthusiasm will start to wane after several weeks. That is fine as long as you keep the documentation alive and maintain systematic communication. You will soon find that behavior is changing without anyone even taking heed, which is probably the best way to actually change behavior. A couple of words of warning. Don’t expect everyone to embrace the process wholeheartedly, which is to say that office politics may create a few obstacles. Often, an individual or even an entire business function will withhold information as “information is power”, and by sharing information they may feel they are losing power. Another issue may rise where some people feel it is needlessly complex or unnecessary. A solution to these problems is to scale back the number of core team members and utilize stakeholder analysis and a communication plan to bring they naysayers slowly into the fold. Change is never easy, but necessary nonetheless. Summary In this article, I’ve covered, at a high-level, a successful and proven process to deliver machine learning projects that will drive business value. I developed it from my numerous years of planning and evaluating military operations, including a one-year stint as a strategic advisor to the Iraqi Oil Police, adapting it to the needs of any organization. Utilizing the Synchronization Process will help any team avoid the common pitfalls of projects and improve efficiency and decision-making. It will help you become an agent of change and create influence in an organization without positional power. Resources for Article: Further resources on this subject: Machine Learning with R [article] Machine Learning Using Spark MLlib [article] Welcome to Machine Learning Using the .NET Framework [article]

Supervised Learning: Classification and Regression

Packt
06 Apr 2017
30 min read
In this article by Alexey Grigorev, author of the book Mastering Java for Data Science, we will look at how to do pre-processing of data in Java and how to do Exploratory Data Analysis inside and outside Java. Now, when we covered the foundation, we are ready to start creating Machine Learning models. First, we start with Supervised Learning. In the supervised settings we have some information attached to each observation – called labels – and we want to learn from it, and predict it for observations without labels. There are two types of labels: the first are discrete and finite, such as True/False or Buy/Sell, and second are continuous, such as salary or temperature. These types correspond to two types of Supervised Learning: Classification and Regression. We will talk about them in this article. This article covers: Classification problems Regression Problems Evaluation metrics for each type Overview of available implementations in Java (For more resources related to this topic, see here.) Classification In Machine Learning, the classification problem deals with discrete targets with finite set of possible values. The Binary Classification is the most common type of classification problems: the target variable can have only two possible values, such as True/False, Relevant/Not Relevant, Duplicate/Not Duplicate, Cat/Dog and so on. Sometimes the target variable can have more than two outcomes, for example: colors, category of an item, model of a car, and so on, and we call it Multi-Class Classification. Typically each observation can only have one label, but in some settings an observation can be assigned several values. Typically, Multi-Class Classification can be converted to a set of Binary Classification problems, which is why we will mostly concentrate on Binary Classification. Binary Classification Models There are many models for solving the binary classification problem and it is not possible to cover all of them. We will briefly cover the ones that are most often used in practice. They include: Logistic Regression Support Vector Machines Decision Trees Neural Networks We assume that you are already familiar with these methods. Deep familiarity is not required, but for more information you can check the following book: An Introduction to Statistical Learning by G. James, D. Witten, T. Hastie and R. Tibshirani Python Machine Learning by S. Raschka When it comes to libraries, we will cover the following: Smile, JSAT, LIBSVM and LIBLINEAR and Encog. Smile Smile (Statistical Machine Intelligence and Learning Engine) is a library with a large set of classification and other machine learning algorithms. You can see the full list here: https://github.com/haifengl/smile. The library is available on Maven Central and the latest version at the moment of writing is 1.1.0. To include it to your project, add the following dependency: <dependency> <groupId>com.github.haifengl</groupId> <artifactId>smile-core</artifactId> <version>1.1.0</version> </dependency> It is being actively developed; new features and bug fixes are added quite often, but not released as frequently. We recommend to use the latest available version of Smile, and to get it, you will need to build it from the sources. To do it: Install sbt – a tool for building scala projects. 
You can follow this instruction: http://www.scala-sbt.org/release/docs/Manual-Installation.html Use git to clone the project from https://github.com/haifengl/smile To build and publish the library to local maven repository, run the following command: sbt core/publishM2 The Smile library consists of several sub-modules, such as smile-core, smile-nlp, and smile-plot and so on. So, after building it, add the following dependency to your pom: <dependency> <groupId>com.github.haifengl</groupId> <artifactId>smile-core</artifactId> <version>1.2.0</version> </dependency> The models from Smile expect the data to be in a form of two-dimensional arrays of doubles, and the label information is one dimensional array of integers. For binary models, the values should be 0 or 1. Some models in Smile can handle Multi-Class Classification problems, so it is possible to have more labels. The models are built using the Builder pattern: you create a special class, set some parameters and at the end it returns the object it builds. In Smile this builder class is typically called Trainer, and all models should have a trainer for them. For example, consider training a Random Forest model: double[] X = ...// training data int[] y = ...// 0 and 1 labels RandomForest model = new RandomForest.Trainer() .setNumTrees(100) .setNodeSize(4) .setSamplingRates(0.7) .setSplitRule(SplitRule.ENTROPY) .setNumRandomFeatures(3) .train(X, y); The RandomForest.Trainer class takes in a set of parameters and the training data, and at the end produces the trained Random Forest model. The implementation of Random Forest from Smile has the following parameters: numTrees: number of trees to train in the model. nodeSize: the minimum number of items in the leaf nodes. samplingRate: the ratio of training data used to grow each tree. splitRule: the impurity measure used for selecting the best split. numRandomFeatures: the number of features the model randomly chooses for selecting the best split. Similarly, a logistic regression is trained as follows: LogisticRegression lr = new LogisticRegression.Trainer() .setRegularizationFactor(lambda) .train(X, y); Once we have a model, we can use it for predicting the label of previously unseen items. For that we use the predict method: double[] row = // data int prediction = model.predict(row); This code outputs the most probable class for the given item. However, often we are more interested not in the label itself, but in the probability of having the label. If a model implements the SoftClassifier interface, then it is possible to get these probabilities like this: double[] probs = new double[2]; model.predict(row, probs); After running this code, the probs array will contain the probabilities. JSAT JSAT (Java Statistical Analysis Tool) is another Java library which contains a lot of implementations of common Machine Learning algorithms. You can check the full list of implemented models at https://github.com/EdwardRaff/JSAT/wiki/Algorithms. To include JSAT to a Java project, add this to pom: <dependency> <groupId>com.edwardraff</groupId> <artifactId>JSAT</artifactId> <version>0.0.5</version> </dependency> Unlike Smile, which just takes arrays of doubles, JSAT requires a special wrapper class for data instances. If we have an array, it is converted to the JSAT representation like this: double[][] X = ... // data int[] y = ... 
// labels // change to more classes for more classes for multi-classification CategoricalData binary = new CategoricalData(2); List<DataPointPair<Integer>> data = new ArrayList<>(X.length); for (int i = 0; i < X.length; i++) { int target = y[i]; DataPoint row = new DataPoint(new DenseVector(X[i])); data.add(new DataPointPair<Integer>(row, target)); } ClassificationDataSet dataset = new ClassificationDataSet(data, binary); Once we have prepared the dataset, we can train a model. Let us consider the Random Forest classifier again: RandomForest model = new RandomForest(); model.setFeatureSamples(4); model.setMaxForestSize(150); model.trainC(dataset); First, we set some parameters for the model, and then, at we end, we call the trainC method (which means “train a classifier”). In the JSAT implementation, Random Forest has fewer options for tuning: only the number of features to select and the number of trees to grow. There are several implementations of Logistic Regression. The usual Logistic Regression model does not have any parameters, and it is trained like this: LogisticRegression model = new LogisticRegression(); model.trainC(dataset); If we want to have a regularized model, then we need to use the LogisticRegressionDCD class (DCD stands for “Dual Coordinate Descent” - this is the optimization method used to train the logistic regression). We train it like this: LogisticRegressionDCD model = new LogisticRegressionDCD(); model.setMaxIterations(maxIterations); model.setC(C); model.trainC(fold.toJsatDataset()); In this code, C is the regularization parameter, and the smaller values of C correspond to stronger regularization effect. Finally, for outputting the probabilities, we can do the following: double[] row = // data DenseVector vector = new DenseVector(row); DataPoint point = new DataPoint(vector); CategoricalResults out = model.classify(point); double probability = out.getProb(1); The class CategoricalResults contains a lot of information, including probabilities for each class and the most likely label. LIBSVM and LIBLINEAR Next we consider two similar libraries: LIBSVM and LIBLINEAR. LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) is a library with implementation of Support Vector Machine models, which include support vector classifiers. LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) is a library for fast linear classification algorithms such as Liner SVM and Logistic Regression. Both these libraries come from the same research group and have very similar interfaces. LIBSVM is implemented in C++ and has an officially supported java version. It is available on Maven Central: <dependency> <groupId>tw.edu.ntu.csie</groupId> <artifactId>libsvm</artifactId> <version>3.17</version> </dependency> Note that the Java version of LIBSVM is updated not as often as the C++ version. Nevertheless, the version above is stable and should not contain bugs, but it might be slower than its C++ version. To use SVM models from LIBSVM, you first need to specify the parameters. For this, you create a svm_parameter class. Inside, you can specify many parameters, including: the kernel type (RBF, POLY or LINEAR), the regularization parameter C, probability, which you can set to 1 to be able to get probabilities, the svm_type should be set to C_SVC: this tells that the model should be a classifier. 
Here is an example of how you can configure an SVM classifier with the Linear kernel which can output probabilities: svm_parameter param = new svm_parameter(); param.svm_type = svm_parameter.C_SVC; param.kernel_type = svm_parameter.LINEAR; param.probability = 1; param.C = C; // default parameters param.cache_size = 100; param.eps = 1e-3; param.p = 0.1; param.shrinking = 1; The polynomial kernel is specified by the following formula: It has three additional parameters: gamma, coeff0 and degree; and also C – the regularization parameter. You can specify it like this: svm_parameter param = new svm_parameter(); param.svm_type = svm_parameter.C_SVC; param.kernel_type = svm_parameter.POLY; param.C = C; param.degree = degree; param.gamma = 1; param.coef0 = 1; param.probability = 1; // plus defaults from the above Finally, the Gaussian kernel (or RBF) has the following formula:' So there is one parameter gamma, which controls the width of the Gaussians. We can specify the model with the RBF kernel like this: svm_parameter param = new svm_parameter(); param.svm_type = svm_parameter.C_SVC; param.kernel_type = svm_parameter.RBF; param.C = C; param.gamma = gamma; param.probability = 1; // plus defaults from the above Once we created the configuration object, we need to convert the data in the right format. The LIBSVM command line application reads files in the SVMLight format, so the library also expects sparse data representation. For a single array, the conversion is following: double[] dataRow = // single row vector svm_node[] svmRow = new svm_node[dataRow.length]; for (int j = 0; j < dataRow.length; j++) { svm_node node = new svm_node(); node.index = j; node.value = dataRow[j]; svmRow[j] = node; } For a matrix, we do this for every row: double[][] X = ... // data int n = X.length; svm_node[][] nodes = new svm_node[n][]; for (int i = 0; i < n; i++) { nodes[i] = wrapAsSvmNode(X[i]); } Where wrapAsSvmNode is a function, that wraps a vector into an array of svm_node objects. Now we can put the data and the labels together into svm_problem object: double[] y = ... // labels svm_problem prob = new svm_problem(); prob.l = n; prob.x = nodes; prob.y = y; Now using the data and the parameters we can train the SVM model: svm_model model = svm.svm_train(prob, param);  Once the model is trained, we can use it for classifying unseen data. Getting probabilities is done this way: double[][] X = // test data int n = X.length; double[] results = new double[n]; double[] probs = new double[2]; for (int i = 0; i < n; i++) { svm_node[] row = wrapAsSvmNode(X[i]); svm.svm_predict_probability(model, row, probs); results[i] = probs[1]; } Since we used the param.probability = 1, we can use svm.svm_predict_probability method to predict probabilities. Like Smile, the method takes an array of doubles and writes the results there. Then we can get the probabilities there. Finally, while training, LIBSVM outputs a lot of things on the console. If we are not interested in this output, we can disable it with the following code snippet: svm.svm_set_print_string_function(s -> {}); Just add this in the beginning of your code and you will not see this anymore. The next library is LIBLINEAR, which provides very fast and high performing linear classifiers such as SVM with Linear Kernel and Logistic Regression. It can easily scale to tens and hundreds of millions of data points. Unlike LIBSVM, there is no official Java version of LIBLINEAR, but there is unofficial Java port available at http://liblinear.bwaldvogel.de/. 
To use it, include the following: <dependency> <groupId>de.bwaldvogel</groupId> <artifactId>liblinear</artifactId> <version>1.95</version> </dependency> The interface is very similar to LIBSVM. First, you define the parameters: SolverType solverType = SolverType.L1R_LR; double C = 0.001; double eps = 0.0001; Parameter param = new Parameter(solverType, C, eps); We have to specify three parameters here: solverType: defines the model which will be used; C: the amount regularization, the smaller C the stronger the regularization; epsilon: tolerance for stopping the training process. A reasonable default is 0.0001; For classification there are the following solvers: Logistic Regression: L1R_LR or L2R_LR SVM: L1R_L2LOSS_SVC or L2R_L2LOSS_SVC According to the official FAQ (which can be found here: https://www.csie.ntu.edu.tw/~cjlin/liblinear/FAQ.html) you should: Prefer SVM to Logistic Regression as it trains faster and usually gives higher accuracy. Try L2 regularization first unless you need a sparse solution – in this case use L1 The default solver is L2-regularized support linear classifier for the dual problem. If it is slow, try solving the primal problem instead. Then you define the dataset. Like previously, let's first see how to wrap a row: double[] row = // data int m = row.length; Feature[] result = new Feature[m]; for (int i = 0; i < m; i++) { result[i] = new FeatureNode(i + 1, row[i]); } Please note that we add 1 to the index – the 0 is the bias term, so the actual features should start from 1. We can put this into a function wrapRow and then wrap the entire dataset as following: double[][] X = // data int n = X.length; Feature[][] matrix = new Feature[n][]; for (int i = 0; i < n; i++) { matrix[i] = wrapRow(X[i]); } Now we can create the Problem class with the data and labels: double[] y = // labels Problem problem = new Problem(); problem.x = wrapMatrix(X); problem.y = y; problem.n = X[0].length + 1; problem.l = X.length; Note that here we also need to provide the dimensionality of the data, and it's the number of features plus one. We need to add one because it includes the bias term. Now we are ready to train the model: Model model = LibLinear.train(fold, param); When the model is trained, we can use it to classify unseen data. In the following example we will output probabilities: double[] dataRow = // data Feature[] row = wrapRow(dataRow); Linear.predictProbability(model, row, probs); double result = probs[1]; The code above works fine for the Logistic Regression model, but it will not work for SVM: SVM cannot output probabilities, so the code above will throw an error for solvers like L1R_L2LOSS_SVC. What you can do instead is get the raw output: double[] values = new double[1]; Feature[] row = wrapRow(dataRow); Linear.predictValues(model, row, values); double result = values[0]; In this case the results will not contain probability, but some real value. When this value is greater than zero, the model predicts that the class is positive. If we would like to map this value to the [0, 1] range, we can use the sigmoid function for that: public static double[] sigmoid(double[] scores) { double[] result = new double[scores.length]; for (int i = 0; i < result.length; i++) { result[i] = 1 / (1 + Math.exp(-scores[i])); } return result; } Finally, like LIBSVM, LIBLINEAR also outputs a lot of things to standard output. 
If you do not wish to see it, you can mute it with the following code: PrintStream devNull = new PrintStream(new NullOutputStream()); Linear.setDebugOutput(devNull); Here, we use NullOutputStream from Apache IO which does nothing, so the screen stays clean. When to use LIBSVM and when to use LIBLINEAR? For large datasets often it is not possible to use any kernel methods. In this case you should prefer LIBLINEAR. Additionally, LIBLINEAR is especially good for Text Processing purposes such as Document Classification. Encog Finally, we consider a library for training Neural Networks: Encog. It is available on Maven Central and can be added with the following snippet: <dependency> <groupId>org.encog</groupId> <artifactId>encog-core</artifactId> <version>3.3.0</version> </dependency> With this library you first need to specify the network architecture: BasicNetwork network = new BasicNetwork(); network.addLayer(new BasicLayer(new ActivationSigmoid(), true, noInputNeurons)); network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 30)); network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 1)); network.getStructure().finalizeStructure(); network.reset(); Here we create a network with one input layer, one inner layer with 30 neurons and one output layer with 1 neuron. In each layer we use sigmoid as the activation function and add the bias input (the true parameter). The last line randomly initializes the weights in the network. For both input and output Encog expects two-dimensional double arrays. In case of binary classification we typically have a one dimensional array, so we need to convert it: double[][] X = // data double[] y = // labels double[][] y2d = new double[y.length][]; for (int i = 0; i < y.length; i++) { y2d[i] = new double[] { y[i] }; } Once the data is converted, we wrap it into special wrapper class: MLDataSet dataset = new BasicMLDataSet(X, y2d); Then this dataset can be used for training: MLTrain trainer = new ResilientPropagation(network, dataset); double lambda = 0.01; trainer.addStrategy(new RegularizationStrategy(lambda)); int noEpochs = 101; for (int i = 0; i < noEpochs; i++) { trainer.iteration(); } There are a lot of other Machine Learning libraries available in Java. For example Weka, H2O, JavaML and others. It is not possible to cover all of them, but you can also try them and see if you like them more than the ones we have covered. Evaluation We have covered many Machine Learning libraries, and many of them implement the same algorithms like Random Forest or Logistic Regression. Also, each individual model can have many different parameters: a Logistic Regression has the regularization coefficient, an SVM is configured by setting the kernel and its parameters. How do we select the best single model out of so many possible variants? For that we first define some evaluation metric and then select the model which achieves the best possible performance with respect to this metric. For binary classification we can select one of the following metrics: Accuracy and Error Precision, Recall and F1 AUC (AU ROC) Result Evaluation K-Fold Cross Validation Training, Validation and Testing. Accuracy Accuracy tells us for how many examples the model predicted the correct label. Calculating it is trivial: int n = actual.length; double[] proba = // predictions; double[] prediction = Arrays.stream(proba).map(p -> p > threshold ? 
1.0 : 0.0).toArray(); int correct = 0; for (int i = 0; i < n; i++) { if (actual[i] == prediction[i]) { correct++; } } double accuracy = 1.0 * correct / n; Accuracy is the simplest evaluation metric and everybody understands it. Precision, Recall and F1 In some cases accuracy is not the best measure of model performance. For example, suppose we have an unbalanced dataset: there are only 1% of examples that are positive. Then a model which always predict negative is right in 99% cases, and hence will have accuracy of 0.99. But this model is not useful. There are alternatives to accuracy that can overcome this problem. Precision and Recall are among these metrics: they both look at the fraction of positive items that the model correctly recognized. They can be calculated using the Confusion Matrix: a table which summarizes the performance of a binary classifier: Precision is the fraction of correctly predicted positive items among all items the model predicted positive. In terms of the confusion matrix, Precision is TP / (TP + FP). Recall is the fraction of correctly predicted positive items among items that are actually positive. With values from the confusion matrix, Recall is TP / (TP + FN). It is often hard to decide whether one should optimize Precision or Recall. But there is another metric which combines both Precision and Recall into one number, and it is called F1 score. For calculating Precision and Recall, we first need to calculate the values for the cells of the confusion matrix: int tp = 0, tn = 0, fp = 0, fn = 0; for (int i = 0; i < actual.length; i++) { if (actual[i] == 1.0 && proba[i] > threshold) { tp++; } else if (actual[i] == 0.0 && proba[i] <= threshold) { tn++; } else if (actual[i] == 0.0 && proba[i] > threshold) { fp++; } else if (actual[i] == 1.0 && proba[i] <= threshold) { fn++; } } Then we can use the values to calculate Precision and Recall: double precision = 1.0 * tp / (tp + fp); double recall = 1.0 * tp / (tp + fn); Finally, F1 can be calculated using the following formula: double f1 = 2 * precision * recall / (precision + recall); ROC and AU ROC (AUC) The metrics above are good for binary classifiers which produce hard output: they only tell if the class should be assigned a positive label or negative. If instead our model outputs some score such that the higher the values of the score the more likely the item is to be positive, then the binary classifier is called a ranking classifier. Most of the models can output probabilities of belonging to a certain class, and we can use it to rank examples such that the positive are likely to come first. The ROC Curve visually tells us how good a ranking classifier separates positive examples from negative ones. The way a ROC curve is build is the following: We sort the observations by their score and then starting from the origin we go up if the observation is positive and right if it is negative. This way, in the ideal case, we first always go up, and then always go right – and this will result in the best possible ROC curve. In this case we can say that the separation between positive and negative examples is perfect. If the separation is not perfect, but still OK, the curve will go up for positive examples, but sometimes will turn right when a misclassification occurred. Finally, a bad classifier will not be able to tell positive and negative examples apart and the curve would alternate between going up and right. 
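To make the construction concrete, here is a minimal Java sketch of the sort-and-walk procedure just described (my own illustration, not code from the book); it assumes all scores are distinct:

// requires java.util.Arrays
static double[][] rocCurvePoints(double[] actual, double[] proba) {
    int n = actual.length;

    // rank the examples by score, highest score first
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
        order[i] = i;
    }
    Arrays.sort(order, (a, b) -> Double.compare(proba[b], proba[a]));

    double pos = 0.0, neg = 0.0;
    for (double label : actual) {
        if (label == 1.0) pos++; else neg++;
    }

    // walk from the origin: a step up for each positive, a step right for each negative
    double[][] points = new double[n + 1][2];
    double x = 0.0, y = 0.0;
    for (int i = 0; i < n; i++) {
        if (actual[order[i]] == 1.0) {
            y = y + 1.0 / pos;
        } else {
            x = x + 1.0 / neg;
        }
        points[i + 1][0] = x;
        points[i + 1][1] = y;
    }
    // points start at (0, 0) and end at (1, 1)
    return points;
}

Plotting the second column against the first gives the ROC curve; the closer it hugs the top-left corner, the better the separation.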
The diagonal line on the plot represents the baseline – the performance that a random classifier would achieve. The further away the curve from the baseline, the better. Unfortunately, there is no available easy-to-use implementation of ROC curves in Java. So the algorithm for drawing a ROC curve is the following: Let POS be number of positive labels, and NEG be the number of negative labels Order data by the score, decreasing Start from (0, 0) For each example in the sorted order, if the example is positive, move 1 / POS up in the graph, otherwise, move 1 / NEG right in the graph. This is a simplified algorithm and assumes that the scores are distinct. If the scores aren't distinct, and there are different actual labels for the same score, some adjustment needs to be made. It is implemented in the class RocCurve which you will find in the source code. You can use it as following: RocCurve.plot(actual, prediction); Calling it will create a plot similar to this one: The area under the curve says how good the separation is. If the separation is very good, then the area will be close to one. But if the classifier cannot distinguish between positive and negative examples, the curve will go around the random baseline curve, and the area will be close to 0.5. Area Under the Curve is often abbreviated as AUC, or, sometimes, AU ROC – to emphasize that the Curve is a ROC Curve. AUC has a very nice interpretation: the value of AUC corresponds to probability that a randomly selected positive example is scored higher than a randomly selected negative example. Naturally, if this probability is high, our classifier does a good job separating positive and negative examples. This makes AUC a to-go evaluation metric for many cases, especially when the dataset is unbalanced – in the sense that there are a lot more examples of one class than another. Luckily, there are implementations of AUC in Java. For example, it is implemented in Smile. You can use it like this: double[] predicted = ... // int[] truth = ... // double auc = AUC.measure(truth, predicted); Result Validation When learning from data there is always the danger of overfitting. Overfitting occurs when the model starts learning the noise in the data instead of detecting useful patterns. It is always important to check if a model overfits – otherwise it will not be useful when applied to unseen data. The typical and most practical way of checking whether a model overfits or not is to emulate “unseen data” – that is, take a part of the available labeled data and do not use it for training. This technique is called “hold out”: we hold out a part of the data and use it only for evaluation. Often we shuffle the original data set before splitting. In many cases we make a simplifying assumption that the order of data is not important – that is, one observation has no influence on another. In this case shuffling the data prior to splitting will remove effects that the order of items might have. On the other hand, if the data is a Time Series data, then shuffling it is not a good idea, because there is some dependence between observations. So, let us implement the hold out split. We assume that the data that we have is already represented by X – a two-dimensional array of doubles with features and y – a one-dimensional array of labels. 
The area under the curve says how good the separation is. If the separation is very good, then the area will be close to one. But if the classifier cannot distinguish between positive and negative examples, the curve will hover around the random baseline, and the area will be close to 0.5. The Area Under the Curve is often abbreviated as AUC, or, sometimes, AU ROC – to emphasize that the curve is a ROC Curve.

AUC has a very nice interpretation: the value of AUC corresponds to the probability that a randomly selected positive example is scored higher than a randomly selected negative example. Naturally, if this probability is high, our classifier does a good job of separating positive and negative examples. This makes AUC a go-to evaluation metric for many cases, especially when the dataset is unbalanced – in the sense that there are a lot more examples of one class than of another.

Luckily, there are implementations of AUC in Java. For example, it is implemented in Smile. You can use it like this:

double[] predicted = ... // the scores produced by the model
int[] truth = ...        // the actual labels
double auc = AUC.measure(truth, predicted);

Result Validation

When learning from data there is always the danger of overfitting. Overfitting occurs when the model starts learning the noise in the data instead of detecting useful patterns. It is always important to check if a model overfits – otherwise it will not be useful when applied to unseen data.

The typical and most practical way of checking whether a model overfits or not is to emulate "unseen data" – that is, take a part of the available labeled data and do not use it for training. This technique is called "hold out": we hold out a part of the data and use it only for evaluation.

Often we shuffle the original data set before splitting. In many cases we make a simplifying assumption that the order of the data is not important – that is, one observation has no influence on another. In this case shuffling the data prior to splitting will remove effects that the order of items might have. On the other hand, if the data is Time Series data, then shuffling it is not a good idea, because there is some dependence between observations.

So, let us implement the hold out split. We assume that the data we have is already represented by X – a two-dimensional array of doubles with features – and y – a one-dimensional array of labels.

First, let us create a helper class for holding the data:

public class Dataset {
    private final double[][] X;
    private final double[] y;
    // constructor and getters are omitted
}

Splitting our dataset should produce two datasets, so let us create a class for that as well:

public class Split {
    private final Dataset train;
    private final Dataset test;
    // constructor and getters are omitted
}

Now suppose we want to split the data into two parts: train and test. We also want to specify the size of the split; we will do it using a testRatio parameter: the fraction of items that should go to the test set. So the first thing we do is generate an array with indexes and then split it according to testRatio:

int[] indexes = IntStream.range(0, dataset.length()).toArray();
int trainSize = (int) (indexes.length * (1 - testRatio));
int[] trainIndex = Arrays.copyOfRange(indexes, 0, trainSize);
int[] testIndex = Arrays.copyOfRange(indexes, trainSize, indexes.length);

We can also shuffle the indexes if we want:

Random rnd = new Random(seed); // seed is any long value, fixed for reproducibility
for (int i = indexes.length - 1; i > 0; i--) {
    int index = rnd.nextInt(i + 1);
    int tmp = indexes[index];
    indexes[index] = indexes[i];
    indexes[i] = tmp;
}

Then we can select instances for the training set as follows:

int trainSize = trainIndex.length;
double[][] trainX = new double[trainSize][];
double[] trainY = new double[trainSize];
for (int i = 0; i < trainSize; i++) {
    int idx = trainIndex[i];
    trainX[i] = X[idx];
    trainY[i] = y[idx];
}

And then finally wrap it into our Dataset class:

Dataset train = new Dataset(trainX, trainY);

If we repeat the same for the test set, we can put both train and test sets into a Split object:

Split split = new Split(train, test);

And now we can use the train part for training and the test part for testing the models. If we put all the code above into a function of the Dataset class, for example trainTestSplit, we can use it as follows:

Split split = dataset.trainTestSplit(0.2);
Dataset train = split.getTrain();
// train the model using train.getX() and train.getY()
Dataset test = split.getTest();
// test the model using test.getX() and test.getY()

K-Fold Cross Validation

Holding out only one part of the data may not always be the best option. What we can do instead is split it into K parts and then test the models on only 1/Kth of the data. This is called K-Fold Cross-Validation: it not only gives the performance estimation, but also the possible spread of the error. Typically we are interested in models which give good and consistent performance. K-Fold Cross-Validation helps us select such models.

The way we prepare the data for K-Fold Cross-Validation is the following:

1. First, split the data into K parts
2. Then, for each of these parts:
   - Take one part as the validation set
   - Take the remaining K-1 parts as the training set

If we translate this into Java, the first step will look like this:

int[] indexes = IntStream.range(0, dataset.length()).toArray();
int[][] foldIndexes = new int[k][];

int step = indexes.length / k;
int beginIndex = 0;

for (int i = 0; i < k - 1; i++) {
    foldIndexes[i] = Arrays.copyOfRange(indexes, beginIndex, beginIndex + step);
    beginIndex = beginIndex + step;
}

foldIndexes[k - 1] = Arrays.copyOfRange(indexes, beginIndex, indexes.length);

This creates an array of indexes for each fold. You can also shuffle the indexes array as before.
Now we can create splits from each fold:

List<Split> result = new ArrayList<>();
for (int i = 0; i < k; i++) {
    int[] testIdx = folds[i]; // folds is the foldIndexes array created above
    int[] trainIdx = combineTrainFolds(folds, indexes.length, i);
    result.add(Split.fromIndexes(dataset, trainIdx, testIdx));
}

In the code above we have two additional methods: combineTrainFolds takes K-1 arrays of indexes and combines them into one, and Split.fromIndexes creates a split given the train and test indexes. We have already covered the second function when we created a simple hold-out test set. The first function, combineTrainFolds, is implemented like this:

private static int[] combineTrainFolds(int[][] folds, int totalSize, int excludeIndex) {
    int size = totalSize - folds[excludeIndex].length;
    int result[] = new int[size];

    int start = 0;
    for (int i = 0; i < folds.length; i++) {
        if (i == excludeIndex) {
            continue;
        }
        int[] fold = folds[i];
        System.arraycopy(fold, 0, result, start, fold.length);
        start = start + fold.length;
    }

    return result;
}

Again, we can put the code above into a function of the Dataset class and call it as follows:

List<Split> folds = train.kfold(3);

Now that we have a list of Split objects, we can create a special function for performing Cross-Validation:

public static DescriptiveStatistics crossValidate(List<Split> folds, Function<Dataset, Model> trainer) {
    double[] aucs = folds.parallelStream().mapToDouble(fold -> {
        Dataset foldTrain = fold.getTrain();
        Dataset foldValidation = fold.getTest();
        Model model = trainer.apply(foldTrain);
        return auc(model, foldValidation);
    }).toArray();

    return new DescriptiveStatistics(aucs);
}

This function takes a list of folds and a callback which creates a model. Then, after the model is trained, we calculate AUC for it. Additionally, we take advantage of Java's ability to parallelize loops and train models on each fold at the same time. Finally, we put the AUCs calculated on each fold into a DescriptiveStatistics object, which can later be used to return the mean and the standard deviation of the AUCs. As you probably remember, the DescriptiveStatistics class comes from the Apache Commons Math library.

Let us consider an example. Suppose we want to use Logistic Regression from LIBLINEAR and select the best value for the regularization parameter C. We can use the function above this way:

double[] Cs = { 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0 };
for (double C : Cs) {
    DescriptiveStatistics summary = crossValidate(folds, fold -> {
        Parameter param = new Parameter(SolverType.L1R_LR, C, 0.0001);
        return LibLinear.train(fold, param);
    });

    double mean = summary.getMean();
    double std = summary.getStandardDeviation();
    System.out.printf("L1 logreg C=%7.3f, auc=%.4f ± %.4f%n", C, mean, std);
}

Here, LibLinear.train is a helper method which takes a Dataset object and a Parameter object and then trains a LIBLINEAR model. This will print the AUC for all provided values of C, so you can see which one is the best, and pick the one with the highest mean AUC.
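If you want to pick that value programmatically rather than by eye, one possible variation of the loop above (again, an illustrative sketch rather than the book's code) keeps track of the best mean AUC seen so far:

double bestC = Double.NaN;
double bestAuc = Double.NEGATIVE_INFINITY;

for (double C : Cs) {
    DescriptiveStatistics summary = crossValidate(folds, fold -> {
        Parameter param = new Parameter(SolverType.L1R_LR, C, 0.0001);
        return LibLinear.train(fold, param);
    });

    double mean = summary.getMean();
    if (mean > bestAuc) {
        bestAuc = mean;
        bestC = C;
    }
}

System.out.printf("best C=%.3f with mean AUC=%.4f%n", bestC, bestAuc);

The winning value of C is then the one to use when retraining the final model on the full training set.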
Training, Validation and Testing

When doing Cross-Validation there is still a danger of overfitting. Since we try a lot of different experiments on the same validation set, we might accidentally pick the model which just happened to do well on the validation set – but it may later fail to generalize to unseen data.

The solution to this problem is to hold out a test set at the very beginning and not touch it at all until we have selected what we think is the best model; we use it only for evaluating that final model.

So how do we select the best model? What we can do is perform Cross-Validation on the remaining train data. It can be either hold out or K-Fold Cross-Validation. In general you should prefer doing K-Fold Cross-Validation because it also gives you the spread of performance, and you may use that for model selection as well.

The typical workflow should be the following:

(0) Select some metric for validation, e.g. accuracy or AUC
(1) Split all the data into train and test sets
(2) Split the training data further and hold out a validation dataset, or split it into K folds
(3) Use the validation data for model selection and parameter optimization
(4) Select the best model according to the validation set and evaluate it against the held-out test set

It is important to avoid looking at the test set too often. It should be used only occasionally for final evaluation to make sure the selected model does not overfit. If the validation scheme is set up properly, the validation score should correspond to the final test score. If this happens, we can be sure that the model does not overfit and is able to generalize to unseen data.

Using the classes and the code we created previously, it translates to the following Java code:

Dataset data = new Dataset(X, y);
Split split = data.trainTestSplit(0.2); // hold out part of the data for the final test

Dataset train = split.getTrain();
List<Split> folds = train.kfold(3);
// now use crossValidate(folds, ...) to select the best model

Dataset test = split.getTest();
// do final evaluation of the best model on test

With this information we are ready to do a project on Binary Classification.

Summary

In this article we spoke about supervised machine learning and about two common supervised problems: Classification and Regression. We also covered the libraries which implement the commonly used algorithms, and saw how to evaluate the performance of these algorithms. There is another family of Machine Learning algorithms that do not require the label information: these methods are called Unsupervised Learning.

Resources for Article:

Further resources on this subject:

Supervised Machine Learning [article]
Clustering and Other Unsupervised Learning Methods [article]
Data Clustering [article]


Learning Cassandra

Packt
06 Apr 2017
17 min read
In this article by Sandeep Yarabarla, the author of the book Learning Apache Cassandra - Second Edition, we assume you have built products using relational databases like MySQL and PostgreSQL, and have perhaps experimented with NoSQL databases including a document store like MongoDB or a key-value store like Redis. While each of these tools has its strengths, you will now consider whether a distributed database like Cassandra might be the best choice for the task at hand.

In this article, we'll begin with the need for NoSQL databases to cope with the conundrum of ever-growing data. We will see why NoSQL databases are becoming the de facto choice for big data and real-time web applications. We will also talk about the major reasons to choose Cassandra from among the many database options available to you. Having established that Cassandra is a great choice, we'll go through the nuts and bolts of getting a local Cassandra installation up and running.

(For more resources related to this topic, see here.)

What is Big Data

Big Data is a relatively new term which has been gathering steam over the past few years. Big Data is a term used for data sets that are too large to be stored in a traditional database system or processed by traditional data processing pipelines. This data could be structured, semi-structured or unstructured data. The data sets that belong to this category usually scale to terabytes or petabytes of data.

Big Data usually involves one or more of the following:

Velocity: Data moves at an unprecedented speed and must be dealt with in a timely manner.

Volume: Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. This could involve terabytes to petabytes of data. In the past, storing it would've been a problem – but new technologies have eased the burden.

Variety: Data comes in all sorts of formats, ranging from structured data to be stored in traditional databases to unstructured data (blobs) such as images, audio files and text files.

These are known as the 3 Vs of Big Data. In addition to these, we tend to associate another term with Big Data:

Complexity: Today's data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems. However, it's necessary to connect and correlate relationships, hierarchies and multiple data linkages, or your data can quickly spiral out of control. It must be able to traverse multiple data centers, cloud and geographical zones.

Challenges of modern applications

Before we delve into the shortcomings of relational systems in handling Big Data, let's take a look at some of the challenges faced by modern web-facing and Big Data applications. Later, this will give an insight into how NoSQL data stores, or Cassandra in particular, help solve these issues.

One of the most important challenges faced by a web-facing application is that it should be able to handle a large number of concurrent users. Think of a search engine like Google, which handles millions of concurrent users at any given point of time, or a large online retailer. The response from these applications should be swift even as the number of users keeps growing.

Modern applications need to be able to handle large amounts of data, which can scale to several petabytes of data and beyond.
Consider a large social network with a few hundred million users:

- Think of the amount of data generated in tracking and managing those users
- Think of how this data can be used for analytics

Business-critical applications should continue running without much impact even when there is a system failure or multiple system failures (server failure, network failure, and so on). The applications should be able to handle failures gracefully without any data loss or interruptions.

These applications should be able to scale across multiple data centers and geographical regions to support customers from different regions around the world with minimum delay.

Modern applications should implement fully distributed architectures and should be capable of scaling out horizontally to support any data size or any number of concurrent users.

Why not relational

Relational database systems (RDBMS) have been the primary data store for enterprise applications for 20 years. Lately, NoSQL databases have been picking up a lot of steam and businesses are slowly seeing a shift towards non-relational databases. There are a few reasons why relational databases don't seem like a good fit for modern web applications.

Relational databases are not designed for clustered solutions. There are some solutions that shard data across servers, but these are fragile, complex and generally don't work well.

Sharding solutions implemented by RDBMS

MySQL's product MySQL Cluster provides clustering support which adds many capabilities of non-relational systems. It is actually a NoSQL solution that integrates with the MySQL relational database. It partitions the data onto multiple nodes and the data can be accessed via different APIs.

Oracle provides a clustering solution, Oracle RAC, which involves multiple nodes running an Oracle process accessing the same database files. This creates a single point of failure as well as resource limitations in accessing the database itself.

Relational databases are also not a good fit for current hardware and architectures. They are usually scaled up by using larger machines with more powerful hardware, and maybe clustering and replication among a small number of nodes. Their core architecture is not a good fit for commodity hardware and thus doesn't work with scale-out architectures.

Scale out vs Scale up architecture

Scaling out means adding more nodes to a system, such as adding more servers to a distributed database or filesystem. This is also known as horizontal scaling. Scaling up means adding more resources to a single node within the system, such as adding more CPU, memory or disks to a server. This is also known as vertical scaling.

How to handle Big Data

Now that we are convinced the relational model is not a good fit for Big Data, let's try to figure out ways to handle Big Data. These are the solutions that paved the way for various NoSQL databases.

Clustering: The data should be spread across different nodes in a cluster, and should be replicated across multiple nodes in order to sustain node failures. This spreads the data across the cluster, with different nodes containing different subsets of data. This improves performance and provides fault tolerance.

A node is an instance of database software running on a server. Multiple instances of the same database could be running on the same server.

Flexible schema: Schemas should be flexible, unlike the relational model, and should evolve with data.
Relax consistency: We should embrace the concept of eventual consistency, which means data will eventually be propagated to all the nodes in the cluster (in the case of replication). Eventual consistency allows data replication across nodes with minimum overhead. This allows for fast writes without the need for distributed locking.

De-normalization of data: De-normalize data to optimize queries. This has to be done at the cost of writing and maintaining multiple copies of the same data.

What is Cassandra and why Cassandra

Cassandra is a fully distributed, masterless database, offering superior scalability and fault tolerance to traditional single-master databases. Compared with other popular distributed databases like Riak, HBase, and Voldemort, Cassandra offers a uniquely robust and expressive interface for modeling and querying data. What follows is an overview of several desirable database capabilities, with accompanying discussions of what Cassandra has to offer in each category.

Horizontal scalability

Horizontal scalability refers to the ability to expand the storage and processing capacity of a database by adding more servers to a database cluster. A traditional single-master database's storage capacity is limited by the capacity of the server that hosts the master instance. If the data set outgrows this capacity, and a more powerful server isn't available, the data set must be sharded among multiple independent database instances that know nothing of each other. Your application bears responsibility for knowing to which instance a given piece of data belongs.

Cassandra, on the other hand, is deployed as a cluster of instances that are all aware of each other. From the client application's standpoint, the cluster is a single entity; the application need not know, nor care, which machine a piece of data belongs to. Instead, data can be read from or written to any instance in the cluster, referred to as a node; this node will forward the request to the instance where the data actually belongs. The result is that Cassandra deployments have an almost limitless capacity to store and process data. When additional capacity is required, more machines can simply be added to the cluster. When new machines join the cluster, Cassandra takes care of rebalancing the existing data so that each node in the expanded cluster has a roughly equal share.

Also, the performance of a Cassandra cluster is directly proportional to the number of nodes within the cluster. As you keep adding instances, the read and write throughput will keep increasing proportionally.

Cassandra is one of several popular distributed databases inspired by the Dynamo architecture, originally published in a paper by Amazon. Other widely used implementations of Dynamo include Riak and Voldemort. You can read the original paper at http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf

High availability

The simplest database deployments are run as a single instance on a single server. This sort of configuration is highly vulnerable to interruption: if the server is affected by a hardware failure or network connection outage, the application's ability to read and write data is completely lost until the server is restored. If the failure is catastrophic, the data on that server might be lost completely.

Master-slave architectures improve this picture a bit. The master instance receives all write operations, and then these operations are replicated to follower instances.
The application can read data from the master or any of the follower instances, so a single host becoming unavailable will not prevent the application from continuing to read data. A failure of the master, however, will still prevent the application from performing any write operations, so while this configuration provides high read availability, it doesn't completely provide high availability.

Cassandra, on the other hand, has no single point of failure for reading or writing data. Each piece of data is replicated to multiple nodes, but none of these nodes holds the authoritative master copy. All the nodes in a Cassandra cluster are peers without a master node. If a machine becomes unavailable, Cassandra will continue writing data to the other nodes that share data with that machine, and will queue the operations and update the failed node when it rejoins the cluster. This means that in a typical configuration, multiple nodes must fail simultaneously for there to be any application-visible interruption in Cassandra's availability.

How many copies?

When you create a keyspace – Cassandra's version of a database – you specify how many copies of each piece of data should be stored; this is called the replication factor. A replication factor of 3 is a common choice for most use cases.

Write optimization

Traditional relational and document databases are optimized for read performance. Writing data to a relational database will typically involve making in-place updates to complicated data structures on disk, in order to maintain a data structure that can be read efficiently and flexibly. Updating these data structures is a very expensive operation from the standpoint of disk I/O, which is often the limiting factor for database performance. Since writes are more expensive than reads, you'll typically avoid any unnecessary updates to a relational database, even at the expense of extra read operations.

Cassandra, on the other hand, is highly optimized for write throughput, and in fact never modifies data on disk; it only appends to existing files or creates new ones. This is much easier on disk I/O and means that Cassandra can provide astonishingly high write throughput. Since both writing data to Cassandra and storing data in Cassandra are inexpensive, denormalization carries little cost and is a good way to ensure that data can be efficiently read in various access scenarios.

Because Cassandra is optimized for write volume, you shouldn't shy away from writing data to the database. In fact, it's most efficient to write without reading whenever possible, even if doing so might result in redundant updates. Just because Cassandra is optimized for writes doesn't make it bad at reads; in fact, a well-designed Cassandra database can handle very heavy read loads with no problem.

Structured records

The first three database features we looked at are commonly found in distributed data stores. However, databases like Riak and Voldemort are purely key-value stores; these databases have no knowledge of the internal structure of a record that's stored at a particular key. This means useful functions like updating only part of a record, reading only certain fields from a record, or retrieving records that contain a particular value in a given field are not possible.
Relational databases like PostgreSQL, document stores like MongoDB, and, to a limited extent, newer key-value stores like Redis do have a concept of the internal structure of their records, and most application developers are accustomed to taking advantage of the possibilities this allows. None of these databases, however, offer the advantages of a masterless distributed architecture.

In Cassandra, records are structured much in the same way as they are in a relational database – using tables, rows, and columns. Thus, applications using Cassandra can enjoy all the benefits of masterless distributed storage while also getting all the advanced data modeling and access features associated with structured records.

Secondary indexes

A secondary index, commonly referred to as an index in the context of a relational database, is a structure allowing efficient lookup of records by some attribute other than their primary key. This is a widely useful capability: for instance, when developing a blog application, you would want to be able to easily retrieve all of the posts written by a particular author. Cassandra supports secondary indexes; while Cassandra's version is not as versatile as indexes in a typical relational database, it's a powerful feature in the right circumstances.

Materialized views

Data modelling principles in Cassandra compel us to denormalize data as much as possible. Prior to Cassandra 3.0, the only way to query on a non-primary key column was to create a secondary index and query on it. However, secondary indexes have a performance tradeoff if they contain high-cardinality data. Often, high-cardinality secondary indexes have to scan data on all the nodes and aggregate it to return the query results. This defeats the purpose of having a distributed system.

To avoid secondary indexes and client-side denormalization, Cassandra introduced the feature of materialized views, which perform server-side denormalization. You can create views for a base table, and Cassandra ensures eventual consistency between the base and the view. This lets us do very fast lookups on each view following the normal Cassandra read path. Materialized views maintain a correspondence of one CQL row each in the base and the view, so we need to ensure that each CQL row which is required for the views will be reflected in the base table's primary keys. Although materialized views allow for fast lookups on non-primary key columns, this comes at a performance hit to writes.

Efficient result ordering

It's quite common to want to retrieve a record set ordered by a particular field; for instance, a photo sharing service will want to retrieve the most recent photographs in descending order of creation. Since sorting data on the fly is a fundamentally expensive operation, databases must keep information about record ordering persisted on disk in order to efficiently return results in order. In a relational database, this is one of the jobs of a secondary index.

In Cassandra, secondary indexes can't be used for result ordering, but tables can be structured such that rows are always kept sorted by a given column or columns, called clustering columns. Sorting by arbitrary columns at read time is not possible, but the capacity to efficiently order records in any way, and to retrieve ranges of records based on this ordering, is an unusually powerful capability for a distributed database.
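As a small illustration of clustering columns, the photo-sharing case above could be modeled so that each user's photos are kept sorted by creation time, newest first. The table and column names in this CQL sketch are made up for this example; they are not from the book:

CREATE TABLE user_photos (
    username   text,
    created_at timeuuid,
    photo_url  text,
    caption    text,
    PRIMARY KEY (username, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

-- the ten most recent photos for one user, returned already in order
SELECT photo_url, caption
FROM user_photos
WHERE username = 'alice'
LIMIT 10;

Because created_at is a clustering column, rows within each partition are stored in that order on disk, so no sorting is needed at read time.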
Immediate consistency

When we write a piece of data to a database, it is our hope that that data is immediately available to any other process that may wish to read it. From another point of view, when we read some data from a database, we would like to be guaranteed that the data we retrieve is the most recently updated version. This guarantee is called immediate consistency, and it's a property of most common single-master databases like MySQL and PostgreSQL.

Distributed systems like Cassandra typically do not provide an immediate consistency guarantee. Instead, developers must be willing to accept eventual consistency, which means when data is updated, the system will reflect that update at some point in the future. Developers are willing to give up immediate consistency precisely because it is a direct tradeoff with high availability.

In the case of Cassandra, that tradeoff is made explicit through tunable consistency. Each time you design a write or read path for data, you have the option of immediate consistency with less resilient availability, or eventual consistency with extremely resilient availability.

Discretely writable collections

While it's useful for records to be internally structured into discrete fields, a given property of a record isn't always a single value like a string or an integer. One simple way to handle fields that contain collections of values is to serialize them using a format like JSON, and then save the serialized collection into a text field. However, in order to update collections stored in this way, the serialized data must be read from the database, decoded, modified, and then written back to the database in its entirety. If two clients try to perform this kind of modification to the same record concurrently, one of the updates will be overwritten by the other.

For this reason, many databases offer built-in collection structures that can be discretely updated: values can be added to, and removed from, collections without reading and rewriting the entire collection. Neither the client nor Cassandra itself needs to read the current state of the collection in order to update it, meaning collection updates are also blazingly efficient.

Relational joins

In real-world applications, different pieces of data relate to each other in a variety of ways. Relational databases allow us to perform queries that make these relationships explicit, for instance, to retrieve a set of events whose location is in the state of New York (this is assuming events and locations are different record types). Cassandra, however, is not a relational database, and does not support anything like joins. Instead, applications using Cassandra typically denormalize data and make clever use of clustering in order to perform the sorts of data access that would use a join in a relational database.

For datasets that aren't already denormalized, applications can also perform client-side joins, which mimic the behavior of a relational database by performing multiple queries and joining the results at the application level. Client-side joins are less efficient than reading data that has been denormalized in advance, but offer more flexibility.

Summary

In this article, you explored the reasons to choose Cassandra from among the many databases available, and having determined that Cassandra is a great choice, you installed it on your development machine. You had your first taste of the Cassandra Query Language when you issued your first few commands through the CQL shell in order to create a keyspace and a table, and to insert and read data.
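Those first commands were along these lines (an illustrative reconstruction with made-up keyspace and table names, not the book's exact listing):

CREATE KEYSPACE my_first_keyspace
WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 1 };
-- replication_factor 1 is enough for a single local node; 3 is the common production choice

USE my_first_keyspace;

CREATE TABLE users (
    username text PRIMARY KEY,
    email    text,
    age      int
);

INSERT INTO users (username, email, age)
VALUES ('alice', 'alice@example.com', 30);

SELECT * FROM users WHERE username = 'alice';

SimpleStrategy is fine for a single-node development setup; clusters spanning multiple data centers typically use NetworkTopologyStrategy instead.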
You're now poised to begin working with Cassandra in earnest.

Resources for Article:

Further resources on this subject:

Setting up a Kubernetes Cluster [article]
Dealing with Legacy Code [article]
Evolving the data model [article]