
How-To Tutorials - Data


Machine Learning with R

Packt
16 Feb 2016
39 min read
If science fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. In the early stages, computers are taught to play simple games of tic-tac-toe and chess. Later, machines are given control of traffic lights and communications, followed by military drones and missiles. The machines' evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then "deleted." Thankfully, at the time of this writing, machines still require user input.

Though your impressions of machine learning may be colored by these mass media depictions, today's algorithms are too application-specific to pose any danger of becoming self-aware. The goal of today's machine learning is not to create an artificial brain, but rather to assist us in making sense of the world's massive data stores. Putting popular misconceptions aside, by the end of this article, you will gain a more nuanced understanding of machine learning. You will also be introduced to the fundamental concepts that define and differentiate the most commonly used machine learning approaches. You will learn:

- The origins and practical applications of machine learning
- How computers turn data into knowledge and action
- How to match a machine learning algorithm to your data

The field of machine learning provides a set of algorithms that transform data into actionable knowledge. Keep reading to see how easy it is to use R to start applying machine learning to real-world problems.

The origins of machine learning

Since birth, we are inundated with data. Our body's sensors—the eyes, ears, nose, tongue, and nerves—are continually assailed with raw data that our brain translates into sights, sounds, smells, tastes, and textures. Using language, we are able to share these experiences with others.

From the advent of written language, human observations have been recorded. Hunters monitored the movement of animal herds, early astronomers recorded the alignment of planets and stars, and cities recorded tax payments, births, and deaths. Today, such observations, and many more, are increasingly automated and recorded systematically in ever-growing computerized databases.

The invention of electronic sensors has additionally contributed to an explosion in the volume and richness of recorded data. Specialized sensors see, hear, smell, taste, and feel. These sensors process the data far differently than a human being would. Unlike a human's limited and subjective attention, an electronic sensor never takes a break and never lets its judgment skew its perception. Although sensors are not clouded by subjectivity, they do not necessarily report a single, definitive depiction of reality. Some have an inherent measurement error due to hardware limitations. Others are limited by their scope. A black and white photograph provides a different depiction of its subject than one shot in color. Similarly, a microscope provides a far different depiction of reality than a telescope.

Between databases and sensors, many aspects of our lives are recorded. Governments, businesses, and individuals are recording and reporting information, from the monumental to the mundane.
Weather sensors record temperature and pressure data, surveillance cameras watch sidewalks and subway tunnels, and all manner of electronic behaviors are monitored: transactions, communications, friendships, and many others.

This deluge of data has led some to state that we have entered an era of Big Data, but this may be a bit of a misnomer. Human beings have always been surrounded by large amounts of data. What makes the current era unique is that we have vast amounts of recorded data, much of which can be directly accessed by computers. Larger and more interesting data sets are increasingly accessible at the tips of our fingers, only a web search away. This wealth of information has the potential to inform action, given a systematic way of making sense of it all.

The field of study interested in the development of computer algorithms to transform data into intelligent action is known as machine learning. This field originated in an environment where the available data, statistical methods, and computing power rapidly and simultaneously evolved. Growth in data necessitated additional computing power, which in turn spurred the development of statistical methods to analyze large datasets. This created a cycle of advancement, allowing even larger and more interesting data to be collected.

A closely related sibling of machine learning, data mining, is concerned with the generation of novel insights from large databases. As the name implies, data mining involves a systematic hunt for nuggets of actionable intelligence. Although there is some disagreement over how widely machine learning and data mining overlap, a potential point of distinction is that machine learning focuses on teaching computers how to use data to solve a problem, while data mining focuses on teaching computers to identify patterns that humans then use to solve a problem. Virtually all data mining involves the use of machine learning, but not all machine learning involves data mining. For example, you might apply machine learning to data mine automobile traffic data for patterns related to accident rates; on the other hand, if the computer is learning how to drive the car itself, this is purely machine learning without data mining. The phrase "data mining" is also sometimes used as a pejorative to describe the deceptive practice of cherry-picking data to support a theory.

Uses and abuses of machine learning

Most people have heard of the chess-playing computer Deep Blue—the first to win a game against a world champion—or Watson, the computer that defeated two human opponents on the television trivia game show Jeopardy. Based on these stunning accomplishments, some have speculated that computer intelligence will replace humans in many information technology occupations, just as machines replaced humans in the fields, and robots replaced humans on the assembly line.

The truth is that even as machines reach such impressive milestones, they are still relatively limited in their ability to thoroughly understand a problem. They are pure intellectual horsepower without direction. A computer may be more capable than a human of finding subtle patterns in large databases, but it still needs a human to motivate the analysis and turn the result into meaningful action. Machines are not good at asking questions, or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way the computer can comprehend.
Present-day machine learning algorithms partner with people much like a bloodhound partners with its trainer; the dog's sense of smell may be many times stronger than its master's, but without being carefully directed, the hound may end up chasing its tail.

To better understand the real-world applications of machine learning, we'll now consider some cases where it has been used successfully, some places where it still has room for improvement, and some situations where it may do more harm than good.

Machine learning successes

Machine learning is most successful when it augments, rather than replaces, the specialized knowledge of a subject-matter expert. It works with medical doctors at the forefront of the fight to eradicate cancer, assists engineers and programmers with our efforts to create smarter homes and automobiles, and helps social scientists build knowledge of how societies function. Toward these ends, it is employed in countless businesses, scientific laboratories, hospitals, and governmental organizations. Any organization that generates or aggregates data likely employs at least one machine learning algorithm to help make sense of it.

Though it is impossible to list every use case of machine learning, a survey of recent success stories includes several prominent applications:

- Identification of unwanted spam messages in e-mail
- Segmentation of customer behavior for targeted advertising
- Forecasts of weather behavior and long-term climate changes
- Reduction of fraudulent credit card transactions
- Actuarial estimates of financial damage of storms and natural disasters
- Prediction of popular election outcomes
- Development of algorithms for auto-piloting drones and self-driving cars
- Optimization of energy use in homes and office buildings
- Projection of areas where criminal activity is most likely
- Discovery of genetic sequences linked to diseases

The limits of machine learning

Although machine learning is used widely and has tremendous potential, it is important to understand its limits. Machine learning, at this time, is not in any way a substitute for a human brain. It has very little flexibility to extrapolate outside of the strict parameters it learned and knows no common sense. With this in mind, one should be extremely careful to recognize exactly what the algorithm has learned before setting it loose in real-world settings.

Without a lifetime of past experiences to build upon, computers are also limited in their ability to make simple common sense inferences about logical next steps. Take, for instance, the banner advertisements seen on many websites. These may be served based on the patterns learned by data mining the browsing history of millions of users. According to this data, someone who views websites selling shoes should see advertisements for shoes, and those viewing websites for mattresses should see advertisements for mattresses. The problem is that this becomes a never-ending cycle in which additional shoe or mattress advertisements are served rather than advertisements for shoelaces and shoe polish, or bed sheets and blankets.

Many are familiar with the deficiencies of machine learning's ability to understand or translate language or to recognize speech and handwriting. Perhaps the earliest example of this type of failure is in a 1994 episode of the television show The Simpsons, which showed a parody of the Apple Newton tablet. For its time, the Newton was known for its state-of-the-art handwriting recognition.
Unfortunately for Apple, it would occasionally fail to great effect. The television episode illustrated this through a sequence in which a bully's note to "Beat up Martin" was misinterpreted by the Newton as "Eat up Martha" (a scene from "Lisa on Ice", The Simpsons, 20th Century Fox, 1994).

Machines' ability to understand language has improved enough since 1994 that Google, Apple, and Microsoft are all confident enough to offer virtual concierge services operated via voice recognition. Still, even these services routinely struggle to answer relatively simple questions. Even more, online translation services sometimes misinterpret sentences that a toddler would readily understand. The predictive text feature on many devices has also led to a number of humorous autocorrect fail sites that illustrate the computer's ability to understand basic language but completely misunderstand context.

Some of these mistakes are to be expected, for sure. Language is complicated, with multiple layers of text and subtext, and even human beings sometimes understand the context incorrectly. This said, these types of failures in machines illustrate the important fact that machine learning is only as good as the data it learns from. If the context is not directly implicit in the input data, then just like a human, the computer will have to make its best guess.

Machine learning ethics

At its core, machine learning is simply a tool that assists us in making sense of the world's complex data. Like any tool, it can be used for good or evil. Machine learning may lead to problems when it is applied so broadly or callously that humans are treated as lab rats, automata, or mindless consumers. A process that may seem harmless may lead to unintended consequences when automated by an emotionless computer. For this reason, those using machine learning or data mining would be remiss not to consider the ethical implications of the art.

Due to the relative youth of machine learning as a discipline and the speed at which it is progressing, the associated legal issues and social norms are often quite uncertain and constantly in flux. Caution should be exercised while obtaining or analyzing data in order to avoid breaking laws, violating terms of service or data use agreements, and abusing the trust or violating the privacy of customers or the public.

The informal corporate motto of Google, an organization that collects perhaps more data on individuals than any other, is "don't be evil." While this seems clear enough, it may not be sufficient. A better approach may be to follow the Hippocratic Oath, a medical principle that states "above all, do no harm."

Retailers routinely use machine learning for advertising, targeted promotions, inventory management, or the layout of the items in the store. Many have even equipped checkout lanes with devices that print coupons for promotions based on the customer's buying history. In exchange for a bit of personal data, the customer receives discounts on the specific products he or she wants to buy. At first, this appears relatively harmless. But consider what happens when this practice is taken a little bit further. One possibly apocryphal tale concerns a large retailer in the U.S. that employed machine learning to identify expectant mothers for coupon mailings.
The retailer hoped that if these mothers-to-be received substantial discounts, they would become loyal customers who would later purchase profitable items like diapers, baby formula, and toys. Equipped with machine learning methods, the retailer identified items in the customer purchase history that could be used to predict with a high degree of certainty not only whether a woman was pregnant, but also the approximate timing for when the baby was due.

After the retailer used this data for a promotional mailing, an angry man contacted the chain and demanded to know why his teenage daughter received coupons for maternity items. He was furious that the retailer seemed to be encouraging teenage pregnancy! As the story goes, when the retail chain's manager called to offer an apology, it was the father who ultimately apologized because, after confronting his daughter, he discovered that she was indeed pregnant!

Whether completely true or not, the lesson learned from the preceding tale is that common sense should be applied before blindly applying the results of a machine learning analysis. This is particularly true in cases where sensitive information such as health data is concerned. With a bit more care, the retailer could have foreseen this scenario and used greater discretion while choosing how to reveal the pattern its machine learning analysis had discovered.

Certain jurisdictions may prevent you from using racial, ethnic, religious, or other protected class data for business reasons. Keep in mind that excluding this data from your analysis may not be enough, because machine learning algorithms might inadvertently learn this information independently. For instance, if a certain segment of people generally live in a certain region, buy a certain product, or otherwise behave in a way that uniquely identifies them as a group, some machine learning algorithms can infer the protected information from these other factors. In such cases, you may need to fully "de-identify" these people by excluding any potentially identifying data in addition to the protected information.

Apart from the legal consequences, using data inappropriately may hurt the bottom line. Customers may feel uncomfortable or become spooked if aspects of their lives they consider private are made public. In recent years, several high-profile web applications have experienced a mass exodus of users who felt exploited when the applications' terms of service agreements changed and their data was used for purposes beyond what the users had originally agreed upon. The fact that privacy expectations differ by context, age cohort, and locale adds complexity in deciding the appropriate use of personal data. It would be wise to consider the cultural implications of your work before you begin your project. The fact that you can use data for a particular end does not always mean that you should.

How machines learn

A formal definition of machine learning proposed by the computer scientist Tom M. Mitchell states that a machine learns whenever it is able to utilize its experience such that its performance improves on similar experiences in the future. Although this definition is intuitive, it completely ignores the process of exactly how experience can be translated into future action—and of course, learning is always easier said than done! While human brains are naturally capable of learning from birth, the conditions necessary for computers to learn must be made explicit.
For this reason, although it is not strictly necessary to understand the theoretical basis of learning, this foundation helps you understand, distinguish, and implement machine learning algorithms. As you compare machine learning to human learning, you may discover yourself examining your own mind in a different light.

Regardless of whether the learner is a human or a machine, the basic learning process is similar. It can be divided into four interrelated components:

- Data storage utilizes observation, memory, and recall to provide a factual basis for further reasoning.
- Abstraction involves the translation of stored data into broader representations and concepts.
- Generalization uses abstracted data to create knowledge and inferences that drive action in new contexts.
- Evaluation provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.

Keep in mind that although the learning process has been conceptualized as four distinct components, they are merely organized this way for illustrative purposes. In reality, the entire learning process is inextricably linked. In human beings, the process occurs subconsciously. We recollect, deduce, induct, and intuit within the confines of our mind's eye, and because this process is hidden, any differences from person to person are attributed to a vague notion of subjectivity. In contrast, with computers these processes are explicit, and because the entire process is transparent, the learned knowledge can be examined, transferred, and utilized for future action.

Data storage

All learning must begin with data. Humans and computers alike utilize data storage as a foundation for more advanced reasoning. In a human being, this consists of a brain that uses electrochemical signals in a network of biological cells to store and process observations for short- and long-term future recall. Computers have similar capabilities of short- and long-term recall using hard disk drives, flash memory, and random access memory (RAM) in combination with a central processing unit (CPU).

It may seem obvious to say so, but the ability to store and retrieve data alone is not sufficient for learning. Without a higher level of understanding, knowledge is limited exclusively to recall, meaning only what has been seen before and nothing else. The data is merely ones and zeros on a disk. They are stored memories with no broader meaning.

To better understand the nuances of this idea, it may help to think about the last time you studied for a difficult test, perhaps for a university final exam or a career certification. Did you wish for an eidetic (photographic) memory? If so, you may be disappointed to learn that perfect recall is unlikely to be of much assistance. Even if you could memorize material perfectly, your rote learning would be of no use unless you knew in advance the exact questions and answers that would appear in the exam. Otherwise, you would be stuck in an attempt to memorize answers to every question that could conceivably be asked. Obviously, this is an unsustainable strategy.

Instead, a better approach is to spend time selectively, memorizing a small set of representative ideas while developing strategies on how the ideas relate and how to use the stored information. In this way, large ideas can be understood without needing to memorize them by rote.

Abstraction

This work of assigning meaning to stored data occurs during the abstraction process, in which raw data comes to have a more abstract meaning.
This type of connection, say between an object and its representation, is exemplified by the famous René Magritte painting The Treachery of Images (source: http://collections.lacma.org/node/239578). The painting depicts a tobacco pipe with the caption Ceci n'est pas une pipe ("this is not a pipe"). The point Magritte was illustrating is that a representation of a pipe is not truly a pipe. Yet, in spite of the fact that the pipe is not real, anybody viewing the painting easily recognizes it as a pipe. This suggests that the observer's mind is able to connect the picture of a pipe to the idea of a pipe, to a memory of a physical pipe that could be held in the hand. Abstracted connections like these are the basis of knowledge representation, the formation of logical structures that assist in turning raw sensory information into a meaningful insight.

During a machine's process of knowledge representation, the computer summarizes stored raw data using a model, an explicit description of the patterns within the data. Just like Magritte's pipe, the model representation takes on a life beyond the raw data. It represents an idea greater than the sum of its parts.

There are many different types of models. You may already be familiar with some. Examples include:

- Mathematical equations
- Relational diagrams such as trees and graphs
- Logical if/else rules
- Groupings of data known as clusters

The choice of model is typically not left up to the machine. Instead, the learning task and the data on hand inform model selection. Later in this article, we will discuss methods to choose the type of model in more detail.

The process of fitting a model to a dataset is known as training. When the model has been trained, the data is transformed into an abstract form that summarizes the original information. You might wonder why this step is called training rather than learning. First, note that the process of learning does not end with data abstraction; the learner must still generalize and evaluate its training. Second, the word training better connotes the fact that the human teacher trains the machine student to understand the data in a specific way.

It is important to note that a learned model does not itself provide new data, yet it does result in new knowledge. How can this be? The answer is that imposing an assumed structure on the underlying data gives insight into the unseen by supposing a concept about how data elements are related. Take, for instance, the discovery of gravity. By fitting equations to observational data, Sir Isaac Newton inferred the concept of gravity. But the force we now know as gravity was always present. It simply wasn't recognized until Newton recognized it as an abstract concept that relates some data to others—specifically, by becoming the g term in a model that explains observations of falling objects.

Most models may not result in the development of theories that shake up scientific thought for centuries. Still, your model might result in the discovery of previously unseen relationships among data. A model trained on genomic data might find several genes that, when combined, are responsible for the onset of diabetes; banks might discover a seemingly innocuous type of transaction that systematically appears prior to fraudulent activity; and psychologists might identify a combination of personality characteristics indicating a new disorder. These underlying patterns were always present, but by simply presenting the information in a different format, a new idea is conceptualized.
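To make the idea of training concrete, here is a minimal sketch in R of fitting a model to simulated observations of falling objects, in the spirit of the Newton example above. The heights, times, and noise level are invented purely for illustration:

    # simulated observations: drop heights (metres) and measured fall times (seconds)
    height <- c(1, 2, 4, 8, 16)
    time   <- sqrt(2 * height / 9.8) + rnorm(length(height), sd = 0.01)  # small measurement noise

    # "training" is fitting a model: for free fall, time^2 should be proportional to height
    fit <- lm(I(time^2) ~ height)

    # the fitted slope is an abstraction of the raw observations; it estimates 2/g,
    # so the gravitational constant can be recovered as 2 divided by the slope
    2 / coef(fit)["height"]

The raw numbers are replaced by a compact abstraction (the fitted coefficient), which is exactly the kind of summary the abstraction step produces.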
Generalization

The learning process is not complete until the learner is able to use its abstracted knowledge for future action. However, among the countless underlying patterns that might be identified during the abstraction process and the myriad ways to model these patterns, some will be more useful than others. Unless the production of abstractions is limited, the learner will be unable to proceed. It would be stuck where it started—with a large pool of information, but no actionable insight.

The term generalization describes the process of turning abstracted knowledge into a form that can be utilized for future action, on tasks that are similar, but not identical, to those it has seen before. Generalization is a somewhat vague process that is a bit difficult to describe. Traditionally, it has been imagined as a search through the entire set of models (that is, theories or inferences) that could be abstracted during training. In other words, if you can imagine a hypothetical set containing every possible theory that could be established from the data, generalization involves the reduction of this set into a manageable number of important findings.

In generalization, the learner is tasked with limiting the patterns it discovers to only those that will be most relevant to its future tasks. Generally, it is not feasible to reduce the number of patterns by examining them one by one and ranking them by future utility. Instead, machine learning algorithms generally employ shortcuts that reduce the search space more quickly. Toward this end, the algorithm will employ heuristics, which are educated guesses about where to find the most useful inferences. Because heuristics utilize approximations and other rules of thumb, they are not guaranteed to find the single best model. However, without taking these shortcuts, finding useful information in a large dataset would be infeasible.

Heuristics are routinely used by human beings to quickly generalize experience to new scenarios. If you have ever utilized your gut instinct to make a snap decision prior to fully evaluating your circumstances, you were intuitively using mental heuristics. The incredible human ability to make quick decisions often relies not on computer-like logic, but rather on heuristics guided by emotions. Sometimes, this can result in illogical conclusions. For example, more people express fear of airline travel than automobile travel, despite automobiles being statistically more dangerous. This can be explained by the availability heuristic, which is the tendency of people to estimate the likelihood of an event by how easily its examples can be recalled. Accidents involving air travel are highly publicized. Being traumatic events, they are likely to be recalled very easily, whereas car accidents barely warrant a mention in the newspaper.

The folly of misapplied heuristics is not limited to human beings. The heuristics employed by machine learning algorithms also sometimes result in erroneous conclusions. The algorithm is said to have a bias if the conclusions are systematically erroneous, or wrong in a predictable manner. For example, suppose that a machine learning algorithm learned to identify faces by finding two dark circles representing eyes, positioned above a straight line indicating a mouth. The algorithm might then have trouble with, or be biased against, faces that do not conform to its model. Faces with glasses, turned at an angle, looking sideways, or with various skin tones might not be detected by the algorithm.
Similarly, it could be biased toward faces with certain skin tones, face shapes, or other characteristics that do not conform to its understanding of the world.

In modern usage, the word bias has come to carry quite negative connotations. Various forms of media frequently claim to be free from bias, and claim to report the facts objectively, untainted by emotion. Still, consider for a moment the possibility that a little bias might be useful. Without a bit of arbitrariness, might it be a bit difficult to decide among several competing choices, each with distinct strengths and weaknesses? Indeed, some recent studies in the field of psychology have suggested that individuals born with damage to the portions of the brain responsible for emotion are ineffectual in decision making, and might spend hours debating simple decisions such as what color shirt to wear or where to eat lunch. Paradoxically, bias is what blinds us from some information while also allowing us to utilize other information for action. It is how machine learning algorithms choose among the countless ways to understand a set of data.

Evaluation

Bias is a necessary evil associated with the abstraction and generalization processes inherent in any learning task. In order to drive action in the face of limitless possibility, each learner must be biased in a particular way. Consequently, each learner has its weaknesses and there is no single learning algorithm to rule them all. Therefore, the final step in the generalization process is to evaluate or measure the learner's success in spite of its biases, and use this information to inform additional training if needed.

Once you've had success with one machine learning technique, you might be tempted to apply it to everything. It is important to resist this temptation because no machine learning approach is the best for every circumstance. This fact is described by the No Free Lunch theorem, introduced by David Wolpert in 1996. For more information, visit http://www.no-free-lunch.org.

Generally, evaluation occurs after a model has been trained on an initial training dataset. Then, the model is evaluated on a new test dataset in order to judge how well its characterization of the training data generalizes to new, unseen data. It's worth noting that it is exceedingly rare for a model to perfectly generalize to every unforeseen case.

In part, models fail to perfectly generalize due to the problem of noise, a term that describes unexplained or unexplainable variations in data. Noisy data is caused by seemingly random events, such as:

- Measurement error due to imprecise sensors that sometimes add or subtract a bit from the readings
- Issues with human subjects, such as survey respondents reporting random answers to survey questions in order to finish more quickly
- Data quality problems, including missing, null, truncated, incorrectly coded, or corrupted values
- Phenomena that are so complex or so little understood that they impact the data in ways that appear to be unsystematic

Trying to model noise is the basis of a problem called overfitting. Because most noisy data is unexplainable by definition, attempting to explain the noise will result in erroneous conclusions that do not generalize well to new cases. Efforts to explain the noise will also typically result in more complex models that miss the true pattern the learner is trying to identify.
A model that seems to perform well during training but does poorly during evaluation is said to be overfitted to the training dataset, as it does not generalize well to the test dataset. Solutions to the problem of overfitting are specific to particular machine learning approaches. For now, the important point is to be aware of the issue. How well a model handles noisy data is an important source of distinction among machine learning approaches.

Machine learning in practice

So far, we've focused on how machine learning works in theory. To apply the learning process to real-world tasks, we'll use a five-step process. Regardless of the task at hand, any machine learning algorithm can be deployed by following these steps:

1. Data collection: The data collection step involves gathering the learning material an algorithm will use to generate actionable knowledge. In most cases, the data will need to be combined into a single source, like a text file, spreadsheet, or database.
2. Data exploration and preparation: The quality of any machine learning project is based largely on the quality of its input data. Thus, it is important to learn more about the data and its nuances during a practice called data exploration. Additional work is required to prepare the data for the learning process. This involves fixing or cleaning so-called "messy" data, eliminating unnecessary data, and recoding the data to conform to the learner's expected inputs.
3. Model training: By the time the data has been prepared for analysis, you are likely to have a sense of what you are capable of learning from the data. The specific machine learning task chosen will inform the selection of an appropriate algorithm, and the algorithm will represent the data in the form of a model.
4. Model evaluation: Because each machine learning model results in a biased solution to the learning problem, it is important to evaluate how well the algorithm learned from its experience. Depending on the type of model used, you might be able to evaluate the accuracy of the model using a test dataset, or you may need to develop measures of performance specific to the intended application.
5. Model improvement: If better performance is needed, it becomes necessary to utilize more advanced strategies to augment the performance of the model. Sometimes, it may be necessary to switch to a different type of model altogether. You may need to supplement your data with additional data or perform additional preparatory work, as in step two of this process.

After these steps are completed, if the model appears to be performing well, it can be deployed for its intended task. As the case may be, you might utilize your model to score data for predictions (possibly in real time), to make projections of financial data, to generate useful insight for marketing or research, or to automate tasks such as mail delivery or flying aircraft. The successes and failures of the deployed model might even provide additional data to train your next-generation learner.

Types of input data

The practice of machine learning involves matching the characteristics of input data to the biases of the available approaches. Thus, before applying machine learning to real-world problems, it is important to understand the terminology that distinguishes among input datasets. The phrase unit of observation is used to describe the smallest entity with measured properties of interest for a study.
Commonly, the unit of observation is in the form of persons, objects or things, transactions, time points, geographic regions, or measurements. Sometimes, units of observation are combined to form units such as person-years, which denote cases where the same person is tracked over multiple years; each person-year comprises a person's data for one year.

The unit of observation is related, but not identical, to the unit of analysis, which is the smallest unit from which the inference is made. Although it is often the case, the observed and analyzed units are not always the same. For example, data observed from people might be used to analyze trends across countries.

Datasets that store the units of observation and their properties can be imagined as collections of data consisting of:

- Examples: Instances of the unit of observation for which properties have been recorded
- Features: Recorded properties or attributes of examples that may be useful for learning

It is easiest to understand features and examples through real-world cases. To build a learning algorithm to identify spam e-mail, the unit of observation could be e-mail messages, the examples would be specific messages, and the features might consist of the words used in the messages. For a cancer detection algorithm, the unit of observation could be patients, the examples might include a random sample of cancer patients, and the features may be the genomic markers from biopsied cells as well as characteristics of the patient such as weight, height, or blood pressure.

While examples and features do not have to be collected in any specific form, they are commonly gathered in matrix format, which means that each example has exactly the same features. In matrix data, each row is an example and each column is a feature. Imagine a spreadsheet of used automobiles: the rows are examples of automobiles, while the columns record each automobile's features, such as price, mileage, color, and transmission type. Matrix format data is by far the most common form used in machine learning.

Features also come in various forms. If a feature represents a characteristic measured in numbers, it is unsurprisingly called numeric. Alternatively, if a feature is an attribute that consists of a set of categories, the feature is called categorical or nominal. A special case of categorical variables is called ordinal, which designates a nominal variable with categories falling in an ordered list. Some examples of ordinal variables include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from "not at all happy" to "very happy." It is important to consider what the features represent, as the type and number of features in your dataset will assist in determining an appropriate machine learning algorithm for your task.

Types of machine learning algorithms

Machine learning algorithms are divided into categories according to their purpose. Understanding the categories of learning algorithms is an essential first step towards using data to drive the desired action.

A predictive model is used for tasks that involve, as the name implies, the prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship between the target feature (the feature being predicted) and the other features.
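As a small illustration of the matrix format and the feature types just described, here is a hedged R sketch that builds a toy automobile dataset. The values, and the added "condition" column used to show an ordinal feature, are invented for this example:

    used_cars <- data.frame(
      price        = c(14500, 12900, 8999),                  # numeric feature
      mileage      = c(31000, 42000, 73000),                 # numeric feature
      color        = factor(c("red", "black", "silver")),    # nominal (categorical) feature
      transmission = factor(c("AUTO", "MANUAL", "AUTO")),    # nominal (categorical) feature
      condition    = factor(c("good", "excellent", "fair"),
                            levels = c("fair", "good", "excellent"),
                            ordered = TRUE)                   # ordinal feature
    )

    str(used_cars)   # each row is an example; each column is a feature with its type

Running str() shows how R records the type of each feature, which is the information you would use when matching the data to an algorithm.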
Despite the common use of the word "prediction" to imply forecasting, predictive models need not necessarily foresee events in the future. For instance, a predictive model could be used to predict past events, such as the date of a baby's conception using the mother's present-day hormone levels. Predictive models can also be used in real time to control traffic lights during rush hours.

Because predictive models are given clear instruction on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as supervised learning. The supervision does not refer to human involvement, but rather to the fact that the target values provide a way for the learner to know how well it has learned the desired task. Stated more formally, given a set of data, a supervised learning algorithm attempts to optimize a function (the model) to find the combination of feature values that results in the target output.

The often-used supervised machine learning task of predicting which category an example belongs to is known as classification. It is easy to think of potential uses for a classifier. For instance, you could predict whether:

- An e-mail message is spam
- A person has cancer
- A football team will win or lose
- An applicant will default on a loan

In classification, the target feature to be predicted is a categorical feature known as the class, and is divided into categories called levels. A class can have two or more levels, and the levels may or may not be ordinal. Because classification is so widely used in machine learning, there are many types of classification algorithms, with strengths and weaknesses suited for different types of input data.

Supervised learners can also be used to predict numeric data such as income, laboratory values, test scores, or counts of items. To predict such numeric values, a common form of numeric prediction fits linear regression models to the input data. Although regression models are not the only type of numeric models, they are, by far, the most widely used. Regression methods are widely used for forecasting, as they quantify in exact terms the association between the inputs and the target, including both the magnitude and the uncertainty of the relationship.

Since it is easy to convert numbers into categories (for example, ages 13 to 19 are teenagers) and categories into numbers (for example, assign 1 to all males, 0 to all females), the boundary between classification models and numeric prediction models is not necessarily firm.

A descriptive model is used for tasks that would benefit from the insight gained from summarizing data in new and interesting ways. As opposed to predictive models that predict a target of interest, in a descriptive model, no single feature is more important than any other. In fact, because there is no target to learn, the process of training a descriptive model is called unsupervised learning. Although it can be more difficult to think of applications for descriptive models—after all, what good is a learner that isn't learning anything in particular—they are used quite regularly for data mining.

For example, the descriptive modeling task called pattern discovery is used to identify useful associations within data. Pattern discovery is often used for market basket analysis on retailers' transactional purchase data. Here, the goal is to identify items that are frequently purchased together, such that the learned information can be used to refine marketing tactics.
For instance, if a retailer learns that swimming trunks are commonly purchased at the same time as sunglasses, the retailer might reposition the items more closely in the store or run a promotion to "up-sell" customers on associated items. Originally used only in retail contexts, pattern discovery is now starting to be used in quite innovative ways. For instance, it can be used to detect patterns of fraudulent behavior, screen for genetic defects, or identify hot spots for criminal activity.

The descriptive modeling task of dividing a dataset into homogeneous groups is called clustering. This is sometimes used for segmentation analysis that identifies groups of individuals with similar behavior or demographic information, so that advertising campaigns can be tailored for particular audiences. Although the machine is capable of identifying the clusters, human intervention is required to interpret them. For example, given five different clusters of shoppers at a grocery store, the marketing team will need to understand the differences among the groups in order to create a promotion that best suits each group.

Lastly, a class of machine learning algorithms known as meta-learners is not tied to a specific learning task, but is rather focused on learning how to learn more effectively. A meta-learning algorithm uses the results of past learning to inform additional learning. This can be beneficial for very challenging problems or when a predictive algorithm's performance needs to be as accurate as possible.

Machine learning with R

Many of the algorithms needed for machine learning with R are not included as part of the base installation. Instead, the algorithms needed for machine learning are available via a large community of experts who have shared their work freely. These must be installed on top of base R manually. Thanks to R's status as free open source software, there is no additional charge for this functionality.

A collection of R functions that can be shared among users is called a package. Free packages exist for each of the machine learning algorithms covered in this book. In fact, this book only covers a small portion of all of R's machine learning packages. If you are interested in the breadth of R packages, you can view a list at the Comprehensive R Archive Network (CRAN), a collection of web and FTP sites located around the world that provide the most up-to-date versions of R software and packages. If you obtained the R software via download, it was most likely from CRAN at http://cran.r-project.org/index.html. If you do not already have R, the CRAN website also provides installation instructions and information on where to find help if you have trouble.

The Packages link on the left side of the page will take you to a page where you can browse packages in alphabetical order or sorted by publication date. At the time of writing this, a total of 6,779 packages were available—a jump of over 60 percent in the time since the first edition was written, and this trend shows no sign of slowing! The Task Views link on the left side of the CRAN page provides a curated list of packages per subject area. The task view for machine learning, which lists the packages covered in this book (and many more), is available at http://cran.r-project.org/web/views/MachineLearning.html.

Installing R packages

Despite the vast set of available R add-ons, the package format makes installation and use a virtually effortless process.
To demonstrate the use of packages, we will install and load the RWeka package, which was developed by Kurt Hornik, Christian Buchta, and Achim Zeileis (see Open-Source Machine Learning: R Meets Weka in Computational Statistics 24: 225-232 for more information). The RWeka package provides a collection of functions that give R access to the machine learning algorithms in the Java-based Weka software package by Ian H. Witten and Eibe Frank. More information on Weka is available at http://www.cs.waikato.ac.nz/~ml/weka/.

To use the RWeka package, you will need to have Java installed (many computers come with Java preinstalled). Java is a set of programming tools, available for free, that allows for the use of cross-platform applications such as Weka. For more information, and to download Java for your system, visit http://java.com.

The most direct way to install a package is via the install.packages() function. To install the RWeka package, at the R command prompt, simply type:

> install.packages("RWeka")

R will then connect to CRAN and download the package in the correct format for your OS. Some packages, such as RWeka, require additional packages to be installed before they can be used (these are called dependencies). By default, the installer will automatically download and install any dependencies.

The first time you install a package, R may ask you to choose a CRAN mirror. If this happens, choose the mirror residing at a location close to you. This will generally provide the fastest download speed.

The default installation options are appropriate for most systems. However, in some cases, you may want to install a package to another location. For example, if you do not have root or administrator privileges on your system, you may need to specify an alternative installation path. This can be accomplished using the lib option, as follows:

> install.packages("RWeka", lib = "/path/to/library")

The installation function also provides additional options for installation from a local file, installation from source, or using experimental versions. You can read about these options in the help file by using the following command:

> ?install.packages

More generally, the question mark operator can be used to obtain help on any R function. Simply type ? before the name of the function.

Loading and unloading R packages

In order to conserve memory, R does not load every installed package by default. Instead, packages are loaded by users as they are needed, using the library() function. The name of this function leads some people to incorrectly use the terms library and package interchangeably. However, to be precise, a library refers to the location where packages are installed and never to a package itself. To load the RWeka package we installed previously, you can type the following:

> library(RWeka)

To unload an R package, use the detach() function. For example, to unload the RWeka package shown previously, use the following command:

> detach("package:RWeka", unload = TRUE)

This will free up any resources used by the package.

Summary

To learn more about machine learning, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

- Machine Learning with R - Second Edition
- Machine Learning with R Cookbook
- Mastering Machine Learning with R
- Learning Data Mining with R

Further resources on this subject:

- Getting Started with RStudio
- Machine learning and Python – the Dream Team
- Deep learning in R


R vs Pandas

Packt
16 Feb 2016
1 min read
This article focuses on comparing pandas with R, the statistical package on which much of pandas' functionality is modeled. It is intended as a guide for R users who wish to use pandas, and for users who wish to replicate functionality that they have seen in R code in pandas. It focuses on some key features available to R users and shows how to achieve similar functionality in pandas by using some illustrative examples. This article assumes that you have the R statistical package installed. If not, it can be downloaded and installed from here: http://www.r-project.org/.

By the end of the article, data analysis users should have a good grasp of the data analysis capabilities of R as compared to pandas, enabling them to transition to or use pandas, should they need to. The various topics addressed in this article include the following:

- R data types and their pandas equivalents
- Slicing and selection
- Arithmetic operations on datatype columns
- Aggregation and GroupBy
- Matching
- Split-apply-combine
- Melting and reshaping
- Factors and categorical data
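To give a flavor of the R side of these comparisons, here is a minimal example (not taken from the article itself) of the kind of aggregation and slicing that pandas users would write with groupby and .loc, using R's built-in iris dataset:

    # mean petal length by species: the R counterpart of a pandas groupby/mean
    aggregate(Petal.Length ~ Species, data = iris, FUN = mean)

    # slicing and selection: first three rows of two columns
    iris[1:3, c("Species", "Petal.Length")]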


Data mining

Packt
16 Feb 2016
11 min read
Let's talk about data mining. What is data mining? Data mining is the discovery of a model in data; it's also called exploratory data analysis, and it discovers useful, valid, unexpected, and understandable knowledge from data. Some goals are shared with other sciences, such as statistics, artificial intelligence, machine learning, and pattern recognition. Data mining has frequently been treated as an algorithmic problem. Clustering, classification, association rule learning, anomaly detection, regression, and summarization are all part of the tasks belonging to data mining.

The data mining methods can be summarized into two main categories of data mining problems: feature extraction and summarization.

Feature extraction

This is to extract the most prominent features of the data and ignore the rest. Here are some examples:

- Frequent itemsets: This model makes sense for data that consists of baskets of small sets of items.
- Similar items: Sometimes your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. It's a fundamental problem of data mining.

Summarization

The target is to summarize the dataset succinctly and approximately, such as clustering, which is the process of examining a collection of points (data) and grouping the points into clusters according to some measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.

The data mining process

There are two popular processes that define the data mining process from different perspectives, and the more widely adopted one is CRISP-DM:

- Cross-Industry Standard Process for Data Mining (CRISP-DM)
- Sample, Explore, Modify, Model, Assess (SEMMA), which was developed by the SAS Institute, USA

CRISP-DM

There are six phases in this process; it is not rigid, but often involves a great deal of backtracking. Let's look at the phases in detail:

- Business understanding: This task includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a plan.
- Data understanding: This task evaluates data requirements and includes initial data collection, data description, data exploration, and the verification of data quality.
- Data preparation: Data resources identified in the last step are selected, cleaned, and then built into the desired form and format.
- Modeling: Visualization and cluster analysis are useful for initial analysis. The initial association rules can be developed by applying tools such as generalized rule induction, a data mining technique that discovers knowledge represented as rules illustrating the data in view of the causal relationship between conditional factors and a given decision/outcome. Models appropriate to the data type can also be applied.
- Evaluation: The results should be evaluated in the context specified by the business objectives in the first step. This leads to the identification of new needs and, in turn, reverts to the prior phases in most cases.
- Deployment: Data mining can be used both to verify previously held hypotheses and to discover new knowledge.
SEMMA

Here is an overview of the SEMMA process. Let's look at its steps in detail:

- Sample: In this step, a portion of a large dataset is extracted
- Explore: To gain a better understanding of the dataset, unanticipated trends and anomalies are searched for in this step
- Modify: The variables are created, selected, and transformed to focus on the model construction process
- Model: A variable combination of models is searched to predict a desired outcome
- Assess: The findings from the data mining process are evaluated by their usefulness and reliability

Social network mining

As we mentioned before, data mining finds a model in data, and social network mining finds the model in the graph data by which the social network is represented. Social network mining is one application of web data mining; the popular applications are social sciences and bibliometry, PageRank and HITS, shortcomings of the coarse-grained graph model, enhanced models and techniques, evaluation of topic distillation, and measuring and modeling the Web.

Social network

When it comes to the discussion of social networks, you will think of Facebook, Google+, LinkedIn, and so on. The essential characteristics of a social network are as follows:

- There is a collection of entities that participate in the network. Typically, these entities are people, but they could be something else entirely.
- There is at least one relationship between the entities of the network. On Facebook, this relationship is called friends. Sometimes, the relationship is all-or-nothing; two people are either friends or they are not. However, in other examples of social networks, the relationship has a degree. This degree could be discrete, for example, friends, family, acquaintances, or none as in Google+. It could be a real number; an example would be the fraction of the average day that two people spend talking to each other.
- There is an assumption of nonrandomness or locality. This condition is the hardest to formalize, but the intuition is that relationships tend to cluster. That is, if entity A is related to both B and C, then there is a higher probability than average that B and C are related.

Here are some varieties of social networks:

- Telephone networks: The nodes in this network are phone numbers and represent individuals
- E-mail networks: The nodes represent e-mail addresses, which represent individuals
- Collaboration networks: The nodes here represent individuals who published research papers; an edge connecting two nodes represents two individuals who published one or more papers jointly

Social networks are modeled as undirected graphs. The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network. If there is a degree associated with the relationship, this degree is represented by labeling the edges.

Here is an example in which Coleman's High School Friendship Data from the sna R package is used for analysis. The data is from a study of friendship ties between 73 boys in a high school in one chosen academic year; reported ties for all informants are provided for two time points (fall and spring). The dataset's name is coleman, which is an array type in the R language. A node denotes a specific student and a line represents the tie between two students.
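The article stops short of showing code for this example, so here is a minimal hedged sketch of how the coleman data might be loaded and inspected, assuming the sna package is installed (install.packages("sna")) and that the first slice of the array holds the fall measurement:

    library(sna)

    data(coleman)            # a 2 x 73 x 73 array: reported ties at two time points
    fall <- coleman[1, , ]   # adjacency matrix for the fall time point (assumed ordering)

    gden(fall)                        # density of the reported friendship ties
    gplot(fall, gmode = "digraph")    # plot the ties as a directed graph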
Text mining Text mining is based on the data of text, concerned with exacting relevant information from large natural language text, and searching for interesting relationships, syntactical correlation, or semantic association between the extracted entities or terms. It is also defined as automatic or semiautomatic processing of text. The related algorithms include text clustering, text classification, natural language processing, and web mining. One of the characteristics of text mining is text mixed with numbers, or in other point of view, the hybrid data type contained in the source dataset. The text is usually a collection of unstructured documents, which will be preprocessed and transformed into a numerical and structured representation. After the transformation, most of the data mining algorithms can be applied with good effects. The process of text mining is described as follows: Text mining starts from preparing the text corpus, which are reports, letters and so forth The second step is to build a semistructured text database that is based on the text corpus The third step is to build a term-document matrix in which the term frequency is included The final result is further analysis, such as text analysis, semantic analysis, information retrieval, and information summarization Information retrieval and text mining Information retrieval is to help users find information, most commonly associated with online documents. It focuses on the acquisition, organization, storage, retrieval, and distribution for information. The task of Information Retrieval (IR) is to retrieve relevant documents in response to a query. The fundamental technique of IR is measuring similarity. Key steps in IR are as follows: Specify a query. The following are some of the types of queries: Keyword query: This is expressed by a list of keywords to find documents that contain at least one keyword Boolean query: This is constructed with Boolean operators and keywords Phrase query: This is a query that consists of a sequence of words that makes up a phrase Proximity query: This is a downgrade version of the phrase queries and can be a combination of keywords and phrases Full document query: This query is a full document to find other documents similar to the query document Natural language questions: This query helps to express users' requirements as a natural language question Search the document collection. Return the subset of relevant documents. Mining text for prediction Prediction of results from text is just as ambitious as predicting numerical data mining and has similar problems associated with numerical classification. It is generally a classification issue. Prediction from text needs prior experience, from the sample, to learn how to draw a prediction on new documents. Once text is transformed into numeric data, prediction methods can be applied. Web data mining Web mining aims to discover useful information or knowledge from the web hyperlink structure, page, and usage data. The Web is one of the biggest data sources to serve as the input for data mining applications. Web data mining is based on IR, machine learning (ML), statistics, pattern recognition, and data mining. Web mining is not purely a data mining problem because of the heterogeneous and semistructured or unstructured web data, although many data mining approaches can be applied to it. 
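Before going into the individual web mining tasks, here is a minimal sketch of the term-document matrix step from the text mining process described above, built with the tm package (a recent version of tm is assumed). The three toy documents and the preprocessing choices are assumptions used only to show the mechanics:

# Building a term-document matrix with the tm package (illustrative sketch)
# install.packages("tm")              # if the package is not yet installed
library(tm)
docs <- c("data mining discovers useful knowledge",
          "text mining extracts information from text",
          "web mining analyses hyperlinks and usage logs")
corpus <- VCorpus(VectorSource(docs))                  # the text corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # normalize case
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(corpus)                      # term frequencies per document
inspect(tdm)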
Web mining tasks can be defined into at least three types: Web structure mining: This helps to find useful information or valuable structural summary about sites and pages from hyperlinks Web content mining: This helps to mine useful information from web page contents Web usage mining: This helps to discover user access patterns from web logs to detect intrusion, fraud, and attempted break-in The algorithms applied to web data mining are originated from classical data mining algorithms; it shares many similarities, such as the mining process, however, differences exist too. The characteristics of web data mining makes it different from data mining for the following reasons: The data is unstructured The information of the Web keeps changing and the amount of data keeps growing Any data type is available on the Web, such as structured and unstructured data Heterogeneous information is on the web; redundant pages are present too Vast amounts of information on the web is linked The data is noisy Web data mining differentiates from data mining by the huge dynamic volume of source dataset, a big variety of data format, and so on. The most popular data mining tasks related to the Web are as follows: Information extraction (IE):The task of IE consists of a couple of steps, tokenization, sentence segmentation, part-of-speech assignment, named entity identification, phrasal parsing, sentential parsing, semantic interpretation, discourse interpretation, template filling, and merging. Natural language processing (NLP): This researches the linguistic characteristics of human-human and human-machine interactive, models of linguistic competence and performance, frameworks to implement process with such models, processes'/models' iterative refinement, and evaluation techniques for the result systems. Classical NLP tasks related to web data mining are tagging, knowledge representation, ontologies, and so on. Question answering: The goal is to find the answer from a collection of text to questions in natural language format. It can be categorized into slot filling, limited domain, and open domain with bigger difficulties for the latter. One simple example is based on a predefined FAQ to answer queries from customers. Resource discovery: The popular applications are collecting important pages preferentially; similarity search using link topology, topical locality and focused crawling; and discovering communities. Summary We have looked at the broad aspects of data mining here. In case you are wondering what to look at next, check out how to "data mine" in R with Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r). If R is not your taste, you can "data mine" with Python as well. Check out Learning Data Mining with Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-python). Resources for Article: Further resources on this subject: Machine Learning with R [Article] Machine learning and Python – the Dream Team [Article] Machine Learning in Bioinformatics [Article]
Training and Visualizing a neural network with R

Oli Huggins, 16 Feb 2016
The development of a neural network is inspired by human brain activities. As such, this type of network is a computational model that mimics the pattern of the human mind. In contrast to this, support vector machines first, map input data into a high dimensional feature space defined by the kernel function, and find the optimum hyperplane that separates the training data by the maximum margin. In short, we can think of support vector machines as a linear algorithm in a high dimensional space. In this article, we will cover: Training a neural network with neuralnet Visualizing a neural network trained by neuralnet (For more resources related to this topic, see here.) Training a neural network with neuralnet The neural network is constructed with an interconnected group of nodes, which involves the input, connected weights, processing element, and output. Neural networks can be applied to many areas, such as classification, clustering, and prediction. To train a neural network in R, you can use neuralnet, which is built to train multilayer perceptron in the context of regression analysis, and contains many flexible functions to train forward neural networks. In this recipe, we will introduce how to use neuralnet to train a neural network. Getting ready In this recipe, we will use an iris dataset as our example dataset. We will first split the irisdataset into a training and testing datasets, respectively. How to do it... Perform the following steps to train a neural network with neuralnet: First load the iris dataset and split the data into training and testing datasets: > data(iris) > ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.7, 0.3)) > trainset = iris[ind == 1,]> testset = iris[ind == 2,] Then, install and load the neuralnet package: > install.packages("neuralnet")> library(neuralnet) Add the columns versicolor, setosa, and virginica based on the name matched value in the Species column: > trainset$setosa = trainset$Species == "setosa" > trainset$virginica = trainset$Species == "virginica" > trainset$versicolor = trainset$Species == "versicolor" Next, train the neural network with the neuralnet function with three hidden neurons in each layer. Notice that the results may vary with each training, so you might not get the same result: > network = neuralnet(versicolor + virginica + setosa~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, trainset, hidden=3) > network Call: neuralnet(formula = versicolor + virginica + setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = trainset, hidden = 3) 1 repetition was calculated. 
Error Reached Threshold Steps 1 0.8156100175 0.009994274769 11063 Now, you can view the summary information by accessing the result.matrix attribute of the built neural network model: > network$result.matrix 1 error 0.815610017474 reached.threshold 0.009994274769 steps 11063.000000000000 Intercept.to.1layhid1 1.686593311644 Sepal.Length.to.1layhid1 0.947415215237 Sepal.Width.to.1layhid1 -7.220058260187 Petal.Length.to.1layhid1 1.790333443486 Petal.Width.to.1layhid1 9.943109233330 Intercept.to.1layhid2 1.411026063895 Sepal.Length.to.1layhid2 0.240309549505 Sepal.Width.to.1layhid2 0.480654059973 Petal.Length.to.1layhid2 2.221435192437 Petal.Width.to.1layhid2 0.154879347818 Intercept.to.1layhid3 24.399329878242 Sepal.Length.to.1layhid3 3.313958088512 Sepal.Width.to.1layhid3 5.845670010464 Petal.Length.to.1layhid3 -6.337082722485 Petal.Width.to.1layhid3 -17.990352566695 Intercept.to.versicolor -1.959842102421 1layhid.1.to.versicolor 1.010292389835 1layhid.2.to.versicolor 0.936519720978 1layhid.3.to.versicolor 1.023305801833 Intercept.to.virginica -0.908909982893 1layhid.1.to.virginica -0.009904635231 1layhid.2.to.virginica 1.931747950462 1layhid.3.to.virginica -1.021438938226 Intercept.to.setosa 1.500533827729 1layhid.1.to.setosa -1.001683936613 1layhid.2.to.setosa -0.498758815934 1layhid.3.to.setosa -0.001881935696 Lastly, you can view the generalized weight by accessing it in the network: > head(network$generalized.weights[[1]]) How it works... The neural network is a network made up of artificial neurons (or nodes). There are three types of neurons within the network: input neurons, hidden neurons, and output neurons. In the network, neurons are connected; the connection strength between neurons is called weights. If the weight is greater than zero, it is in an excitation status. Otherwise, it is in an inhibition status. Input neurons receive the input information; the higher the input value, the greater the activation. Then, the activation value is passed through the network in regard to weights and transfer functions in the graph. The hidden neurons (or output neurons) then sum up the activation values and modify the summed values with the transfer function. The activation value then flows through hidden neurons and stops when it reaches the output nodes. As a result, one can use the output value from the output neurons to classify the data. Artificial Neural Network The advantages of a neural network are: firstly, it can detect a nonlinear relationship between the dependent and independent variable. Secondly, one can efficiently train large datasets using the parallel architecture. Thirdly, it is a nonparametric model so that one can eliminate errors in the estimation of parameters. The main disadvantages of neural network are that it often converges to the local minimum rather than the global minimum. Also, it might over-fit when the training process goes on for too long. In this recipe, we demonstrate how to train a neural network. First, we split the iris dataset into training and testing datasets, and then install the neuralnet package and load the library into an R session. Next, we add the columns versicolor, setosa, and virginica based on the name matched value in the Species column, respectively. We then use the neuralnet function to train the network model. Besides specifying the label (the column where the name equals to versicolor, virginica, and setosa) and training attributes in the function, we also configure the number of hidden neurons (vertices) as three in each layer. 
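As a side note, the testset split off earlier is not used during training. If you want a quick check of the fitted model on held-out data, the compute function from the neuralnet package can be used along the following lines; this is a sketch that assumes the network object and testset from the preceding steps are still in the session, with the output columns following the order of the formula (versicolor, virginica, setosa):

> net.predict <- compute(network, testset[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])$net.result
> net.prediction <- c("versicolor", "virginica", "setosa")[apply(net.predict, 1, which.max)]
> table(net.prediction, testset$Species)

The resulting table is a simple confusion matrix of predicted versus actual species.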
Then, we examine the basic information about the training process and the trained network saved in the network. From the output message, it shows the training process needed 11,063 steps until all the absolute partial derivatives of the error function were lower than 0.01 (specified in the threshold). The error refers to the likelihood of calculating Akaike Information Criterion (AIC). To see detailed information on this, you can access the result.matrix of the built neural network to see the estimated weight. The output reveals that the estimated weight ranges from -18 to 24.40; the intercepts of the first hidden layer are 1.69, 1.41 and 24.40, and the two weights leading to the first hidden neuron are estimated as 0.95 (Sepal.Length), -7.22 (Sepal.Width), 1.79 (Petal.Length), and 9.94 (Petal.Width). We can lastly determine that the trained neural network information includes generalized weights, which express the effect of each covariate. In this recipe, the model generates 12 generalized weights, which are the combination of four covariates (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) to three responses (setosa, virginica, versicolor). See also For a more detailed introduction on neuralnet, one can refer to the following paper: Günther, F., and Fritsch, S. (2010). neuralnet: Training of neural networks. The R journal, 2(1), 30-38. Visualizing a neural network trained by neuralnet The package, neuralnet, provides the plot function to visualize a built neural network and the gwplot function to visualize generalized weights. In following recipe, we will cover how to use these two functions. Getting ready You need to have completed the previous recipe by training a neural network and have all basic information saved in network. How to do it... Perform the following steps to visualize the neural network and the generalized weights: You can visualize the trained neural network with the plot function: > plot(network) Figure 10: The plot of trained neural network Furthermore, You can use gwplot to visualize the generalized weights: > par(mfrow=c(2,2)) > gwplot(network,selected.covariate="Petal.Width") > gwplot(network,selected.covariate="Sepal.Width") > gwplot(network,selected.covariate="Petal.Length") > gwplot(network,selected.covariate="Petal.Width") Figure 11: The plot of generalized weights How it works... In this recipe, we demonstrate how to visualize the trained neural network and the generalized weights of each trained attribute. Also, the plot includes the estimated weight, intercepts and basic information about the training process. At the bottom of the figure, one can find the overall error and number of steps required to converge. If all the generalized weights are close to zero on the plot, it means the covariate has little effect. However, if the overall variance is greater than one, it means the covariate has a nonlinear effect. See also For more information about gwplot, one can use the help function to access the following document: > ?gwplot Summary To learn more about machine learning with R, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Machine Learning with R (Second Edition) - Read Online Mastering Machine Learning with R - Read Online Resources for Article: Further resources on this subject: Introduction to Machine Learning with R [article] Hive Security [article] Spark - Architecture and First Program [article]
Introduction to Scikit-Learn

Packt, 16 Feb 2016
Introduction Since its release in 2007, scikit-learn has become one of the most popular open source machine learning libraries for Python. scikit-learn provides algorithms for machine learning tasks including classification, regression, dimensionality reduction, and clustering. It also provides modules for extracting features, processing data, and evaluating models. (For more resources related to this topic, see here.) Conceived as an extension to the SciPy library, scikit-learn is built on the popular Python libraries NumPy and matplotlib. NumPy extends Python to support efficient operations on large arrays and multidimensional matrices. matplotlib provides visualization tools, and SciPy provides modules for scientific computing. scikit-learn is popular for academic research because it has a well-documented, easy-to-use, and versatile API. Developers can use scikit-learn to experiment with different algorithms by changing only a few lines of the code. scikit-learn wraps some popular implementations of machine learning algorithms, such as LIBSVM and LIBLINEAR. Other Python libraries, including NLTK, include wrappers for scikit-learn. scikit-learn also includes a variety of datasets, allowing developers to focus on algorithms rather than obtaining and cleaning data. Licensed under the permissive BSD license, scikit-learn can be used in commercial applications without restrictions. Many of scikit-learn's algorithms are fast and scalable to all but massive datasets. Finally, scikit-learn is noted for its reliability; much of the library is covered by automated tests. Installing scikit-learn This book is written for version 0.15.1 of scikit-learn; use this version to ensure that the examples run correctly. If you have previously installed scikit-learn, you can retrieve the version number with the following code: >>> import sklearn >>> sklearn.__version__ '0.15.1' If you have not previously installed scikit-learn, you can install it from a package manager or build it from the source. We will review the installation processes for Linux, OS X, and Windows in the following sections, but refer to http://scikit-learn.org/stable/install.html for the latest instructions. The following instructions only assume that you have installed Python 2.6, Python 2.7, or Python 3.2 or newer. Go to http://www.python.org/download/ for instructions on how to install Python. Installing scikit-learn on Windows scikit-learn requires Setuptools, a third-party package that supports packaging and installing software for Python. Setuptools can be installed on Windows by running the bootstrap script at https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py. Windows binaries for the 32- and 64-bit versions of scikit-learn are also available. If you cannot determine which version you need, install the 32-bit version. Both versions depend on NumPy 1.3 or newer. The 32-bit version of NumPy can be downloaded from http://sourceforge.net/projects/numpy/files/NumPy/. The 64-bit version can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. A Windows installer for the 32-bit version of scikit-learn can be downloaded from http://sourceforge.net/projects/scikit-learn/files/. An installer for the 64-bit version of scikit-learn can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. scikit-learn can also be built from the source code on Windows. Building requires a C/C++ compiler such as MinGW (http://www.mingw.org/), NumPy, SciPy, and Setuptools. 
To build, clone the Git repository from https://github.com/scikit-learn/scikit-learn and execute the following command: python setup.py install Installing scikit-learn on Linux There are several options to install scikit-learn on Linux, depending on your distribution. The preferred option to install scikit-learn on Linux is to use pip. You may also install it using a package manager, or build scikit-learn from its source. To install scikit-learn using pip, execute the following command: sudo pip install scikit-learn To build scikit-learn, clone the Git repository from https://github.com/scikit-learn/scikit-learn. Then install the following dependencies: sudo apt-get install python-dev python-numpy python-numpy-dev python-setuptools python-numpy-dev python-scipy libatlas-dev g++ Navigate to the repository's directory and execute the following command: python setup.py install Installing scikit-learn on OS X scikit-learn can be installed on OS X using Macports: sudo port install py26-sklearn If Python 2.7 is installed, run the following command: sudo port install py27-sklearn scikit-learn can also be installed using pip with the following command: pip install scikit-learn Verifying the installation To verify that scikit-learn has been installed correctly, open a Python console and execute the following: >>> import sklearn >>> sklearn.__version__ '0.15.1' To run scikit-learn's unit tests, first install the nose library. Then execute the following: nosetest sklearn –exe Congratulations! You've successfully installed scikit-learn. Summary In this article, we had a brief introduction of Scikit. We also covered the installation of Scikit on various operating system Windows, Linux, OS X. You can also refer the following books on the similar topics: scikit-learn Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/scikit-learn-cookbook) Learning scikit-learn: Machine Learning in Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-scikit-learn-machine-learning-python) Resources for Article: Further resources on this subject: Our First Machine Learning Method – Linear Classification[article] Specialized Machine Learning Topics[article] Machine Learning[article]
Deep learning in R

Packt, 15 Feb 2016
As the title suggests, in this article, we will be taking a look at some of the deep learning models in R. Some of the pioneering advancements in neural networks research in the last decade have opened up a new frontier in machine learning that is generally called by the name deep learning. The general definition of deep learning is, a class of machine learning techniques, where many layers of information processing stages in hierarchical supervised architectures are exploited for unsupervised feature learning and for pattern analysis/classification. The essence of deep learning is to compute hierarchical features or representations of the observational data, where the higher-level features or factors are defined from lower-level ones. Although there are many similar definitions and architectures for deep learning, two common elements in all of them are: multiple layers of nonlinear information processing and supervised or unsupervised learning of feature representations at each layer from the features learned at the previous layer. The initial works on deep learning were based on multilayer neural network models. Recently, many other forms of models are also used such as deep kernel machines and deep Q-networks. Researchers have experimented with multilayer neural networks even in previous decades. However, two reasons were limiting any progress with learning using such architectures. The first reason is that the learning of parameters of the network is a nonconvex optimization problem and often one gets stuck at poor local minima's starting from random initial conditions. The second reason is that the associated computational requirements were huge. A breakthrough for the first problem came when Geoffrey Hinton developed a fast algorithm for learning a special class of neural networks called deep belief nets (DBN). We will describe DBNs in more detail in the later sections. The high computational power requirements were met with the advancement in computing using general purpose graphical processing units (GPGPUs). What made deep learning so popular for practical applications is the significant improvement in accuracy achieved in automatic speech recognition and computer vision. For example, the word error rate in automatic speech recognition of a switchboard conversational speech had reached a saturation of around 40% after years of research. However, using deep learning, the word error rate was reduced dramatically to close to 10% in a matter of a few years. Another well-known example is how deep convolution neural network achieved the least error rate of 15.3% in the 2012 ImageNet Large Scale Visual Recognition Challenge compared to state-of-the-art methods that gave 26.2% as the least error rate. In this article, we will describe one class of deep learning models called deep belief networks. Interested readers are requested to read the book by Li Deng and Dong Yu for a detailed understanding of various methods and applications of deep learning. We will also illustrate the use of DBN with the R package darch. Restricted Boltzmann machines A restricted Boltzmann machine (RBM) is a two-layer network (bi-partite graph), in which one layer is a visible layer (v) and the second layer is a hidden layer (h). 
All nodes in the visible layer and all nodes in the hidden layer are connected by undirected edges, and there are no connections between nodes in the same layer.

An RBM is characterized by the joint distribution of the states of all visible units v = {v1, v2, ..., vM} and the states of all hidden units h = {h1, h2, ..., hN}, given by:

    P(v, h | θ) = exp(-E(v, h | θ)) / Z

Here, E(v, h | θ) is called the energy function, and Z = Σ_v Σ_h exp(-E(v, h | θ)) is the normalization constant, known as the partition function in Statistical Physics nomenclature.

There are mainly two types of RBMs. In the first one, both v and h are Bernoulli random variables. In the second type, h is a Bernoulli random variable whereas v is a Gaussian random variable. For the Bernoulli RBM, the energy function is given by:

    E(v, h | θ) = -Σ_{i=1..M} Σ_{j=1..N} W_ij v_i h_j - Σ_{i=1..M} b_i v_i - Σ_{j=1..N} a_j h_j

Here, W_ij represents the weight of the edge between nodes v_i and h_j; b_i and a_j are bias parameters for the visible and hidden layers, respectively. For this energy function, the exact expressions for the conditional probabilities can be derived as follows:

    P(h_j = 1 | v) = σ(Σ_{i=1..M} W_ij v_i + a_j)
    P(v_i = 1 | h) = σ(Σ_{j=1..N} W_ij h_j + b_i)

Here, σ(x) is the logistic function 1/(1 + exp(-x)). If the input variables are continuous, one can use the Gaussian RBM; its energy function is given by:

    E(v, h | θ) = -Σ_{i=1..M} Σ_{j=1..N} W_ij v_i h_j + (1/2) Σ_{i=1..M} (v_i - b_i)^2 - Σ_{j=1..N} a_j h_j

Also, in this case, the conditional probabilities of v_i and h_j become:

    P(h_j = 1 | v) = σ(Σ_{i=1..M} W_ij v_i + a_j)
    p(v_i | h) = N(Σ_{j=1..N} W_ij h_j + b_i, 1)

The latter is a normal distribution with mean Σ_{j=1..N} W_ij h_j + b_i and variance 1.

Now that we have described the basic architecture of an RBM, how is it trained? If we try the standard approach of taking the gradient of the log-likelihood, we get the following update rule:

    ΔW_ij ∝ E_data[v_i h_j] - E_model[v_i h_j]

Here, E_data[v_i h_j] is the expectation of v_i h_j computed using the dataset, and E_model[v_i h_j] is the same expectation computed using the model. However, one cannot use this exact expression for updating weights because E_model[v_i h_j] is difficult to compute. The first breakthrough to solve this problem, and hence to train deep neural networks, came when Hinton and his team proposed an algorithm called Contrastive Divergence (CD). The essence of the algorithm is described in the next paragraph.

The idea is to approximate E_model[v_i h_j] by using values of v_i and h_j generated using Gibbs sampling from the conditional distributions mentioned previously. One scheme of doing this is as follows:

Initialize v(0) from the dataset.
Find h(0) by sampling from the conditional distribution h(0) ~ p(h | v(0)).
Find v(1) by sampling from the conditional distribution v(1) ~ p(v | h(0)).
Find h(1) by sampling from the conditional distribution h(1) ~ p(h | v(1)).
Once we have v(1) and h(1), use v_i(1) h_j(1), the product of the ith component of v(1) and the jth component of h(1), as an approximation for E_model[v_i h_j].

This is called the CD-1 algorithm. One can generalize this to use the values from the kth step of Gibbs sampling, which is known as the CD-k algorithm. One can easily see the connection between RBMs and Bayesian inference. Since the CD algorithm works like a posterior density estimate, one could say that RBMs are trained using a Bayesian inference approach. Although the Contrastive Divergence algorithm looks simple, one needs to be very careful when training RBMs, otherwise the model can overfit. Readers who are interested in using RBMs in practical applications should refer to the technical report where this is discussed in detail.

Deep belief networks
One can stack several RBMs, one on top of the other, such that the values of the hidden units in layer n-1 (h_i,n-1) become the values of the visible units in the nth layer (v_i,n), and so on. The resulting network is called a deep belief network. Before looking at how DBNs are used for pretraining, a small, self-contained R sketch of the CD-1 update described above is given below.
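The following is a minimal illustration of one CD-1 update for a small Bernoulli RBM. It is a toy sketch written for this article, not the internal implementation of any package; the sizes, the learning rate, and the use of hidden probabilities (rather than samples) in the final update are all assumptions made for brevity:

# One CD-1 update for a toy Bernoulli RBM (illustration only)
sigmoid <- function(x) 1 / (1 + exp(-x))
set.seed(7)
V   <- matrix(rbinom(6 * 4, 1, 0.5), nrow = 6)    # 6 training vectors, 4 visible units
W   <- matrix(rnorm(4 * 3, sd = 0.1), nrow = 4)   # weights: 4 visible x 3 hidden
a   <- rep(0, 3)                                  # hidden biases
b   <- rep(0, 4)                                  # visible biases
eps <- 0.1                                        # learning rate

ph0 <- sigmoid(sweep(V %*% W, 2, a, "+"))                      # p(h | v(0))
h0  <- matrix(rbinom(length(ph0), 1, ph0), nrow = nrow(ph0))   # sample h(0)
pv1 <- sigmoid(sweep(h0 %*% t(W), 2, b, "+"))                  # p(v | h(0))
v1  <- matrix(rbinom(length(pv1), 1, pv1), nrow = nrow(pv1))   # sample v(1)
ph1 <- sigmoid(sweep(v1 %*% W, 2, a, "+"))                     # p(h | v(1))

# approximate gradient: data expectation minus the CD-1 reconstruction term
W <- W + eps * (t(V) %*% ph0 - t(v1) %*% ph1) / nrow(V)

Returning to the deep belief network defined above: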
It was one of the main architectures used in early deep learning networks for pretraining. The idea of pretraining a NN is the following: in the standard three-layer (input-hidden-output) NN, one can start with random initial values for the weights and using the backpropagation algorithm can find a good minimum of the log-likelihood function. However, when the number of layers increases, the straightforward application of backpropagation does not work because starting from output layer, as we compute the gradient values for the layers deep inside, their magnitude becomes very small. This is called the gradient vanishing problem. As a result, the network will get trapped in some poor local minima. Backpropagation still works if we are starting from the neighborhood of a good minimum. To achieve this, a DNN is often pretrained in an unsupervised way using a DBN. Instead of starting from random values of weights, first train a DBN in an unsupervised way and use weights from the DBN as initial weights for a corresponding supervised DNN. It was seen that such DNNs pretrained using DBNs perform much better. The layer-wise pretraining of a DBN proceeds as follows. Start with the first RBM and train it using input data in the visible layer and the CD algorithm (or its latest better variants). Then, stack a second RBM on top of this. For this RBM, use values sample from as the values for the visible layer. Continue this process for the desired number of layers. The outputs of hidden units from the top layer can also be used as inputs for training a supervised model. For this, add a conventional NN layer at the top of DBN with the desired number of classes as the number of output nodes. Input for this NN would be the output from the top layer of DBN. This is called DBN-DNN architecture. Here, a DBN's role is generating highly efficient features (the output of the top layer of DBN) automatically from the input data for the supervised NN in the top layer. The architecture of a five-layer DBN-DNN for a binary classification task is shown in the following figure: The last layer is trained using the backpropagation algorithm in a supervised manner for the two classes c1 and c2 . We will illustrate the training and classification with such a DBN-DNN using the darch R package. The darch R package The darch package, written by Martin Drees, is one of the R packages by which one can begin doing deep learning in R. It implements the DBN described in the previous section. The package can be downloaded from https://cran.r-project.org/web/packages/darch/index.html. The main class in the darch package implements deep architectures and provides the ability to train them with Contrastive Divergence and fine-tune with backpropagation, resilient backpropagation, and conjugate gradients. The new instances of the class are created with the newDArch constructor. It is called with the following arguments: a vector containing the number of nodes in each layers, the batch size, a Boolean variable to indicate whether to use the ff package for computing weights and outputs, and the name of the function for generating the weight matrices. Let us create a network having two input units, four hidden units, and one output unit: install.packages("darch") #one time >library(darch) >darch ← newDArch(c(2,4,1),batchSize = 2,genWeightFunc = generateWeights) INFO [2015-07-19 18:50:29] Constructing a darch with 3 layers. INFO [2015-07-19 18:50:29] Generating RBMs. 
INFO [2015-07-19 18:50:29] Construct new RBM instance with 2 visible and 4 hidden units. INFO [2015-07-19 18:50:29] Construct new RBM instance with 4 visible and 1 hidden units. Let us train the DBN with a toy dataset. We are using this because for training any realistic examples, it would take a long time, hours if not days. Let us create an input data set containing two columns and four rows: >inputs ← matrix(c(0,0,0,1,1,0,1,1),ncol=2,byrow=TRUE) >outputs ← matrix(c(0,1,1,0),nrow=4) Now, let us pretrain the DBN using the input data: >darch ← preTrainDArch(darch,inputs,maxEpoch=1000) We can have a look at the weights learned at any layer using the getLayerWeights( ) function. Let us see how the hidden layer looks like: >getLayerWeights(darch,index=1) [[1]] [,1] [,2] [,3] [,4] [1,] 8.167022 0.4874743 -7.563470 -6.951426 [2,] 2.024671 -10.7012389 1.313231 1.070006 [3,] -5.391781 5.5878931 3.254914 3.000914 Now, let's do a backpropagation for supervised learning. For this, we need to first set the layer functions to sigmoidUnitDerivatives: >layers ← getLayers(darch) >for(i in length(layers):1){ layers[[i]][[2]] ← sigmoidUnitDerivative } >setLayers(darch) ← layers >rm(layers) Finally, the following two lines perform the backpropagation: >setFineTuneFunction(darch) ← backpropagation >darch ← fineTuneDArch(darch,inputs,outputs,maxEpoch=1000) We can see the prediction quality of DBN on the training data itself by running darch as follows: >darch ← getExecuteFunction(darch)(darch,inputs) >outputs_darch ← getExecOutputs(darch) >outputs_darch[[2]] [,1] [1,] 9.998474e-01 [2,] 4.921130e-05 [3,] 9.997649e-01 [4,] 3.796699e-05 Comparing with the actual output, DBN has predicted the wrong output for the first and second input rows. Since this example was just to illustrate how to use the darch package, we are not worried about the 50% accuracy here. Other deep learning packages in R Although there are some other deep learning packages in R such as deepnet and RcppDL, compared with libraries in other languages such as Cuda (C++) and Theano (Python), R yet does not have good native libraries for deep learning. The only available package is a wrapper for the Java-based deep learning open source project H2O. This R package, h20, allows running H2O via its REST API from within R. Readers who are interested in serious deep learning projects and applications should use H2O using h2o packages in R. One needs to install H2O in your machine to use h2o. Summary We have learned one of the latest advances in neural networks that is called deep learning. It can be used to solve many problems such as computer vision and natural language processing that involves highly cognitive elements. The artificial intelligent systems using deep learning were able to achieve accuracies comparable to human intelligence in tasks such as speech recognition and image classification. To know more about Bayesian modeling in R, check out Learning Bayesian Models with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-bayesian-models-r). You can also check out our other R books, Data Analysis with R (https://www.packtpub.com/big-data-and-business-intelligence/data-analysis-r), and Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition). Resources for Article: Further resources on this subject: Working with Data – Exploratory Data Analysis [article] Big Data Analytics [article] Deep learning in R [article]
Recommending Movies at Scale (Python)

Packt, 15 Feb 2016
In this article, we will cover the following recipes: Modeling preference expressions Understanding the data Ingesting the movie review data Finding the highest-scoring movies Improving the movie-rating system Measuring the distance between users in the preference space Computing the correlation between users Finding the best critic for a user Predicting movie ratings for users Collaboratively filtering item by item Building a nonnegative matrix factorization model Loading the entire dataset into the memory Dumping the SVD-based model to the disk Training the SVD-based model Testing the SVD-based model (For more resources related to this topic, see here.) Introduction From books to movies to people to follow on Twitter, recommender systems carve the deluge of information on the Internet into a more personalized flow, thus improving the performance of e-commerce, web, and social applications. It is no great surprise, given the success of Amazon-monetizing recommendations and the Netflix Prize, that any discussion of personalization or data-theoretic prediction would involve a recommender. What is surprising is how simple recommenders are to implement yet how susceptible they are to vagaries of sparse data and overfitting. Consider a non-algorithmic approach to eliciting recommendations; one of the easiest ways to garner a recommendation is to look at the preferences of someone we trust. We are implicitly comparing our preferences to theirs, and the more similarities you share, the more likely you are to discover novel, shared preferences. However, everyone is unique, and our preferences exist across a variety of categories and domains. What if you could leverage the preferences of a great number of people and not just those you trust? In the aggregate, you would be able to see patterns, not just of people like you, but also "anti-recommendations"— things to stay away from, cautioned by the people not like you. You would, hopefully, also see subtle delineations across the shared preference space of groups of people who share parts of your own unique experience. It is this basic premise that a group of techniques called "collaborative filtering" use to make recommendations. Simply stated, this premise can be boiled down to the assumption that those who have similar past preferences will share the same preferences in the future. This is from a human perspective, of course, and a typical corollary to this assumption is from the perspective of the things being preferred—sets of items that are preferred by the same people will be more likely to preferred together in the future—and this is the basis for what is commonly described in the literature as user-centric collaborative filtering versus item-centric collaborative filtering. The term collaborative filtering was coined by David Goldberg in a paper titled Using collaborative filtering to weave an information tapestry, ACM, where he proposed a system called Tapestry, which was designed at Xerox PARC in 1992, to annotate documents as interesting or uninteresting and to give document recommendations to people who are searching for good reads. Collaborative filtering algorithms search large groupings of preference expressions to find similarities to some input preference or preferences. The output from these algorithms is a ranked list of suggestions that is a subset of all possible preferences, and hence, it's called "filtering". The "collaborative" comes from the use of many other peoples' preferences in order to find suggestions for themselves. 
This can be seen either as a search of the space of preferences (for brute-force techniques), a clustering problem (grouping similarly preferred items), or even some other predictive model. Many algorithmic attempts have been created in order to optimize or solve this problem across sparse or large datasets, and we will discuss a few of them in this article. The goals of this article are: Understanding how to model preferences from a variety of sources Learning how to compute similarities using distance metrics Modeling recommendations using matrix factorization for star ratings These two different models will be implemented in Python using readily available datasets on the Web. To demonstrate the techniques in this article, we will use the oft-cited MovieLens database from the University of Minnesota that contains star ratings of moviegoers for their preferred movies. Modeling preference expressions We have already pointed out that companies such as Amazon track purchases and page views to make recommendations, Goodreads and Yelp use 5 star ratings and text reviews, and sites such as Reddit or Stack Overflow use simple up/down voting. You can see that preference can be expressed in the data in different ways, from Boolean flags to voting to ratings. However, these preferences are expressed by attempting to find groups of similarities in preference expressions in which you are leveraging the core assumption of collaborative filtering. More formally, we understand that two people, Bob and Alice, share a preference for a specific item or widget. If Alice too has a preference for a different item, say, sprocket, then Bob has a better than random chance of also sharing a preference for a sprocket. We believe that Bob and Alice's taste similarities can be expressed in an aggregate via a large number of preferences, and by leveraging the collaborative nature of groups, we can filter the world of products. How to do it… We will model preference expressions over the next few recipes, including: Understanding the data Ingesting the movie review data Finding the highest rated movies Improving the movie-rating system How it works… A preference expression is an instance of a model of demonstrable relative selection. That is to say, preference expressions are data points that are used to show subjective ranking between a group of items for a person. Even more formally, we should say that preference expressions are not simply relative, but also temporal—for example, the statement of preference also has a fixed time relativity as well as item relativity. Preference expression is an instance of a model of demonstrable relative selection. While it would be nice to think that we can subjectively and accurately express our preferences in a global context (for example, rate a movie as compared to all other movies), our tastes, in fact, change over time, and we can really only consider how we rank items relative to each other. Models of preference must take this into account and attempt to alleviate biases that are caused by it. The most common types of preference expression models simplify the problem of ranking by causing the expression to be numerically fuzzy, for example: Boolean expressions (yes or no) Up and down voting (such as abstain, dislike) Weighted signaling (the number of clicks or actions) Broad ranked classification (stars, hated or loved) The idea is to create a preference model for an individual user—a numerical model of the set of preference expressions for a particular individual. 
Models build the individual preference expressions into a useful user-specific context that can be computed against. Further reasoning can be performed on the models in order to alleviate time-based biases or to perform ontological reasoning or other categorizations. As the relationships between entities get more complex, you can express their relative preferences by assigning behavioral weights to each type of semantic connection. However, choosing the weights is difficult and requires research, which is why fuzzy generalizations are preferred. As an example, the following table shows you some well-known ranking preference systems:

Reddit Voting          Online Shopping          Star Reviews
Up Vote        1       Bought          2        Love       5
No Vote        0       Viewed          1        Liked      4
Down Vote     -1       No purchase     0        Neutral    3
                                                Dislike    2
                                                Hate       1

For the rest of this article, we will only consider a single, very common preference expression: star ratings on a scale of 1 to 5.

Understanding the data
Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine.

Getting ready
To prepare for this recipe, and the rest of the article, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/. In this article, we will use the smaller MovieLens 100k dataset (4.7 MB in size) in order to load the entire model into the memory with ease.

How to do it…
Perform the following steps to better understand the data that we will be working with throughout this article:

Download the data from http://grouplens.org/datasets/movielens/. The 100K dataset is the one that you want (ml-100k.zip).
Unzip the downloaded data into the directory of your choice. The two files that we are mainly concerned with are u.data, which contains the user movie ratings, and u.item, which contains movie information and details.
To get a sense of each file, use the head command at the command prompt for Mac and Linux or the more command for Windows: head -n 5 u.item Note that if you are working on a computer running the Microsoft Windows operating system and not using a virtual machine (not recommended), you do not have access to the head command; instead, use the following command: more u.item 2 n The preceding command gives you the following output: 1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0 |0 2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0 3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1| 0|0 4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0| 0|0 5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0 The following command will produce the given output: head -n 5 u.data For Windows, you can use the following command: more u.item 2 n 196 242 3 881250949 186 302 3 891717742 22 377 1 878887116 244 51 2 880606923 166 346 1 886397596 How it works… The two main files that we will be using are as follows: u.data: This contains the user moving ratings u.item: This contains the movie information and other details Both are character-delimited files; u.data, which is the main file, is tab delimited, and u.item is pipe delimited. For u.data, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item file contains much more information, including the ID, title, release date, and even a URL to IMDB. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western. There's more… Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the movie lens dataset is a very popular choice for such a task but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a book-crossing dataset that contains over a million ratings of about a quarter million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/). Ingesting the movie review data Recommendation engines require large amounts of training data in order to do a good job, which is why they're often relegated to big data projects. However, to build a recommendation engine, we must first get the required data into memory and, due to the size of the data, must do so in a memory-safe and efficient way. Luckily, Python has all of the tools to get the job done, and this recipe shows you how. Getting ready You will need to have the appropriate movie lens dataset downloaded, as specified in the preceding recipe. Ensure that you have NumPy correctly installed. 
How to do it… The following steps guide you through the creation of the functions that we will need in order to load the datasets into the memory: Open your favorite Python editor or IDE. There is a lot of code, so it should be far simpler to enter directly into a text file than Read Eval Print Loop (REPL). We create a function to import the movie reviews: import csv import datetime def load_reviews(path, **kwargs): """ Loads MovieLens reviews """ options = { 'fieldnames': ('userid', 'movieid', 'rating', 'timestamp'), 'delimiter': 't', } options.update(kwargs) parse_date = lambda r,k: datetime.fromtimestamp(float(r[k])) parse_int = lambda r,k: int(r[k]) with open(path, 'rb') as reviews: reader = csv.DictReader(reviews, **options) for row in reader: row['userid'] = parse_int(row, 'userid') row['movieid'] = parse_int(row, 'movieid') row['rating'] = parse_int(row, 'rating') row['timestamp'] = parse_date(row, 'timestamp') yield row We create a helper function to help import the data: import os def relative_path(path): """ Returns a path relative from this code file """ dirname = os.path.dirname(os.path.realpath('__file__')) path = os.path.join(dirname, path) return os.path.normpath(path)  We create another function to load the movie information: def load_movies(path, **kwargs): """ Loads MovieLens movies """ options = { 'fieldnames': ('movieid', 'title', 'release', 'video', 'url'),'delimiter': '|','restkey': 'genre', } options.update(kwargs) parse_int = lambda r,k: int(r[k]) parse_date = lambda r,k: datetime.strptime(r[k], '%d-%b- %Y') if r[k] else None with open(path, 'rb') as movies: reader = csv.DictReader(movies, **options) for row in reader: row['movieid'] = parse_int(row, 'movieid') row['release'] = parse_date(row, 'release') row['video'] = parse_date(row, 'video')             yield row Finally, we start creating a MovieLens class that will be augmented in later recipes: from collections import defaultdict class MovieLens(object): """ Data structure to build our recommender model on. """ def __init__(self, udata, uitem): """ Instantiate with a path to u.data and u.item """ self.udata = udata self.uitem = uitem self.movies = {} self.reviews = defaultdict(dict) self.load_dataset() def load_dataset(self): """ Loads the two datasets into memory, indexed on the ID. """ for movie in load_movies(self.uitem): self.movies[movie['movieid']] = movie for review in load_reviews(self.udata): self.reviews[review['userid']][review['movieid']] = review  Ensure that the functions have been imported into your REPL or the IPython workspace, and type the following, making sure that the path to the data files is appropriate for your system: data = relative_path('data/ml-100k/u.data') item = relative_path('data/ml-100k/u.item') model = MovieLens(data, item) How it works… The methodology that we use for the two data-loading functions (load_reviews and load_movies) is simple, but it takes care of the details of parsing the data from the disk. We created a function that takes a path to our dataset and then any optional keywords. We know that we have specific ways in which we need to interact with the csv module, so we create default options, passing in the field names of the rows along with the delimiter, which is t. The options.update(kwargs) line means that we'll accept whatever users pass to this function. We then created internal parsing functions using a lambda function in Python. These simple parsers take a row and a key as input and return the converted input. 
This is an example of using lambda as internal, reusable code blocks and is a common technique in Python. Finally, we open our file and create a csv.DictReader function with our options. Iterating through the rows in the reader, we parse the fields that we want to be int and datetime, respectively, and then yield the row. Note that as we are unsure about the actual size of the input file, we are doing this in a memory-safe manner using Python generators. Using yield instead of return ensures that Python creates a generator under the hood and does not load the entire dataset into the memory. We'll use each of these methodologies to load the datasets at various times through our computation that uses this dataset. We'll need to know where these files are at all times, which can be a pain, especially in larger code bases; in the There's more… section, we'll discuss a Python pro-tip to alleviate this concern. Finally, we created a data structure, which is the MovieLens class, with which we can hold our reviews' data. This structure takes the udata and uitem paths, and then, it loads the movies and reviews into two Python dictionaries that are indexed by movieid and userid, respectively. To instantiate this object, you will execute something as follows: data = relative_path('../data/ml-100k/u.data') item = relative_path('../data/ml-100k/u.item') model = MovieLens(data, item) Note that the preceding commands assume that you have your data in a folder called data. We can now load the whole dataset into the memory, indexed on the various IDs specified in the dataset. Did you notice the use of the relative_path function? When dealing with fixtures such as these to build models, the data is often included with the code. When you specify a path in Python, such as data/ml-100k/u.data, it looks it up relative to the current working directory where you ran the script. To help ease this trouble, you can specify the paths that are relative to the code itself: import os def relative_path(path): """ Returns a path relative from this code file """ dirname = os.path.dirname(os.path.realpath('__file__')) path = os.path.join(dirname, path) return os.path.normpath(path) Keep in mind that this holds the entire data structure in memory; in the case of the 100k dataset, this will require 54.1 MB, which isn't too bad for modern machines. However, we should also keep in mind that we'll generally build recommenders using far more than just 100,000 reviews. This is why we have configured the data structure the way we have—very similar to a database. To grow the system, you will replace the reviews and movies properties with database access functions or properties, which will yield data types expected by our methods. Finding the highest-scoring movies If you're looking for a good movie, you'll often want to see the most popular or best rated movies overall. Initially, we'll take a naïve approach to compute a movie's aggregate rating by averaging the user reviews for each movie. This technique will also demonstrate how to access the data in our MovieLens class. Getting ready These recipes are sequential in nature. Thus, you should have completed the previous recipes in the article before starting with this one. How to do it… Follow these steps to output numeric scores for all movies in the dataset and compute a top-10 list: Augment the MovieLens class with a new method to get all reviews for a particular movie: class MovieLens(object): ... 
def reviews_for_movie(self, movieid): """ Yields the reviews for a given movie """ for review in self.reviews.values(): if movieid in review: yield review[movieid] Then, add an additional method to compute the top 10 movies reviewed by users: import heapq from operator import itemgetter class MovieLens(object): ... def average_reviews(self): """ Averages the star rating for all movies. Yields a tuple of movieid, the average rating, and the number of reviews. """ for movieid in self.movies: reviews = list(r['rating'] for r in self.reviews_for_movie(movieid)) average = sum(reviews) / float(len(reviews)) yield (movieid, average, len(reviews)) def top_rated(self, n=10): """ Yields the n top rated movies """ return heapq.nlargest(n, self.average_reviews(), key=itemgetter(1)) Note that the … notation just below class MovieLens(object): signifies that we will be appending the average_reviews method to the existing MovieLens class. Now, let's print the top-rated results: for mid, avg, num in model.top_rated(10): title = model.movies[mid]['title'] print "[%0.3f average rating (%i reviews)] %s" % (avg, num,title) Executing the preceding commands in your REPL should produce the following output: [5.000 average rating (1 reviews)] Entertaining Angels: The Dorothy Day Story (1996) [5.000 average rating (2 reviews)] Santa with Muscles (1996) [5.000 average rating (1 reviews)] Great Day in Harlem, A (1994) [5.000 average rating (1 reviews)] They Made Me a Criminal (1939) [5.000 average rating (1 reviews)] Aiqing wansui (1994) [5.000 average rating (1 reviews)] Someone Else's America (1995) [5.000 average rating (2 reviews)] Saint of Fort Washington, The (1993) [5.000 average rating (3 reviews)] Prefontaine (1997) [5.000 average rating (3 reviews)] Star Kid (1997) [5.000 average rating (1 reviews)] Marlene Dietrich: Shadow and Light (1996) How it works… The new reviews_for_movie() method that is added to the MovieLens class iterates through our review dictionary values (which are indexed by the userid parameter), checks whether the movieid value has been reviewed by the user, and then presents that review dictionary. We will need such functionality for the next method. With the average_review() method, we have created another generator function that goes through all of our movies and all of their reviews and presents the movie ID, the average rating, and the number of reviews. The top_rated function uses the heapq module to quickly sort the reviews based on the average. The heapq data structure, also known as the priority queue algorithm, is the Python implementation of an abstract data structure with interesting and useful properties. Heaps are binary trees that are built so that every parent node has a value that is either less than or equal to any of its children nodes. Thus, the smallest element is the root of the tree, which can be accessed in constant time, which is a very desirable property. With heapq, Python developers have an efficient means to insert new values in an ordered data structure and also return sorted values. There's more… Here, we run into our first problem—some of the top-rated movies only have one review (and conversely, so do the worst-rated movies). How do you compare Casablanca, which has a 4.457 average rating (243 reviews), with Santa with Muscles, which has a 5.000 average rating (2 reviews)? We are sure that those two reviewers really liked Santa with Muscles, but the high rating for Casablanca is probably more meaningful because more people liked it. 
Most recommenders with star ratings will simply output the average rating along with the number of reviewers, allowing the user to determine their quality; however, as data scientists, we can do better in the next recipe.

See also

The heapq documentation available at https://docs.python.org/2/library/heapq.html

Improving the movie-rating system

We don't want to build a recommendation engine with a system that considers the likely straight-to-DVD Santa with Muscles as generally superior to Casablanca. Thus, the naïve scoring approach used previously must be improved upon and is the focus of this recipe.

Getting ready

Make sure that you have completed the previous recipes in this article first.

How to do it…

The following steps implement and test a new movie-scoring algorithm:

Let's implement a new Bayesian movie-scoring algorithm as shown in the following function, adding it to the MovieLens class:

def bayesian_average(self, c=59, m=3):
    """
    Reports the Bayesian average with parameters c and m.
    """
    for movieid in self.movies:
        reviews = list(r['rating'] for r in self.reviews_for_movie(movieid))
        average = ((c * m) + sum(reviews)) / float(c + len(reviews))
        yield (movieid, average, len(reviews))

Next, we will replace the top_rated method in the MovieLens class with the version in the following commands that uses the new bayesian_average method from the preceding step:

def top_rated(self, n=10):
    """
    Yields the n top rated movies
    """
    return heapq.nlargest(n, self.bayesian_average(), key=itemgetter(1))

Printing our new top-10 list looks a bit more familiar to us and Casablanca is now happily rated number 4:

[4.234 average rating (583 reviews)] Star Wars (1977)
[4.224 average rating (298 reviews)] Schindler's List (1993)
[4.196 average rating (283 reviews)] Shawshank Redemption, The (1994)
[4.172 average rating (243 reviews)] Casablanca (1942)
[4.135 average rating (267 reviews)] Usual Suspects, The (1995)
[4.123 average rating (413 reviews)] Godfather, The (1972)
[4.120 average rating (390 reviews)] Silence of the Lambs, The (1991)
[4.098 average rating (420 reviews)] Raiders of the Lost Ark (1981)
[4.082 average rating (209 reviews)] Rear Window (1954)
[4.066 average rating (350 reviews)] Titanic (1997)

How it works…

Taking the average of movie reviews, as shown in the previous recipe, simply did not work because some movies did not have enough ratings to give a meaningful comparison to movies with more ratings. What we'd really like is to have every single movie critic rate every single movie. Given that this is impossible, we could derive an estimate for how the movie would be rated if an infinite number of people rated the movie; this is hard to infer from one data point, so we should say that we would like to estimate the movie rating if the same number of people gave it a rating on an average (for example, filtering our results based on the number of reviews). This estimate can be computed with a Bayesian average, implemented in the bayesian_average() function, to infer these ratings based on the following equation:

bayesian average = (C * m + sum of the N observed ratings) / (C + N)

Here, m is our prior for the average of stars, and C is a confidence parameter that is equivalent to the number of observations in our posterior. Determining priors can be a complicated and magical art. Rather than taking the complex path of fitting a Dirichlet distribution to our data, we can simply choose an m prior of 3 with our 5-star rating system, which means that our prior assumes that star ratings tend to be reviewed around the median value.
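To make the effect of the prior concrete, the following small sketch (independent of the MovieLens class) applies the same formula to the two movies discussed earlier, using the defaults of m=3 and C=59:

def bayesian_avg(ratings_sum, num_reviews, c=59, m=3.0):
    # Same calculation as the bayesian_average method, on plain numbers.
    return (c * m + ratings_sum) / float(c + num_reviews)

# Casablanca: 243 reviews averaging 4.457 stars.
print("%0.3f" % bayesian_avg(4.457 * 243, 243))  # prints 4.172
# Santa with Muscles: 2 five-star reviews.
print("%0.3f" % bayesian_avg(5.0 * 2, 2))        # prints 3.066

The heavily reviewed movie barely moves (the 4.172 matches Casablanca's entry in the top-10 list above), while the two-review movie is pulled most of the way back toward the prior, which is exactly the correction we were after.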
In choosing C, you are expressing how many reviews are needed to get away from the prior; we can compute this by looking at the average number of reviews per movie:

print float(sum(num for mid, avg, num in model.average_reviews())) / len(model.movies)

This gives us an average number of 59.4, which we use as the default value in our function definition.

There's more…

Play around with the C parameter. You should find that if you change the parameter so that C = 50, the top-10 list subtly shifts; in this case, Schindler's List and Star Wars are swapped in rankings, as are Raiders of the Lost Ark and Rear Window. Note that in each swapped pair, one movie has far more reviews than the other, which means that the higher C parameter was pulling the less-reviewed movie further toward the prior.

See also

See how Yelp deals with this challenge at http://venturebeat.com/2009/10/12/how-yelp-deals-with-everybody-getting-four-stars-on-average/

Measuring the distance between users in the preference space

The two most recognizable types of collaborative filtering systems are user-based recommenders and item-based recommenders. If one were to imagine that the preference space is an N-dimensional feature space where either users or items are plotted, then we would say that similar users or items tend to cluster near each other in this preference space; hence, an alternative name for this type of collaborative filtering is nearest neighbor recommenders. A crucial step in this process is to come up with a similarity or distance metric with which we can compare critics to each other or mutually preferred items. This metric is then used to perform pairwise comparisons of a particular user to all other users, or conversely, for an item to be compared to all other items. Normalized comparisons are then used to determine recommendations. Although the computational space can become exceedingly large, distance metrics themselves are not difficult to compute, and in this recipe, we will explore a few as well as implement our first recommender system. In this recipe, we will measure the distance between users; in the recipe after this one, we will look at another similarity distance indicator.

Getting ready

We will continue to build on the MovieLens class from the section titled Modeling Preference. If you have not had the opportunity to review this section, please have the code for that class ready. Importantly, we will want to access the data structures, MovieLens.movies and MovieLens.reviews, that have been loaded from the CSV files on the disk.

How to do it…

The following set of steps provides instructions on how to compute the Euclidean distance between users:

Augment the MovieLens class with a new method, shared_preferences, to pull out movies that have been rated by two critics, A and B:

class MovieLens(object):
    ...
def shared_preferences(self, criticA, criticB): """ Returns the intersection of ratings for two critics """ if criticA not in self.reviews: raise KeyError("Couldn't find critic '%s' in data" % criticA) if criticB not in self.reviews: raise KeyError("Couldn't find critic '%s' in data" % criticB) moviesA = set(self.reviews[criticA].keys()) moviesB = set(self.reviews[criticB].keys()) shared = moviesA & moviesB # Intersection operator # Create a reviews dictionary to return reviews = {} for movieid in shared: reviews[movieid] = ( self.reviews[criticA][movieid]['rating'], self.reviews[criticB][movieid]['rating'], ) return reviews Then, implement a function that computes the Euclidean distance between two critics using their shared movie preferences as a vector for the computation. This method will also be part of the MovieLens class: from math import sqrt ... def euclidean_distance(self, criticA, criticB): """ Reports the Euclidean distance of two critics, A&B by performing a J-dimensional Euclidean calculation of each of their preference vectors for the intersection of movies the critics have rated. """ # Get the intersection of the rated titles in the data. preferences = self.shared_preferences(criticA, criticB) # If they have no rankings in common, return 0. if len(preferences) == 0: return 0 # Sum the squares of the differences sum_of_squares = sum([pow(a-b, 2) for a, b in preferences.values()]) # Return the inverse of the distance to give a higher score to # folks who are more similar (e.g. less distance) add 1 to prevent # division by zero errors and normalize ranks in [0, 1] return 1 / (1 + sqrt(sum_of_squares)) With the preceding code implemented, test it in REPL: >>> data = relative_path('data/ml-100k/u.data') >>> item = relative_path('data/ml-100k/u.item') >>> model = MovieLens(data, item) >>> print model.euclidean_distance(232, 532) 0.1023021629920016 How it works… The new shared_preferences() method of the MovieLens class determines the shared preference space of two users. Critically, we can only compare users (the criticA and criticB input parameters) based on the things that they have both rated. This function uses Python sets to determine the list of movies that both A and B reviewed (the intersection of the movies A has rated and the movies B has rated). The function then iterates over this set, returning a dictionary whose keys are the movie IDs and the values are a tuple of ratings, for example, (ratingA, ratingB) for each movie that both users have rated. We can now use this dataset to compute similarity scores, which is done by the second function. The euclidean_distance() function takes two critics as the input, A and B, and computes the distance between users in preference space. Here, we have chosen to implement the Euclidean distance metric (the two-dimensional variation is well known to those who remember the Pythagorean theorem), but we could have implemented other metrics as well. This function will return a real number from 0 to 1, where 0 is less similar (farther apart) critics and 1 is more similar (closer together) critics. There's more… The Manhattan distance is another very popular metric and a very simple one to understand. It can simply sum the absolute values of the pairwise differences between elements of each vector. 
Or, in code, it can be executed in this manner: manhattan = sum([abs(a-b) for a, b in preferences.values()]) This metric is also called the city-block distance because, conceptually, it is as if you were counting the number of blocks north/south and east/west one would have to walk between two points in the city. Before implementing it for this recipe, you would also want to invert and normalize the value in some fashion to return a value in the [0, 1] range. See also The distance overview from Wikipedia available at http://en.wikipedia.org/wiki/Distance The Taxicab geometry from Wikipedia available at http://en.wikipedia.org/wiki/Taxicab_geometry Computing the correlation between users In the previous recipe, we used one out of many possible distance measures to capture the distance between the movie reviews of users. This distance between two specific users is not changed even if there are five or five million other users. In this recipe, we will compute the correlation between users in the preference space. Like distance metrics, there are many correlation metrics. The most popular of these are Pearson or Spearman correlations or Cosine distance. Unlike distance metrics, the correlation will change depending on the number of users and movies. Getting ready We will be continuing the efforts of the previous recipes again, so make sure you understand each one. How to do it… The following function implements the computation of the pearson_correlation function for two critics, which are criticA and criticB, and is added to the MovieLens class: def pearson_correlation(self, criticA, criticB): """ Returns the Pearson Correlation of two critics, A and B by performing the PPMC calculation on the scatter plot of (a, b) ratings on the shared set of critiqued titles. """ # Get the set of mutually rated items preferences = self.shared_preferences(criticA, criticB) # Store the length to save traversals of the len computation. # If they have no rankings in common, return 0. length = len(preferences) if length == 0: return 0 # Loop through the preferences of each critic once and compute the # various summations that are required for our final calculation. sumA = sumB = sumSquareA = sumSquareB = sumProducts = 0 for a, b in preferences.values(): sumA += a sumB += b sumSquareA += pow(a, 2) sumSquareB += pow(b, 2) sumProducts += a*b # Calculate Pearson Score numerator = (sumProducts*length) - (sumA*sumB) denominator = sqrt(((sumSquareA*length) - pow(sumA, 2)) * ((sumSquareB*length) - pow(sumB, 2))) # Prevent division by zero. if denominator == 0: return 0 return abs(numerator / denominator) How it works… The Pearson correlation computes the "product moment", which is the mean of the product of mean adjusted random variables and is defined as the covariance of two variables (a and b, in our case) divided by the product of the standard deviation of a and the standard deviation of b. As a formula, this looks like the following:   For a finite sample, which is what we have, the detailed formula, which was implemented in the preceding function, is as follows:   Another way to think about the Pearson correlation is as a measure of the linear dependence between two variables. It returns a score of -1 to 1, where negative scores closer to -1 indicate a stronger negative correlation, and positive scores closer to 1 indicate a stronger, positive correlation. A score of 0 means that the two variables are not correlated. 
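As a quick sanity check (this is not part of the recipe), the finite-sample formula used by pearson_correlation can be compared with NumPy's corrcoef on a pair of invented rating vectors; NumPy is assumed to be installed:

from math import sqrt
import numpy as np

a = [5, 4, 3, 5, 2]  # critic A's ratings on five shared movies (made up)
b = [4, 5, 2, 5, 1]  # critic B's ratings on the same movies (made up)

n = len(a)
sumA, sumB = sum(a), sum(b)
sumSquareA = sum(x * x for x in a)
sumSquareB = sum(x * x for x in b)
sumProducts = sum(x * y for x, y in zip(a, b))

# The same summations used in pearson_correlation
numerator = (sumProducts * n) - (sumA * sumB)
denominator = sqrt(((sumSquareA * n) - pow(sumA, 2)) * ((sumSquareB * n) - pow(sumB, 2)))

print("%0.4f" % (numerator / denominator))  # prints 0.8866
print("%0.4f" % np.corrcoef(a, b)[0, 1])    # prints the same value

Note that np.corrcoef returns the signed coefficient; the recipe's method additionally takes the absolute value of this quantity, for the reason explained next.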
In order for us to perform comparisons, we want to normalize our similarity metrics in the space of [0, 1] so that 0 means less similar and 1 means more similar, so we return the absolute value: >>> print model.pearson_correlation(232, 532) 0.06025793538385047 There's more… We have explored two distance metrics: the Euclidean distance and the Pearson correlation. There are many more, including the Spearman correlation, Tantimoto scores, Jaccard distance, Cosine similarity, and Manhattan distance, to name a few. Choosing the right distance metric for the dataset of your recommender along with the type of preference expression used is crucial to ensuring success in this style of recommender. It's up to the reader to explore this space further based on his or her interest and particular dataset. Finding the best critic for a user Now that we have two different ways to compute a similarity distance between users, we can determine the best critics for a particular user and see how similar they are to an individual's preferences. Getting ready Make sure that you have completed the previous recipes before tackling this one. How to do it… Implement a new method for the MovieLens class, similar_critics(), that locates the best match for a user: import heapq ... def similar_critics(self, user, metric='euclidean', n=None): """ Finds, ranks similar critics for the user according to the specified distance metric. Returns the top n similar critics if n is specified. """ # Metric jump table metrics = { 'euclidean': self.euclidean_distance, 'pearson': self.pearson_correlation, } distance = metrics.get(metric, None) # Handle problems that might occur if user not in self.reviews: raise KeyError("Unknown user, '%s'." % user) if not distance or not callable(distance): raise KeyError("Unknown or unprogrammed distance metric '%s'." % metric) # Compute user to critic similarities for all critics critics = {} for critic in self.reviews: # Don't compare against yourself! if critic == user: continue critics[critic] = distance(user, critic) if n: return heapq.nlargest(n, critics.items(), key=itemgetter(1)) return critics How it works… The similar_critics method, added to the MovieLens class, serves as the heart of this recipe. It takes as parameters the targeted user and two optional parameters: the metric to be used, which defaults to euclidean, and the number of results to be returned, which defaults to None. As you can see, this flexible method uses a jump table to determine what algorithm is to be used (you can pass in euclidean or pearson to choose the distance metric). Every other critic is compared to the current user (except a comparison of the user against themselves). The results are then sorted using the flexible heapq module and the top n results are returned. To test out our implementation, print out the results of the run for both similarity distances: >>> for item in model.similar_critics(232, 'euclidean', n=10): print "%4i: %0.3f" % item 688: 1.000 914: 1.000 47: 0.500 78: 0.500 170: 0.500 335: 0.500 341: 0.500 101: 0.414 155: 0.414 309: 0.414 >>> for item in model.similar_critics(232, 'pearson', n=10): print "%4i: %0.3f" % item 33: 1.000 36: 1.000 155: 1.000 260: 1.000 289: 1.000 302: 1.000 309: 1.000 317: 1.000 511: 1.000 769: 1.000 These scores are clearly very different, and it appears that Pearson thinks that there are much more similar users than the Euclidean distance metric. The Euclidean distance metric tends to favor users who have rated fewer items exactly the same. 
Pearson correlation favors more scores that fit well linearly, and therefore, Pearson corrects grade inflation where two critics might rate movies very similarly, but one user rates them consistently one star higher than the other. If you plot out how many shared rankings each critic has, you'll see that the data is very sparse. Here is the preceding data with the number of rankings appended: Euclidean scores: 688: 1.000 (1 shared rankings) 914: 1.000 (2 shared rankings) 47: 0.500 (5 shared rankings) 78: 0.500 (3 shared rankings) 170: 0.500 (1 shared rankings) Pearson scores: 33: 1.000 (2 shared rankings) 36: 1.000 (3 shared rankings) 155: 1.000 (2 shared rankings) 260: 1.000 (3 shared rankings) 289: 1.000 (3 shared rankings) Therefore, it is not enough to find similar critics and use their ratings to predict our users' scores; instead, we will have to aggregate the scores of all of the critics, regardless of similarity, and predict ratings for the movies we haven't rated. Predicting movie ratings for users To predict how we might rate a particular movie, we can compute a weighted average of critics who have also rated the same movies as the user. The weight will be the similarity of the critic to user—if a critic has not rated a movie, then their similarity will not contribute to the overall ranking of the movie. Getting ready Ensure that you have completed the previous recipes in this large, cumulative article. How to do it… The following steps walk you through the prediction of movie ratings for users: First, add the predict_ranking function to the MovieLens class in order to predict the ranking a user might give a particular movie with similar critics: def predict_ranking(self, user, movie, metric='euclidean', critics=None): """ Predicts the ranking a user might give a movie based on the weighted average of the critics similar to the that user. """ critics = critics or self.similar_critics(user, metric=metric) total = 0.0 simsum = 0.0 for critic, similarity in critics.items(): if movie in self.reviews[critic]: total += similarity * self.reviews[critic][movie]['rating'] simsum += similarity if simsum == 0.0: return 0.0 return total / simsum Next, add the predict_all_rankings method to the MovieLens class: def predict_all_rankings(self, user, metric='euclidean', n=None): """ Predicts all rankings for all movies, if n is specified returns the top n movies and their predicted ranking. """ critics = self.similar_critics(user, metric=metric) movies = { movie: self.predict_ranking(user, movie, metric, critics) for movie in self.movies } if n: return heapq.nlargest(n, movies.items(), key=itemgetter(1)) return movies How it works… The predict_ranking method takes a user and a movie along with a string specifying the distance metric and returns the predicted rating for that movie for that particular user. A fourth argument, critics, is meant to be an optimization for the predict_all_rankings method, which we'll discuss shortly. The prediction gathers all critics who are similar to the user and computes the weighted total rating of the critics, filtered by those who actually did rate the movie in question. The weights are simply their similarity to the user, computed by the distance metric. 
This total is then normalized by the sum of the similarities to move the rating back into the space of 1 to 5 stars: >>> print model.predict_ranking(422, 50, 'euclidean') 4.35413151722 >>> print model.predict_ranking(422, 50, 'pearson') 4.3566797826 Here, we can see the predictions for Star Wars (ID 50 in our MovieLens dataset) for the user 422. The Euclidean and Pearson computations are very close to each other (which isn't necessarily to be expected), but the prediction is also very close to the user's actual rating, which is 4. The predict_all_rankings method computes the ranking predictions for all movies for a particular user according to the passed-in metric. It optionally takes a value, n, to return the top n best matches. This function optimizes the similar critics' lookup by only executing it once and then passing those discovered critics to the predict_ranking function in order to improve the performance. However, this method must be run on every single movie in the dataset: >>> for mid, rating in model.predict_all_rankings(578, 'pearson', 10): ... print "%0.3f: %s" % (rating, model.movies[mid]['title']) 5.000: Prefontaine (1997) 5.000: Santa with Muscles (1996) 5.000: Marlene Dietrich: Shadow and Light (1996) 5.000: Star Kid (1997) 5.000: Aiqing wansui (1994) 5.000: Someone Else's America (1995) 5.000: Great Day in Harlem, A (1994) 5.000: Saint of Fort Washington, The (1993) 4.954: Anna (1996) 4.817: Innocents, The (1961) As you can see, we have now computed what our recommender thinks the top movies for this particular user are, along with what we think the user will rate the movie! The top-10 list of average movie ratings plays a huge rule here and a potential improvement could be to use the Bayesian averaging in addition to the similarity weighting, but that is left for the reader to implement. Collaboratively filtering item by item So far, we have compared users to other users in order to make our predictions. However, the similarity space can be partitioned in two ways. User-centric collaborative filtering plots users in the preference space and discovers how similar users are to each other. These similarities are then used to predict rankings, aligning the user with similar critics. Item-centric collaborative filtering does just the opposite; it plots the items together in the preference space and makes recommendations according to how similar a group of items are to another group. Item-based collaborative filtering is a common optimization as the similarity of items changes slowly. Once enough data has been gathered, reviewers adding reviews does not necessarily change the fact that Toy Story is more similar to Babe than The Terminator, and users who prefer Toy Story might prefer the former to the latter. Therefore, you can simply compute item similarities once in a single offline-process and use that as a static mapping for recommendations, updating the results on a semi-regular basis. This recipe will walk you through item-by-item collaborative filtering. Getting ready This recipe requires the completion of the previous recipes in this article. 
How to do it… Construct the following function to perform item-by-item collaborative filtering: def shared_critics(self, movieA, movieB): """ Returns the intersection of critics for two items, A and B """ if movieA not in self.movies: raise KeyError("Couldn't find movie '%s' in data" %movieA) if movieB not in self.movies: raise KeyError("Couldn't find movie '%s' in data" %movieB) criticsA = set(critic for critic in self.reviews if movieA in self.reviews[critic]) criticsB = set(critic for critic in self.reviews if movieB in self.reviews[critic]) shared = criticsA & criticsB # Intersection operator # Create the reviews dictionary to return reviews = {} for critic in shared: reviews[critic] = ( self.reviews[critic][movieA]['rating'], self.reviews[critic][movieB]['rating'], ) return reviews def similar_items(self, movie, metric='euclidean', n=None): # Metric jump table metrics = { 'euclidean': self.euclidean_distance, 'pearson': self.pearson_correlation, } distance = metrics.get(metric, None) # Handle problems that might occur if movie not in self.reviews: raise KeyError("Unknown movie, '%s'." % movie) if not distance or not callable(distance): raise KeyError("Unknown or unprogrammed distance metric '%s'." % metric) items = {} for item in self.movies: if item == movie: continue items[item] = distance(item, movie, prefs='movies') if n: return heapq.nlargest(n, items.items(), key=itemgetter(1)) return items How it works… To perform item-by-item collaborative filtering, the same distance metrics can be used but must be updated to use the preferences from shared_critics rather than shared_preferences (for example, item similarity versus user similarity). Update the functions to accept a prefs parameter that determines which preferences are to be used, but I'll leave that to the reader as it is only two lines of code. If you print out the list of similar items for a particular movie, you can see some interesting results. For example, review the similarity results for The Crying Game (1992), which has an ID of 631: for movie, similarity in model.similar_items(631, 'pearson').items(): print "%0.3f: %s" % (similarity, model.movies[movie]['title']) 0.127: Toy Story (1995) 0.209: GoldenEye (1995) 0.069: Four Rooms (1995) 0.039: Get Shorty (1995) 0.340: Copycat (1995) 0.225: Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 0.232: Twelve Monkeys (1995) ... This crime thriller is not very similar to Toy Story, which is a children's movie, but is more similar to Copycat, which is another crime thriller. Of course, critics who have rated many movies skew the results, and more movie reviews are needed before this normalizes into something more compelling. It is presumed that the item similarity scores are run regularly but do not need to be computed in real time. Given a set of computed item similarities, computing recommendations are as follows: def predict_ranking(self, user, movie, metric='euclidean'): movies = self.similar_items(movie, metric=metric) total = 0.0 simsum = 0.0 for relmovie, similarity in movies.items(): # Ignore movies already reviewed by user if relmovie in self.reviews[user]: total += similarity * self.reviews[user][relmovie]['rating'] simsum += similarity if simsum == 0.0: return 0.0 return total / simsum This method simply uses the inverted item-to-item similarity scores rather than the user-to-user similarity scores. Since similar items can be computed offline, the lookup for movies via the self.similar_items method should be a database lookup rather than a real-time computation. 
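The two-line update that is left to the reader could look like the following sketch; this is just one possible way to wire in the prefs parameter mentioned above, and it replaces the euclidean_distance method defined earlier in the MovieLens class (pearson_correlation needs the same treatment):

def euclidean_distance(self, criticA, criticB, prefs='users'):
    """
    Reports the Euclidean distance of two critics (prefs='users')
    or two movies (prefs='movies') over their shared ratings.
    """
    if prefs == 'movies':
        preferences = self.shared_critics(criticA, criticB)
    else:
        preferences = self.shared_preferences(criticA, criticB)

    # If they have no rankings in common, return 0.
    if len(preferences) == 0:
        return 0

    # Sum the squares of the differences and invert the distance as before.
    sum_of_squares = sum([pow(a - b, 2) for a, b in preferences.values()])
    return 1 / (1 + sqrt(sum_of_squares))

With a change along these lines in place, the metrics jump table in similar_items can dispatch to either preference space, and the item-based predict_ranking method shown above can be exercised: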
>>> print model.predict_ranking(232, 52, 'pearson') 3.980443976 You can then compute a ranked list of all possible recommendations in a similar way as the user-to-user recommendations. Building a nonnegative matrix factorization model A general improvement on the basic cross-wise nearest-neighbor similarity scoring of collaborative filtering is a matrix factorization method, which is also known as Singular Value Decomposition (SVD). Matrix factorization methods attempt to explain the ratings through the discovery of latent features that are not easily identifiable by analysts. For instance, this technique can expose possible features such as the amount of action, family friendliness, or fine-tuned genre discovery in our movies dataset. What's especially interesting about these features is that they are continuous and not discrete values and can represent an individual's preference along a continuum. In this sense, the model can explore shades of characteristics, for example, perhaps a critic in the movie reviews' dataset, such as action flicks with a strong female lead that are set in European countries. A James Bond movie might represent a shade of that type of movie even though it only ticks the set in European countries and action genre boxes. Depending on how similarly reviewers rate the movie, the strength of the female counterpart to James Bond will determine how they might like the movie. Also, extremely helpfully, the matrix factorization model does well on sparse data, that is data with few recommendation and movie pairs. Reviews' data is particularly sparse because not everyone has rated the same movies and there is a massive set of available movies. SVD can also be performed in parallel, making it a good choice for much larger datasets. In the remaining recipes in this article, we will build a nonnegative matrix factorization model in order to improve our recommendation engine. How to do it… Loading the entire dataset into the memory. Dumping the SVD-based model to the disk. Training the SVD-based model. Testing the SVD-based model. How it works… Matrix factorization, or SVD works, by finding two matrices such that when you take their dot product (also known as the inner product or scalar product), you will get a close approximation of the original matrix. We have expressed our training matrix as a sparse N x M matrix of users to movies where the values are the 5-star rating if it exists, otherwise, the value is blank or 0. By factoring the model with the values that we have and then taking the dot product of the two matrices produced by the factorization, we hope to fill in the blank spots in our original matrix with a prediction of how the user would have rated the movie in that column. The intuition is that there should be some latent features that determine how users rate an item, and these latent features are expressed through the semantics of their previous ratings. If we can discover the latent features, we will be able to predict new ratings. Additionally, there should be fewer features than there are users and movies (otherwise, each movie or user would be a unique feature). This is why we compose our factored matrices by some feature length before taking their dot product. Mathematically, this task is expressed as follows. If we have a set of U users and M movies, let R of size |U| x |M| be the matrix that contains the ratings of users. 
Assuming that we have K latent features, find two matrices, P and Q, where P is |U| x K and Q is |M| x K, such that the dot product of P and Q transpose approximates R. P therefore represents the strength of the associations between users and features, and Q represents the association of movies with features. There are a few ways to go about factorization, but the choice we made was to perform gradient descent. Gradient descent initializes two random P and Q matrices, computes their dot product, and then minimizes the error compared to the original matrix by traveling down a slope of an error function (the gradient). This way, the algorithm hopes to find a local minimum where the error is within an acceptable threshold. Our function computes the error for each known rating as the squared difference between the actual and predicted value:

eij² = (rij - Σk pik · qkj)²

To minimize the error, we modify the values pik and qkj by descending along the gradient of the current error slope. Differentiating our error equation with respect to p yields:

∂eij² / ∂pik = -2 · eij · qkj

We then differentiate our error equation with respect to the variable q, which yields the following equation:

∂eij² / ∂qkj = -2 · eij · pik

We can then derive our learning rule, which updates the values in P and Q by a constant learning rate, which is α:

pik ← pik + α · (2 · eij · qkj - β · pik)
qkj ← qkj + α · (2 · eij · pik - β · qkj)

Here, β is a small regularization parameter that keeps the values in P and Q from growing too large. This learning rate, α, should not be too large because it determines how big of a step we take towards the minimum, and it is possible to step across to the other side of the error curve. It should also not be too small, otherwise it will take forever to converge. We continue to update our P and Q matrices, minimizing the error until the sum of the error squared is below some threshold, 0.001 in our code, or until we have performed a maximum number of iterations.

Matrix factorization has become an important technique for recommender systems, particularly those that leverage Likert-scale-like preference expressions—notably, star ratings. The Netflix Prize challenge has shown us that matrix-factored approaches perform with a high degree of accuracy for ratings prediction tasks. Additionally, matrix factorization is a compact, memory-efficient representation of the parameter space for a model and can be trained in parallel, can support multiple feature vectors, and can be improved with confidence levels. Generally, they are used to solve cold-start problems with sparse reviews and in an ensemble with more complex hybrid-recommenders that also compute content-based recommenders.

See also

Wikipedia's overview of the dot product available at http://en.wikipedia.org/wiki/Dot_product

Loading the entire dataset into the memory

The first step in building a nonnegative factorization model is to load the entire dataset into memory. For this task, we will be leveraging NumPy highly.

Getting ready

In order to complete this recipe, you'll have to download the MovieLens database from the University of Minnesota GroupLens page at http://grouplens.org/datasets/movielens/ and unzip it in a working directory where your code will be. We will also use NumPy in this code significantly, so please ensure that you have this numerical analysis package downloaded and ready. Additionally, we will use the load_reviews function from the previous recipes. If you have not had the opportunity to review the appropriate section, please have the code for that function ready.

How to do it…

To build our matrix factorization model, we'll need to create a wrapper for the predictor that loads the entire dataset into memory. We will perform the following steps:

We create the following Recommender class as shown.
Please note that this class depends on the previously created and discussed load_reviews function: import numpy as np import csv class Recommender(object): def __init__(self, udata): self.udata = udata self.users = None self.movies = None self.reviews = None self.load_dataset() def load_dataset(self): """ Load an index of users & movies as a heap and reviews table as a N x M array where N is the number of users and M is the number of movies. Note that order matters so that we can look up values outside of the matrix! """ self.users = set([]) self.movies = set([]) for review in load_reviews(self.udata): self.users.add(review['userid']) self.movies.add(review['movieid']) self.users = sorted(self.users) self.movies = sorted(self.movies) self.reviews = np.zeros(shape=(len(self.users), len(self.movies))) for review in load_reviews(self.udata): uid = self.users.index(review['userid']) mid = self.movies.index(review['movieid']) self.reviews[uid, mid] = review['rating'] With this defined, we can instantiate a model by typing the following command: data_path = '../data/ml-100k/u.data' model = Recommender(data_path) How it works… Let's go over this code line by line. The instantiation of our recommender requires a path to the u.data file; creates holders for our list of users, movies, and reviews; and then loads the dataset. We need to hold the entire dataset in memory for reasons that we will see later. The basic data structure to perform our matrix factorization on is an N x M matrix where N is the number of users and M is the number of movies. To create this, we will first load all the movies and users into an ordered list so that we can look up the index of the user or movie by its ID. In the case of MovieLens, all of the IDs are contiguous from 1; however, this might not always be the case. It is good practice to have an index lookup table. Otherwise, you will be unable to fetch recommendations from our computation! Once we have our index lookup lists, we create a NumPy array of all zeroes in the size of the length of our users' list by the length of our movies list. Keep in mind that the rows are users and the columns are movies! We then go through the ratings data a second time and then add the value of the rating at the uid, mid index location of our matrix. Note that if a user hasn't rated a movie, their rating is 0. This is important! Print the array out by entering model.reviews, and you should see something as follows: [[ 5. 3. 4. ..., 0. 0. 0.] [ 4. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] ..., [ 5. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] [ 0. 5. 0. ..., 0. 0. 0.]] There's more… Let's get a sense of how sparse or dense our dataset is by adding the following two methods to the Recommender class: def sparsity(self): """ Report the percent of elements that are zero in the array """ return 1 - self.density() def density(self): """ Return the percent of elements that are nonzero in the array """ nonzero = float(np.count_nonzero(self.reviews)) return nonzero / self.reviews.size Adding these methods to our Recommender class will help us evaluate our recommender, and it will also help us identify recommenders in the future. Print out the results: print "%0.3f%% sparse" % model.sparsity() print "%0.3f%% dense" % model.density() You should see that the MovieLens 100k dataset is 0.937 percent sparse and 0.063 percent dense. This is very important to keep note of along with the size of the reviews dataset. 
Sparsity, which is common to most recommender systems, means that we might be able to use sparse matrix algorithms and optimizations. Additionally, as we begin to save models, this will help us identify the models as we load them from serialized files on the disk. Dumping the SVD-based model to the disk Before we build our model, which will take a long time to train, we should create a mechanism for us to load and dump our model to the disk. If we have a way of saving the parameterization of the factored matrix, then we can reuse our model without having to train it every time we want to use it—this is a very big deal since this model will take hours to train! Luckily, Python has a built-in tool for serializing and deserializing Python objects—the pickle module. How to do it… Update the Recommender class as follows: import pickle class Recommender(object): @classmethod def load(klass, pickle_path): """ Instantiates the class by deserializing the pickle. Note that the object returned may not be an exact match to the code in this class (if it was saved before updates). """ with open(pickle_path, 'rb') as pkl: return pickle.load(pkl) def __init__(self, udata, description=None): self.udata = udata self.users = None self.movies = None self.reviews = None # Descriptive properties self.build_start = None self.build_finish = None self.description = None # Model properties self.model = None self.features = 2 self.steps = 5000 self.alpha = 0.0002 self.beta = 0.02 self.load_dataset() def dump(self, pickle_path): """ Dump the object into a serialized file using the pickle module. This will allow us to quickly reload our model in the future. """ with open(pickle_path, 'wb') as pkl: pickle.dump(self, pkl) How it works… The @classmethod feature is a decorator in Python for declaring a class method instead of an instance method. The first argument that is passed in is the type instead of an instance (which we usually refer to as self). The load class method takes a path to a file on the disk that contains a serialized pickle object, which it then loads using the pickle module. Note that the class that is returned might not be an exact match with the Recommender class at the time you run the code—this is because the pickle module saves the class, including methods and properties, exactly as it was when you dumped it. Speaking of dumping, the dump method provides the opposite functionality, allowing you to serialize the methods, properties, and data to disk in order to be loaded again in the future. To help us identify the objects that we're dumping and loading from disk, we've also added some descriptive properties including a description, some build parameters, and some timestamps to our __init__ function. Training the SVD-based model We're now ready to write our functions that factor our training dataset and build our recommender model. You can see the required functions in this recipe. How to do it… We construct the following functions to train our model. Note that these functions are not part of the Recommender class: def initialize(R, K): """ Returns initial matrices for an N X M matrix, R and K features. :param R: the matrix to be factorized :param K: the number of latent features :returns: P, Q initial matrices of N x K and M x K sizes """ N, M = R.shape P = np.random.rand(N,K) Q = np.random.rand(M,K) return P, Q def factor(R, P=None, Q=None, K=2, steps=5000, alpha=0.0002, beta=0.02): """ Performs matrix factorization on R with given parameters. 
    :param R: A matrix to be factorized, dimension N x M
    :param P: an initial matrix of dimension N x K
    :param Q: an initial matrix of dimension M x K
    :param K: the number of latent features
    :param steps: the maximum number of iterations to optimize in
    :param alpha: the learning rate for gradient descent
    :param beta: the regularization parameter
    :returns: final matrices P and Q
    """
    if P is None or Q is None:
        P, Q = initialize(R, K)
    Q = Q.T

    rows, cols = R.shape
    for step in xrange(steps):
        for i in xrange(rows):
            for j in xrange(cols):
                if R[i,j] > 0:
                    eij = R[i,j] - np.dot(P[i,:], Q[:,j])
                    for k in xrange(K):
                        P[i,k] = P[i,k] + alpha * (2 * eij * Q[k,j] - beta * P[i,k])
                        Q[k,j] = Q[k,j] + alpha * (2 * eij * P[i,k] - beta * Q[k,j])

        e = 0
        for i in xrange(rows):
            for j in xrange(cols):
                if R[i,j] > 0:
                    e = e + pow(R[i,j] - np.dot(P[i,:], Q[:,j]), 2)
                    for k in xrange(K):
                        e = e + (beta/2) * (pow(P[i,k], 2) + pow(Q[k,j], 2))
        if e < 0.001:
            break

    return P, Q.T

How it works…

We discussed the theory and the mathematics of what we are doing in the previous recipe, Building a nonnegative matrix factorization model, so let's talk about the code. The initialize function creates two matrices, P and Q, that have a size related to the reviews matrix and the number of features, namely N x K and M x K, where N is the number of users and M is the number of movies. Their values are initialized to random numbers that are between 0.0 and 1.0. The factor function computes P and Q using gradient descent such that the dot product of P and Q is within a total squared error of less than 0.001, or until 5000 steps have gone by, whichever comes first. Especially note that only values that are greater than 0 are used when computing the error. The zero entries are the values that we're trying to predict; therefore, we do not want to attempt to match them during training (otherwise, the model would be trained to reproduce zero ratings)! This is also the reason that you can't use NumPy's built-in Singular Value Decomposition (SVD) function, which is np.linalg.svd or np.linalg.solve.

There's more…

Let's use these factorization functions to build our model and to save the model to disk once it has been built—this way, we can load the model at our convenience using the dump and load methods in the class. Add the following method to the Recommender class:

import time

def build(self, output=None):
    """
    Trains the model by employing matrix factorization on the training
    data set (sparse reviews matrix). The model is the dot product of
    the P and Q decomposed matrices from the factorization.
    """
    options = {
        'K': self.features,
        'steps': self.steps,
        'alpha': self.alpha,
        'beta': self.beta,
    }

    self.build_start = time.time()
    self.P, self.Q = factor(self.reviews, **options)
    self.model = np.dot(self.P, self.Q.T)
    self.build_finish = time.time()

    if output:
        self.dump(output)

This helper function will allow us to quickly build our model. Note that we're also saving P and Q—the parameters of our latent features. This isn't necessary, as our predictive model is the dot product of the two factored matrices. Deciding whether or not to save this information in your model is a trade-off between re-training time (you can potentially start from the current P and Q parameters, although you must beware of overfitting) and disk space, as the pickle will be larger on the disk with these matrices saved. To build this model and dump the data to the disk, run the following code:

model = Recommender(relative_path('../data/ml-100k/u.data'))
model.build('reccod.pickle')

Warning! This will take a long time to build!
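Before committing to the full run, it is worth sanity-checking factor() on a tiny, hand-made ratings matrix; the following sketch (the matrix values are invented, and numpy is assumed to be imported as np alongside the functions above) should finish in a few seconds:

# A tiny ratings matrix: 4 users x 4 movies, where 0 means "not rated".
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

P, Q = factor(R, K=2, steps=5000)
predictions = np.dot(P, Q.T)

# The nonzero cells should come back close to their original values,
# while the zero cells now hold predicted ratings.
print(np.round(predictions, 2))

If the reconstruction looks reasonable here, the same call will behave identically on the full reviews matrix, only much more slowly.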
On a 2013 MacBook Pro with a 2.8 GHz processor, this process took roughly 9 hours 15 minutes and required 23.1 MB of memory; this is not insignificant for most of the Python scripts you might be used to writing! It is not a bad idea to continue through the rest of the recipe before building your model. It is also probably not a bad idea to test your code on a smaller test set of 100 records before moving on to the entire process! Additionally, if you don't have the time to train the model, you can find the pickle module of our model in the errata of this book. Testing the SVD-based model This recipe brings this article on recommendation engines to a close. We use our new nonnegative matrix factorization-based model and take a look at some of the predicted reviews. How to do it… The final step in leveraging our model is to access the predicted reviews for a movie based on our model: def predict_ranking(self, user, movie): uidx = self.users.index(user) midx = self.movies.index(movie) if self.reviews[uidx, midx] > 0: return None return self.model[uidx, midx] How it works… Computing the ranking is relatively easy; we simply need to look up the index of the user and the index of the movie and look up the predicted rating in our model. This is why it is so essential to save an ordered list of the users and movies in our pickle module; this way, if the data changes (we add users or movies) but the change isn't reflected in our model, an exception is raised. Because models are historical predictions and not sensitive to changes in time, we need to ensure that we continually retrain our model with new data. This method also returns None if we know the ranking of the user (for example, it's not a prediction); we'll leverage this in the next step. There's more… To predict the highest-ranked movies, we can leverage the previous function to order the highest predicted rankings for our user: import heapq from operator import itemgetter def top_rated(self, user, n=12): movies = [(mid, self.predict_ranking(user, mid)) for mid in self.movies] return heapq.nlargest(n, movies, key=itemgetter(1)) We can now print out the top-predicted movies that have not been rated by the user: >>> rec = Recommender.load('reccod.pickle') >>> for item in rec.top_rated(234): ... print "%i: %0.3f" % item 814: 4.437 1642: 4.362 1491: 4.361 1599: 4.343 1536: 4.324 1500: 4.323 1449: 4.281 1650: 4.147 1645: 4.135 1467: 4.133 1636: 4.133 1651: 4.132 It's then simply a matter of using the movie ID to look up the movie in our movies database. Summary To learn more about Data Science, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Principles of Data Science (https://www.packtpub.com/big-data-and-business-intelligence/principles-data-science) Python Data Science Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/python-data-science-cookbook) R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science) Resources for Article: Further resources on this subject: Big Data[article] Big Data Analysis (R and Hadoop)[article] Visualization of Big Data[article]

Searching Your Data

Packt
12 Feb 2016
22 min read
In this article by Rafał Kuć and Marek Rogozinski the authors of this book Elasticsearch Server Third Edition, we dived into Elasticsearch indexing. We learned a lot when it comes to data handling. We saw how to tune Elasticsearch schema-less mechanism and we now know how to create our own mappings. We also saw the core types of Elasticsearch and we used analyzers – both the one that comes out of the box with Elasticsearch and the one we define ourselves. We used bulk indexing, and we added additional internal information to our indices. Finally, we learned what segment merging is, how we can fine tune it, and how to use routing in Elasticsearch and what it gives us. This article is fully dedicated to querying. By the end of this article, you will have learned the following topics: How to query Elasticsearch Using the script process Understanding the querying process (For more resources related to this topic, see here.) Querying Elasticsearch So far, when we searched our data, we used the REST API and a simple query or the GET request. Similarly, when we were changing the index, we also used the REST API and sent the JSON-structured data to Elasticsearch. Regardless of the type of operation we wanted to perform, whether it was a mapping change or document indexation, we used JSON structured request body to inform Elasticsearch about the operation details. A similar situation happens when we want to send more than a simple query to Elasticsearch we structure it using the JSON objects and send it to Elasticsearch in the request body. This is called the query DSL. In a broader view, Elasticsearch supports two kinds of queries: basic ones and compound ones. Basic queries, such as the term query, are used for querying the actual data. The second type of query is the compound query, such as the bool query, which can combine multiple queries. However, this is not the whole picture. In addition to these two types of queries, certain queries can have filters that are used to narrow down your results with certain criteria. Filter queries don't affect scoring and are usually very efficient and easily cached. To make it even more complicated, queries can contain other queries (don't worry; we will try to explain all this!). Furthermore, some queries can contain filters and others can contain both queries and filters. Although this is not everything, we will stick with this working explanation for now. The example data If not stated otherwise, the following mappings will be used for the rest of the article: { "book" : { "properties" : { "author" : { "type" : "string" }, "characters" : { "type" : "string" }, "copies" : { "type" : "long", "ignore_malformed" : false }, "otitle" : { "type" : "string" }, "tags" : { "type" : "string", "index" : "not_analyzed" }, "title" : { "type" : "string" }, "year" : { "type" : "long", "ignore_malformed" : false, "index" : "analyzed" }, "available" : { "type" : "boolean" } } } } The preceding mappings represent a simple library and were used to create the library index. One thing to remember is that Elasticsearch will analyze the string based fields if we don't configure it differently. 
The preceding mappings were stored in the mapping.json file and in order to create the mentioned library index we can use the following commands: curl -XPOST 'localhost:9200/library' curl -XPUT 'localhost:9200/library/book/_mapping' -d @mapping.json We also used the following sample data as the example ones for this article: { "index": {"_index": "library", "_type": "book", "_id": "1"}} { "title": "All Quiet on the Western Front","otitle": "Im Westen nichts Neues","author": "Erich Maria Remarque","year": 1929,"characters": ["Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden"],"tags": ["novel"],"copies": 1, "available": true, "section" : 3} { "index": {"_index": "library", "_type": "book", "_id": "2"}} { "title": "Catch-22","author": "Joseph Heller","year": 1961,"characters": ["John Yossarian", "Captain Aardvark", "Chaplain Tappman", "Colonel Cathcart", "Doctor Daneeka"],"tags": ["novel"],"copies": 6, "available" : false, "section" : 1} { "index": {"_index": "library", "_type": "book", "_id": "3"}} { "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12} { "index": {"_index": "library", "_type": "book", "_id": "4"}} { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true} We stored our sample data in the documents.json file, and we use the following command to index it: curl -s -XPOST 'localhost:9200/_bulk' --data-binary @documents.json A simple query The simplest way to query Elasticsearch is to use the URI request query. For example, to search for the word crime in the title field, you could send a query using the following command: curl -XGET 'localhost:9200/library/book/_search?q=title:crime&pretty' This is a very simple, but limited, way of submitting queries to Elasticsearch. If we look from the point of view of the Elasticsearch query DSL, the preceding query is the query_string query. It searches for the documents that have the term crime in the title field and can be rewritten as follows: { "query" : { "query_string" : { "query" : "title:crime" } } } Sending a query using the query DSL is a bit different, but still not rocket science. We send the GET (POST is also accepted in case your tool or library doesn't allow sending request body in HTTP GET requests) HTTP request to the _search REST endpoint as earlier and include the query in the request body. Let's take a look at the following command: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "query" : { "query_string" : { "query" : "title:crime" } } }' As you can see, we used the request body (the -d switch) to send the whole JSON-structured query to Elasticsearch. The pretty request parameter tells Elasticsearch to structure the response in such a way that we humans can read it more easily. 
In response to the preceding command, we get the following output: { "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "_source" : { "title" : "Crime and Punishment", "otitle" : "Преступлéние и наказáние", "author" : "Fyodor Dostoevsky", "year" : 1886, "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ], "tags" : [ ], "copies" : 0, "available" : true } } ] } } Nice! We got our first search results with the query DSL. Paging and result size Elasticsearch allows us to control how many results we want to get (at most) and from which result we want to start. The following are the two additional properties that can be set in the request body: from: This property specifies the document that we want to have our results from. Its default value is 0, which means that we want to get our results from the first document. size: This property specifies the maximum number of documents we want as the result of a single query (which defaults to 10). For example, if weare only interested in aggregations results and don't care about the documents returned by the query, we can set this parameter to 0. If we want our query to get documents starting from the tenth item on the list and get 20 of items from there on, we send the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "from" : 9, "size" : 20, "query" : { "query_string" : { "query" : "title:crime" } } }' Returning the version value In addition to all the information returned, Elasticsearch can return the version of the document. To do this, we need to add the version property with the value of true to the top level of our JSON object. So, the final query, which requests for version information, will look as follows: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "version" : true, "query" : { "query_string" : { "query" : "title:crime" } } }' After running the preceding query, we get the following results: { "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_version" : 1, "_score" : 0.5, "_source" : { "title" : "Crime and Punishment", "otitle" : "Преступлéние и наказáние", "author" : "Fyodor Dostoevsky", "year" : 1886, "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ], "tags" : [ ], "copies" : 0, "available" : true } } ] } } As you can see, the _version section is present for the single hit we got. Limiting the score For nonstandard use cases, Elasticsearch provides a feature that lets us filter the results on the basis of a minimum score value that the document must have to be considered a match. In order to use this feature, we must provide the min_score value at the top level of our JSON object with the value of the minimum score. 
For example, if we want our query to only return documents with a score higher than 0.75, we send the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "min_score" : 0.75, "query" : { "query_string" : { "query" : "title:crime" } } }' We get the following response after running the preceding query: { "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : null, "hits" : [ ] } } If you look at the previous examples, the score of our document was 0.5, which is lower than 0.75, and thus we didn't get any documents in response. Limiting the score usually doesn't make much sense because comparing scores between the queries is quite hard. However, maybe in your case, this functionality will be needed. Choosing the fields that we want to return With the use of the fields array in the request body, Elasticsearch allows us to define which fields to include in the response. Remember that you can only return these fields if they are marked as stored in the mappings used to create the index, or if the _source field was used (Elasticsearch uses the _source field to provide the stored values and the _source field is turned on by default). So, for example, to return only the title and the year fields in the results (for each document), send the following query to Elasticsearch: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "fields" : [ "title", "year" ], "query" : { "query_string" : { "query" : "title:crime" } } }' In response, we get the following output: { "took" : 5, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "fields" : { "title" : [ "Crime and Punishment" ], "year" : [ 1886 ] } } ] } } As you can see, everything worked as we wanted to. There are four things we will like to share with you, which are as follows: If we don't define the fields array, it will use the default value and return the _source field if available. If we use the _source field and request a field that is not stored, then that field will be extracted from the _source field (however, this requires additional processing). If we want to return all the stored fields, we just pass an asterisk (*) as the field name. From a performance point of view, it's better to return the _source field instead of multiple stored fields. This is because getting multiple stored fields may be slower compared to retrieving a single _source field. Source filtering In addition to choosing which fields are returned, Elasticsearch allows us to use the so-called source filtering. This functionality allows us to control which fields are returned from the _source field. Elasticsearch exposes several ways to do this. The simplest source filtering allows us to decide whether a document should be returned or not. 
Consider the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : false, "query" : { "query_string" : { "query" : "title:crime" } } }' The result returned by Elasticsearch should be similar to the following one: { "took" : 12, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5 } ] } } Note that the response is limited to the basic information about the document and the _source field was not included. If you use Elasticsearch as a secondary source of data and the content of the document is served from an SQL database or a cache, the document identifier is all you need. The second way is similar to the fields functionality described previously, although here we define which fields should be returned in the document source itself. Let's see that using the following example query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : ["title", "otitle"], "query" : { "query_string" : { "query" : "title:crime" } } }' We wanted to get the title and the otitle document fields in the returned _source field. Elasticsearch extracted those values from the original _source value and included the _source field only with the requested fields. The whole response returned by Elasticsearch looked as follows: { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "_source" : { "otitle" : "Преступлéние и наказáние", "title" : "Crime and Punishment" } } ] } } We can also use an asterisk to select which fields should be returned in the _source field; for example, title* will return the value of the title field as well as of title10 (if we have such a field in our data). If we have a more extended document with nested parts, we can use dot notation; for example, title.* selects all the fields nested under the title object. Finally, we can also specify explicitly which fields we want to include and which to exclude from the _source field. We can include fields using the include property and we can exclude fields using the exclude property (both of them are arrays of values). For example, if we want the returned _source field to include all the fields starting with the letter t but not the title field, we will run the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : { "include" : [ "t*"], "exclude" : ["title"] }, "query" : { "query_string" : { "query" : "title:crime" } } }' Using the script fields Elasticsearch allows us to use script-evaluated values that will be returned with the result documents. To use the script fields functionality, we add the script_fields section to our JSON query object and an object with a name of our choice for each scripted value that we want to return. For example, to return a value named correctYear, which is calculated as the year field minus 1800, we run the following query: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "doc[\"year\"].value - 1800" } }, "query" : { "query_string" : { "query" : "title:crime" } } }' By default, Elasticsearch doesn't allow us to use dynamic scripting.
If you tried the preceding query, you probably got an error with information stating that the scripts of type [inline] with operation [search] and language [groovy] are disabled. To make this example work, you should add the script.inline: on property to the elasticsearch.yml file. However, this exposes a security threat. Using the doc notation, like we did in the preceding example, allows us to cache the field values and speed up script execution at the cost of higher memory consumption. We are also limited to single-valued and single-term fields. If we care about memory usage, or if we are using more complicated field values, we can always use the _source field. The same query using the _source field looks as follows: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "_source.year - 1800" } }, "query" : { "query_string" : { "query" : "title:crime" } } }' The following response is returned by Elasticsearch with dynamic scripting enabled: { "took" : 76, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "fields" : { "correctYear" : [ 86 ] } } ] } } As you can see, we got the calculated correctYear field in response. Passing parameters to the script fields Let's take a look at one more feature of the script fields - the passing of additional parameters. Instead of having the value 1800 in the equation, we can use a variable name and pass its value in the params section. If we do this, our query will look as follows: curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "_source.year - paramYear", "params" : { "paramYear" : 1800 } } }, "query" : { "query_string" : { "query" : "title:crime" } } }' As you can see, we added the paramYear variable as part of the scripted equation and provided its value in the params section. This allows Elasticsearch to execute the same script with different parameter values in a slightly more efficient way. Understanding the querying process After reading the previous section, we now know how querying works in Elasticsearch. You know that Elasticsearch, in most cases, needs to scatter the query across multiple nodes, get the results, merge them, fetch the relevant documents from one or more shards, and return the final results to the client requesting the documents. What we didn't talk about are two additional things that define how queries behave: search type and query execution preference. We will now concentrate on these functionalities of Elasticsearch. Query logic Elasticsearch is a distributed search engine and so all functionality provided must be distributed in its nature. It is exactly the same with querying. Because we would like to discuss some more advanced topics on how to control the query process, we first need to know how it works. Let's now get back to how querying works. By default, if we don't alter anything, the query process will consist of two phases: the scatter and the gather phase. The aggregator node (the one that receives the request) will run the scatter phase first. During that phase, the query is distributed to all the shards that our index is built of (of course, if routing is not used). For example, if it is built of 5 shards and 1 replica, then 5 physical shards will be queried (we don't need to query a shard and its replica as they contain the same data).
Each of the queried shards will only return the document identifier and the score of the document. The node that sent the scatter query will wait for all the shards to complete their task, gather the results, and sort them appropriately (in this case, from top scoring to the lowest scoring ones). After that, a new request will be sent to build the search results. However, this time it is sent only to those shards that hold the documents needed to build the response. In most cases, Elasticsearch won't send the request to all the shards but only to a subset of them. That's because we usually don't get the complete result of the query but only a portion of it. This phase is called the gather phase. After all the documents are gathered, the final response is built and returned as the query result. This is the basic and default Elasticsearch behavior but we can change it. Search type Elasticsearch allows us to choose how we want our query to be processed internally. We can do that by specifying the search type. Different search types are appropriate in different situations: sometimes performance is all that matters, while at other times query relevance is the most important factor. You should remember that each shard is a small Lucene index and in order to return more relevant results, some information, such as frequencies, needs to be transferred between the shards. To control how the queries are executed, we can pass the search_type request parameter and set it to one of the following values: query_then_fetch: In the first step, the query is executed to get the information needed to sort and rank the documents. This step is executed against all the shards. Then only the relevant shards are queried for the actual content of the documents. The maximum number of results returned by this query type will be equal to the size parameter. This is the search type used by default if no search type is provided with the query, and this is the query type we described previously. dfs_query_then_fetch: This is similar to its counterpart, query_then_fetch. However, it contains an additional initial phase, which calculates distributed term frequencies in order to provide more accurate scoring. There are also two deprecated search types: count and scan. The first one is deprecated starting from Elasticsearch 2.0 and the second one starting with Elasticsearch 2.1. The first search type used to provide benefits where only aggregations or the number of documents was relevant, but now it is enough to add size equal to 0 to your queries. The scan request was used for scrolling functionality. So if we would like to use the simplest search type, we would run the following command: curl -XGET 'localhost:9200/library/book/_search?pretty&search_type=query_then_fetch' -d '{ "query" : { "term" : { "title" : "crime" } } }' Search execution preference In addition to the possibility of controlling how the query is executed, we can also control on which shards to execute the query. By default, Elasticsearch uses shards and replicas, both the ones available on the node we've sent the request to and on the other nodes in the cluster. The default behavior is usually the proper choice of shard preference for queries. But there may be times when we want to change the default behavior. For example, you may want the search to be executed only on the primary shards.
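Before going through the preference values, the following toy sketch (plain Python, not Elasticsearch code, with made-up scores and identifiers) illustrates the query_then_fetch flow described above: each shard contributes only (score, identifier) pairs in the scatter phase, and the aggregator merges them to decide which documents to fetch:

import heapq

# Fake per-shard results: (score, document id) pairs, one list per shard
shard_results = [
    [(1.2, "12"), (0.7, "3")],   # shard 0
    [(0.9, "4"), (0.5, "27")],   # shard 1
    [(1.5, "8")],                # shard 2
]

def query_then_fetch(shard_results, size=3):
    # Gather phase: merge the scattered hits and keep the global top 'size' entries
    top = heapq.nlargest(size, (hit for shard in shard_results for hit in shard))
    # Only the shards holding these identifiers would then be asked
    # for the actual document content.
    return [doc_id for _, doc_id in top]

print(query_then_fetch(shard_results))   # ['8', '12', '4']

Getting back to controlling where a query runs: as mentioned, you may want the search to hit only the primary shards.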
To do that, we can set the preference request parameter to one of the following values: _primary: The operation will be only executed on the primary shards, so the replicas won't be used. This can be useful when we need to use the latest information from the index but our data is not replicated right away. _primary_first: The operation will be executed on the primary shards if they are available. If not, it will be executed on the other shards. _replica: The operation will be executed only on the replica shards. _replica_first: This operation is similar to _primary_first, but uses replica shards. The operation will be executed on the replica shards if possible, and on the primary shards if the replicas are not available. _local: The operation will be executed on the shards available on the node which the request was sent and if such shards are not present, the request will be forwarded to the appropriate nodes. _only_node:node_id: This operation will be executed on the node with the provided node identifier. _only_nodes:nodes_spec: This operation will be executed on the nodes that are defined in nodes_spec. This can be an IP address, a name, a name or IP address using wildcards, and so on. For example, if nodes_spec is set to 192.168.1.*, the operation will be run on the nodes with IP address starting with 192.168.1. _prefer_node:node_id: Elasticsearch will try to execute the operation on the node with the provided identifier. However, if the node is not available, it will be executed on the nodes that are available. _shards:1,2: Elasticsearch will execute the operation on the shards with the given identifiers; in this case, on shards with identifiers 1 and 2. The _shards parameter can be combined with other preferences, but the shards identifiers need to be provided first. For example, _shards:1,2;_local. Custom value: Any custom, string value may be passed. Requests with the same values provided will be executed on the same shards. For example, if we would like to execute a query only on the local shards, we would run the following command: curl -XGET 'localhost:9200/library/_search?pretty&preference=_local' -d '{ "query" : { "term" : { "title" : "crime" } } }' Search shards API When discussing the search preference, we will also like to mention the search shards API exposed by Elasticsearch. This API allows us to check which shards the query will be executed at. In order to use this API, run a request against the search_shards rest end point. 
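One more remark on the preference parameter before we look at the _search_shards output: the custom string value mentioned above is a simple way to get session stickiness, so that repeated requests from the same user hit the same set of shards and see stable scoring and paging. A hedged sketch (again our own Python illustration using the requests library, not taken from the original examples):

import requests

query = {"query": {"term": {"title": "crime"}}}

# Reusing the same custom preference string for a given user session
# routes all of that session's searches to the same shard copies.
response = requests.post("http://localhost:9200/library/_search",
                         params={"preference": "user_12345", "pretty": "true"},
                         json=query)
print(response.text)

Now let's see what the search shards API returns.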
For example, to see how the query will be executed, we run the following command: curl -XGET 'localhost:9200/library/_search_shards?pretty' -d '{"query":{"match_all":{}}}' The response to the preceding command will be as follows: { "nodes" : { "my0DcA_MTImm4NE3cG3ZIg" : { "name" : "Cloud 9", "transport_address" : "127.0.0.1:9300", "attributes" : { } } }, "shards" : [ [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 0, "index" : "library", "version" : 4, "allocation_id" : { "id" : "9ayLDbL1RVSyJRYIJkuAxg" } } ], [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 1, "index" : "library", "version" : 4, "allocation_id" : { "id" : "wfpvtaLER-KVyOsuD46Yqg" } } ], [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 2, "index" : "library", "version" : 4, "allocation_id" : { "id" : "zrLPWhCOSTmjlb8TY5rYQA" } } ], [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 3, "index" : "library", "version" : 4, "allocation_id" : { "id" : "efnvY7YcSz6X8X8USacA7g" } } ], [ { "state" : "STARTED", "primary" : true, "node" : "my0DcA_MTImm4NE3cG3ZIg", "relocating_node" : null, "shard" : 4, "index" : "library", "version" : 4, "allocation_id" : { "id" : "XJHW2J63QUKdh3bK3T2nzA" } } ] ] } As you can see, in the response returned by Elasticsearch, we have the information about the shards that will be used during the query process. Of course, with the search shards API, we can use additional parameters that control the querying process. These properties are routing, preference, and local. We are already familiar with the first two. The local parameter is a Boolean (values true or false) one that allows us to tell Elasticsearch to use the cluster state information stored on the local node (setting local to true) instead of the one from the master node (setting local to false). This allows us to diagnose problems with cluster state synchronization. Summary This article has been all about querying Elasticsearch. We started by looking at how to query Elasticsearch and what Elasticsearch does when it needs to handle the query. We also learned about the basic and compound queries, so we are now able to use both simple queries as well as the ones that group multiple small queries together. Finally, we discussed how to choose the right query for a given use case. Resources for Article: Further resources on this subject: Extending ElasticSearch with Scripting [article] Integrating Elasticsearch with the Hadoop ecosystem [article] Elasticsearch Administration [article]

Gradient Descent at Work

Packt
03 Feb 2016
11 min read
In this article by Alberto Boschetti and Luca Massaron, authors of the book Regression Analysis with Python, we will learn about gradient descent, its feature scaling and a simple implementation. (For more resources related to this topic, see here.) As an alternative to the usual classical optimization algorithms, the gradient descent technique is able to minimize the cost function of a linear regression analysis using far fewer computations. In terms of complexity, gradient descent ranks in the order O(n*p), thus making learning regression coefficients feasible even in the occurrence of a large n (that stands for the number of observations) and large p (number of variables). The method works by leveraging a simple heuristic that gradually converges to the optimal solution starting from a random one. Explaining it using simple words, it resembles walking blind in the mountains. If you want to descend to the lowest valley, even if you don't know and can't see the path, you can proceed approximately by going downhill for a while, then stopping, then directing downhill again, and so on, always directing at each stage where the surface descends until you arrive at a point when you cannot descend anymore. Hopefully, at that point, you will have reached your destination. In such a situation, your only risk is to pass by an intermediate valley (where there is a wood or a lake for instance) and mistake it for your desired arrival because the land stops descending there. In an optimization process, such a situation is defined as a local minimum (whereas your target is a global minimum, that is, the best minimum possible) and it is a possible outcome of your journey downhill depending on the function you are working on minimizing. The good news, in any case, is that the error function of the linear model family is a bowl-shaped one (technically, our cost function is a convex one) and it is unlikely that you can get stuck anywhere if you properly descend. The necessary steps to work out a gradient-descent-based solution are hereby described. Given our cost function for a set of coefficients (the vector w): J(w) = 1/(2n) * Σ (h(x_i, w) - y_i)^2, where the hypothesis h(x_i, w) = x_i * w is the prediction for the i-th observation. We first start by choosing a random initialization for w by choosing some random numbers (taken from a standardized normal curve, for instance, having zero mean and unit variance). Then, we start reiterating an update of the values of w (opportunely using the gradient descent computations) until the marginal improvement from the previous J(w) is small enough to let us figure out that we have finally reached an optimum minimum. We can opportunely update our coefficients, separately one by one, by subtracting from each of them a portion alpha (α, the learning rate) of the partial derivative of the cost function: w_j := w_j - α * ∂J(w)/∂w_j. Here, in our formula, w_j is to be intended as a single coefficient (we are iterating over them). After resolving the partial derivative, the final resolution form is: w_j := w_j - α * (1/n) * Σ (h(x_i, w) - y_i) * x_{i,j}. Simplifying everything, our gradient for the coefficient of x_j is just the average of the prediction errors (the differences between the predicted and the actual values) multiplied by their respective x_j values. We have to notice that by introducing more parameters to be estimated during the optimization procedure, we are actually introducing more dimensions to our line of fit (turning it into a hyperplane, a multidimensional surface) and such dimensions have certain communalities and differences to be taken into account. Alpha, called the learning rate, is very important in the process, because if it is too large, it may cause the optimization to detour and fail.
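To make the update rule concrete, here is a tiny numeric check (our own toy example with made-up numbers, using NumPy) that applies w_j := w_j - α * (1/n) * Σ (h(x_i, w) - y_i) * x_{i,j} to a one-feature problem where the true coefficient is 3:

import numpy as np

X = np.array([[1.0], [2.0]])   # two observations of a single feature
y = np.array([3.0, 6.0])       # generated as y = 3 * x
w = np.array([0.0])            # starting point
alpha = 0.1

for step in range(3):
    residuals = X.dot(w) - y                  # h(x_i, w) - y_i for every observation
    gradient = (residuals * X[:, 0]).mean()   # average residual times x
    w = w - alpha * gradient
    print(step, w[0], gradient)

Each step moves w from 0 toward the true value of 3, and the size of every step is governed by alpha.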
You have to think of each gradient as a jump or as a run in a direction. If you take it in full, you may overshoot the optimum minimum and end up on another rising slope. Too many consecutive long steps may even force you to climb up the cost slope, worsening your initial position (as measured by the cost function, the summed squared loss that acts as an overall score of fitness). Using a small alpha, the gradient descent won't jump beyond the solution, but it may take much longer to reach the desired minimum. How to choose the right alpha is a matter of trial and error. Anyway, starting from an alpha such as 0.01 is never a bad choice based on our experience in many optimization problems. Naturally, the gradient, given the same alpha, will in any case produce shorter steps as you approach the solution. Visualizing the steps in a graph can really give you a hint about whether the gradient descent is working out a solution or not. Though conceptually quite simple (it is based on an intuition that we surely have applied ourselves, moving step by step toward a better result), gradient descent is very effective and indeed scalable when working with real data. Such interesting characteristics elevated it to be the core optimization algorithm in machine learning, not being limited to just the linear model family, but also, for instance, extended to neural networks for the process of back propagation that updates all the weights of the neural net in order to minimize the training errors. Surprisingly, gradient descent is also at the core of another complex machine learning algorithm, the gradient boosting tree ensembles, where we have an iterative process minimizing the errors using a simpler learning algorithm (a so-called weak learner because it is limited by a high bias) for progressing toward the optimization. Several Scikit-learn linear models present in the linear_model module, such as SGDRegressor, are actually powered by gradient descent, making Scikit-learn our favorite choice while working on data science projects with large and big data. Feature scaling While using the classical statistical approach, not the machine learning one, working with multiple features requires attention while estimating the coefficients because of their similarities that can cause a variance inflation of the estimates. Moreover, multicollinearity between variables also bears other drawbacks because it can make matrix inversion, the matrix operation at the core of the normal equation coefficient estimation, very difficult, if not impossible, to achieve (such a problem is due to a mathematical limitation of that algorithm). Gradient descent, instead, is not affected at all by reciprocal correlation, allowing the estimation of reliable coefficients even in the presence of perfect collinearity. Anyway, though it is quite resistant to the problems that affect other approaches, gradient descent's simplicity makes it vulnerable to other common problems, such as the different scales present in the features. In fact, some features in your data may be represented by measurements in units, some others in decimals, and others in thousands, depending on what aspect of reality each feature represents.
For instance, in the dataset we decide to take as an example, the Boston houses dataset (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html), a feature is the average number of rooms (a float ranging from about 5 to over 8), others are the percentage of certain pollutants in the air (float between 0 and 1), and so on, mixing very different measurements. When it is the case that the features have a different scale, though the algorithm will be processing each of them separately, the optimization will be dominated by the variables with the more extensive scale. Working in a space of dissimilar dimensions will require more iterations before convergence to a solution (and sometimes, there could be no convergence at all). The remedy is very easy; it is just necessary to put all the features on the same scale. Such an operation is called feature scaling. Feature scaling can be achieved through standardization or normalization. Normalization rescales all the values in the interval between zero and one (usually, but different ranges are also possible), whereas standardization operates removing the mean and dividing by the standard deviation to obtain a unit variance. In our case, standardization is preferable both because it easily permits retuning the obtained standardized coefficients into their original scale and because, centering all the features at the zero mean, it makes the error surface more tractable by many machine learning algorithms, in a much more effective way than just rescaling the maximum and minimum of a variable. An important reminder while applying feature scaling is that changing the scale of the features implies that you will have to use rescaled features also for predictions. A simple implementation Let's try the algorithm first using the standardization based on the Scikit-learn preprocessing module: import numpy as np import random from sklearn.datasets import load_boston from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LinearRegression   boston = load_boston() standardization = StandardScaler() y = boston.target X = np.column_stack((standardization.fit_transform(boston.data), np.ones(len(y)))) In the preceding code, we just standardized the variables using the StandardScaler class from Scikit-learn. This class can fit a data matrix, record its column means and standard deviations, and operate a transformation on itself as well as on any other similar matrixes, standardizing the column data. By means of this method, after fitting, we keep a track of the means and standard deviations that have been used because they will come handy if afterwards we will have to recalculate the coefficients using the original scale. 
Now, we just record a few functions for the following computations: def random_w(p):     return np.array([np.random.normal() for j in range(p)])   def hypothesis(X, w):     return np.dot(X,w)   def loss(X, w, y):     return hypothesis(X, w) - y   def squared_loss(X, w, y):     return loss(X, w, y)**2   def gradient(X, w, y):     gradients = list()     n = float(len(y))     for j in range(len(w)):         gradients.append(np.sum(loss(X, w, y) * X[:,j]) / n)     return gradients   def update(X, w, y, alpha=0.01):     return [t - alpha*g for t, g in zip(w, gradient(X, w, y))]   def optimize(X, y, alpha=0.01, eta = 10**-12, iterations = 1000):     w = random_w(X.shape[1])     for k in range(iterations):         SSL = np.sum(squared_loss(X,w,y))         new_w = update(X,w,y, alpha=alpha)         new_SSL = np.sum(squared_loss(X,new_w,y))         w = new_w         if k>=5 and (new_SSL - SSL <= eta and new_SSL - SSL >= -eta):             return w     return w We can now calculate our regression coefficients: w = optimize(X, y, alpha = 0.02, eta = 10**-12, iterations = 20000) print ("Our standardized coefficients: " +   ', '.join(map(lambda x: "%0.4f" % x, w))) Our standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328 A simple comparison with Scikit-learn's solution can prove if our code worked fine: sk=LinearRegression().fit(X[:,:-1],y) w_sk = list(sk.coef_) + [sk.intercept_] print ("Scikit-learn's standardized coefficients: " + ', '.join(map(lambda x: "%0.4f" % x, w_sk))) Scikit-learn's standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328 A noticeable particular to mention is our choice of alpha. After some tests, the value of 0.02 has been chosen for its good performance on this very specific problem. Alpha is the learning rate and, during optimization, it can be fixed or changed according to a line search method, modifying its value in order to minimize the cost function at each single step of the optimization process. In our example, we opted for a fixed learning rate and we had to look for its best value by trying a few optimization values and deciding on which minimized the cost in the minor number of iterations. Summary In this article we learned about gradient descent, its feature scaling and a simple implementation using an algorithm based on Scikit-learn preprocessing module. Resources for Article:   Further resources on this subject: Optimization Techniques [article] Saving Time and Memory [article] Making Your Data Everything It Can Be [article]

Customizing IPython

Packt
03 Feb 2016
9 min read
In this article written by Cyrille Rossant, author of Learning IPython for Interactive Computing and Data Visualization - Second edition, we look at how the Jupyter Notebook is a highly-customizable platform. You can configure many aspects of the software and can extend the backend (kernels) and the frontend (HTML-based Notebook). This allows you to create highly-personalized user experiences based on the Notebook. In this article, we will cover the following topics: Creating a custom magic command in an IPython extension Writing a new Jupyter kernel Customizing the Notebook interface with JavaScript Creating a custom magic command in an IPython extension IPython comes with a rich set of magic commands. You can get the complete list with the %lsmagic command. IPython also allows you to create your own magic commands. In this section, we will create a new cell magic that compiles and executes C++ code in the Notebook. We first import the register_cell_magic function: In [1]: from IPython.core.magic import register_cell_magic To create a new cell magic, we create a function that takes a line (containing possible options) and a cell's contents as its arguments, and we decorate it with @register_cell_magic, as shown here: In [2]: @register_cell_magic def cpp(line, cell): """Compile, execute C++ code, and return the standard output.""" # We first retrieve the current IPython interpreter # instance. ip = get_ipython() # We define the source and executable filenames. source_filename = '_temp.cpp' program_filename = '_temp' # We write the code to the C++ file. with open(source_filename, 'w') as f: f.write(cell) # We compile the C++ code into an executable. compile = ip.getoutput("g++ {0:s} -o {1:s}".format( source_filename, program_filename)) # We execute the executable and return the output. output = ip.getoutput('./{0:s}'.format(program_filename)) print('n'.join(output)) C++ compiler This recipe requires the gcc C++ compiler. On Ubuntu, type sudo apt-get install build-essential in a terminal. On OS X, install Xcode. On Windows, install MinGW (http://www.mingw.org) and make sure that g++ is in your system path. This magic command uses the getoutput() method of the IPython InteractiveShell instance. This object represents the current interactive session. It defines many methods for interacting with the session. You will find the comprehensive list at http://ipython.org/ipython-doc/dev/api/generated/IPython.core.interactiveshell.html#IPython.core.interactiveshell.InteractiveShell. Let's now try this new cell magic. In [3]: %%cpp #include<iostream> int main() { std::cout << "Hello world!"; } Out[3]: Hello world! This cell magic is currently only available in your interactive session. To distribute it, you need to create an IPython extension. This is a regular Python module or package that extends IPython. To create an IPython extension, copy the definition of the cpp() function (without the decorator) to a Python module, named cpp_ext.py for example. Then, add the following at the end of the file: def load_ipython_extension(ipython): """This function is called when the extension is loaded. It accepts an IPython InteractiveShell instance. We can register the magic with the `register_magic_function` method of the shell instance.""" ipython.register_magic_function(cpp, 'cell') Then, you can load the extension with %load_ext cpp_ext. The cpp_ext.py file needs to be in the PYTHONPATH, for example in the current directory. 
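For completeness, the same machinery also covers line magics, which act on a single line instead of a whole cell. The following is a minimal toy example of our own (the %shout name is made up), registered exactly like the cell magic above:

In [4]: from IPython.core.magic import register_line_magic

@register_line_magic
def shout(line):
    """Return the argument line in upper case."""
    return line.upper()

In [5]: %shout hello notebook
Out[5]: 'HELLO NOTEBOOK'

To ship it in the same cpp_ext.py extension, you would call ipython.register_magic_function(shout, 'line') from load_ipython_extension(), mirroring what we did for the cpp cell magic.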
Writing a new Jupyter kernel Jupyter supports a wide variety of kernels written in many languages, including the most-frequently used IPython. The Notebook interface lets you choose the kernel for every notebook. This information is stored within each notebook file. The jupyter kernelspec command allows you to get information about the kernels. For example, jupyter kernelspec list lists the installed kernels. Type jupyter kernelspec --help for more information. At the end of this section, you will find references with instructions to install various kernels including IR, IJulia, and IHaskell. Here, we will detail how to create a custom kernel. There are two methods to create a new kernel: Writing a kernel from scratch for a new language by re-implementing the whole Jupyter messaging protocol. Writing a wrapper kernel for a language that can be accessed from Python. We will use the second, easier method in this section. Specifically, we will reuse the example from the last section to write a C++ wrapper kernel. We need to slightly refactor the last section's code because we won't have access to the InteractiveShell instance. Since we're creating a kernel, we need to put the code in a Python script in a new folder named cpp: In [1]: %mkdir cpp The %%writefile cell magic lets us create a cpp_kernel.py Python script from the Notebook: In [2]: %%writefile cpp/cpp_kernel.py import os import os.path as op import tempfile # We import the `getoutput()` function provided by IPython. # It allows us to do system calls from Python. from IPython.utils.process import getoutput def exec_cpp(code): """Compile, execute C++ code, and return the standard output.""" # We create a temporary directory. This directory will # be deleted at the end of the 'with' context. # All created files will be in this directory. with tempfile.TemporaryDirectory() as tmpdir: # We define the source and executable filenames. source_path = op.join(tmpdir, 'temp.cpp') program_path = op.join(tmpdir, 'temp') # We write the code to the C++ file. with open(source_path, 'w') as f: f.write(code) # We compile the C++ code into an executable. os.system("g++ {0:s} -o {1:s}".format( source_path, program_path)) # We execute the program and return the output. return getoutput(program_path) Out[2]: Writing cpp/cpp_kernel.py Now we create our wrapper kernel by appending some code to the cpp_kernel.py file created above (that's what the -a option in the %%writefile cell magic is for): In [3]: %%writefile -a cpp/cpp_kernel.py """C++ wrapper kernel.""" from ipykernel.kernelbase import Kernel class CppKernel(Kernel): # Kernel information. implementation = 'C++' implementation_version = '1.0' language = 'c++' language_version = '1.0' language_info = {'name': 'c++', 'mimetype': 'text/plain'} banner = "C++ kernel" def do_execute(self, code, silent, store_history=True, user_expressions=None, allow_stdin=False): """This function is called when a code cell is executed.""" if not silent: # We run the C++ code and get the output. output = exec_cpp(code) # We send back the result to the frontend. 
stream_content = {'name': 'stdout', 'text': output} self.send_response(self.iopub_socket, 'stream', stream_content) return {'status': 'ok', # The base class increments the execution # count 'execution_count': self.execution_count, 'payload': [], 'user_expressions': {}, } if __name__ == '__main__': from ipykernel.kernelapp import IPKernelApp IPKernelApp.launch_instance(kernel_class=CppKernel) Out[3]: Appending to cpp/cpp_kernel.py In production code, it would be best to test the compilation and execution, and to fail gracefully by showing an error. See the references at the end of this section for more information. Our wrapper kernel is now implemented in cpp/cpp_kernel.py. The next step is to create a cpp/kernel.json file describing our kernel: In [4]: %%writefile cpp/kernel.json { "argv": ["python", "cpp/cpp_kernel.py", "-f", "{connection_file}" ], "display_name": "C++" } Out[4]: Writing cpp/kernel.json The argv field describes the command that is used to launch a C++ kernel. More information can be found in the references below. Finally, let's install this kernel with the following command: In [5]: !jupyter kernelspec install --replace --user cpp Out[5]: [InstallKernelSpec] Installed kernelspec cpp in /Users/cyrille/Library/Jupyter/kernels/cpp The --replace option forces the installation even if the kernel already exists. The --user option serves to install the kernel in the user directory. We can test the installation of the kernel with the following command: In [6]: !jupyter kernelspec list Out[6]: Available kernels: cpp python3 Now, C++ notebooks can be created in the Notebook, as shown in the following screenshot: C++ kernel in the Notebook Finally, wrapper kernels can also be used in the IPython terminal or the Qt console, using the --kernel option, for example ipython console --kernel cpp. Here are a few references: Kernel documentation at http://jupyter-client.readthedocs.org/en/latest/kernels.html Wrapper kernels at http://jupyter-client.readthedocs.org/en/latest/wrapperkernels.html List of kernels at https://github.com/ipython/ipython/wiki/IPython%20kernels%20for%20other%20languages bash kernel at https://github.com/takluyver/bash_kernel R kernel at https://github.com/takluyver/IRkernel Julia kernel at https://github.com/JuliaLang/IJulia.jl Haskell kernel at https://github.com/gibiansky/IHaskell Customizing the Notebook interface with JavaScript The Notebook application exposes a JavaScript API that allows for a high level of customization. In this section, we will create a new button in the Notebook toolbar to renumber the cells. The JavaScript API is not stable and not well-documented. Although the example in this section has been tested with IPython 4.0, nothing guarantees that it will work in future versions without changes. The commented JavaScript code below adds a new Renumber button. In [1]: %%javascript // This function allows us to add buttons // to the Notebook toolbar. IPython.toolbar.add_buttons_group([ { // The button's label. 'label': 'Renumber all code cells', // The button's icon. // See a list of Font-Awesome icons here: // http://fortawesome.github.io/Font-Awesome/icons/ 'icon': 'fa-list-ol', // The callback function called when the button is // pressed. 'callback': function () { // We retrieve the lists of all cells. var cells = IPython.notebook.get_cells(); // We only keep the code cells. cells = cells.filter(function(c) { return c instanceof IPython.CodeCell; }) // We set the input prompt of all code cells. 
for (var i = 0; i < cells.length; i++) { cells[i].set_input_prompt(i + 1); } } }]); Executing this cell displays a new button in the Notebook toolbar, as shown in the following screenshot: Adding a new button in the Notebook toolbar You can use the jupyter nbextension command to install notebook extensions (use the --help option to see the list of possible commands). Here are a few repositories with custom JavaScript extensions contributed by the community: https://github.com/minrk/ipython_extensions https://github.com/ipython-contrib/IPython-notebook-extensions So, we have covered several customization options of IPython and the Jupyter Notebook, but there’s so much more that can be done. Take a look at the IPython Interactive Computing and Visualization Cookbook to learn how to create your own custom widgets in the Notebook.

FPGA Mining

Packt
29 Jan 2016
6 min read
In this article by Albert Szmigielski, author of the book Bitcoin Essentials, we will take a look at mining with Field-Programmable Gate Arrays, or FPGAs. These are microprocessors that can be programmed for a specific purpose. In the case of bitcoin mining, they are configured to perform the SHA-256 hash function, which is used to mine bitcoins. FPGAs have a slight advantage over GPUs for mining. The period of FPGA mining of bitcoins was rather short (just under a year), as faster machines became available. The advent of ASIC technology for bitcoin mining compelled a lot of miners to make the move from FPGAs to ASICs. Nevertheless, FPGA mining is worth learning about. We will look at the following: Pros and cons of FPGA mining FPGA versus other hardware mining Best practices when mining with FPGAs Discussion of profitability (For more resources related to this topic, see here.) Pros and cons of FPGA mining Mining with an FPGA has its advantages and disadvantages. Let's examine these in order to better understand if and when it is appropriate to use FPGAs to mine bitcoins. As you may recall, mining started on CPUs, moved over to GPUs, and then people discovered that FPGAs could be used for mining as well. Pros of FPGA mining FPGA mining is the third step in mining hardware evolution. They are faster and more efficient than GPUs. In brief, mining bitcoins with FPGAs has the following advantages: FPGAs are faster than GPUs and CPUs FPGAs are more electricity-efficient per unit of hashing than CPUs or GPUs Cons of FPGA mining FPGAs are rather difficult to source and program. They are not usually sold in stores open to the public. We have not touched upon programming FPGAs to mine bitcoins as it is assumed that the reader has already acquired preprogrammed FPGAs. There are several good resources regarding FPGA programming on the Internet. Electricity costs are also an issue with FPGAs, although not as big as with GPUs. To summarize, mining bitcoins with FPGAs has the following disadvantages: Electricity costs Hardware costs Fierce competition with other miners Best practices when mining with FPGAs Let's look at the recommended things to do when mining with FPGAs. Mining is fun, and it could also be profitable if several factors are taken into account. Make sure that all your FPGAs have adequate cooling. Additional fans beyond what is provided by the manufacturer are always a good idea. Remove dust frequently, as a buildup of dust might have a detrimental effect on cooling efficiency, and therefore, mining speed. For your particular mining machine, look up all the optimization tweaks online in order to get all the hashing power possible out of the device. When setting up a mining operation for profit, keep in mind that electricity costs will be a large percentage of your overall costs. Seek a location with the lowest electricity rates. Think about cooling costs—perhaps it would be most beneficial to mine somewhere where the climate is cooler. When purchasing FPGAs, make sure you calculate hashes per dollar of hardware costs, and also hashes per unit of electricity used. In mining, electricity has the biggest cost after hardware, and electricity will exceed the cost of the hardware over time. Keep in mind that hardware costs fall over time, so purchasing your equipment in stages rather than all at once may be desirable. 
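To make the hashes-per-dollar and hashes-per-watt advice concrete, here is a small back-of-the-envelope sketch in Python; the speeds and wattages come from the benchmark table in the next section, while the prices are purely hypothetical placeholders, not real quotes:

# Speed (MH/s) and power (W) are taken from the benchmark table below;
# the prices in USD are made-up placeholders for illustration only.
devices = {
    "Icarus":        (380, 19.2, 600),
    "ModMiner Quad": (800, 40.0, 1100),
}

for name, (mhs, watts, price) in devices.items():
    print("{0}: {1:.2f} MH/s per USD, {2:.1f} MH/s per Watt".format(
        name, mhs / price, mhs / watts))

Whichever device scores best on both ratios for your budget and electricity rate is usually the better buy.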
To summarize, keep in mind these factors when mining with FPGAs:
Adequate cooling
Optimization
Electricity costs
Hardware cost per MH/s
Benchmarks of mining speeds with different FPGAs As we have mentioned before, the Bitcoin network hash rate is really high now. Mining even with FPGAs does not guarantee profits. This is due to the fact that during the mining process, you are competing with other miners to try to solve a block. If those other miners are running a larger percentage of the total mining power, you will be at a disadvantage, as they are more likely to solve a block. To compare the mining speed of a few FPGAs, look at the following table:
FPGA | Mining speed (MH/s) | Power used (Watts)
Bitcoin Dominator X5000 | 100 | 6.8
Icarus | 380 | 19.2
Lancelot | 400 | 26
ModMiner Quad | 800 | 40
Butterflylabs Mini Rig | 25,200 | 1250
Comparison of the mining speed of different FPGAs
FPGA versus GPU and CPU mining FPGAs hash much faster than CPUs or GPUs. The fastest in our list reaches 25,200 MH/s. FPGAs are faster at performing hashing calculations than both CPUs and GPUs. They are also more efficient with respect to the use of electricity per hashing unit. The increase in hashing speed in FPGAs is a significant improvement over GPUs and even more so over CPUs. The profitability of FPGA mining In calculating your potential profit, keep in mind the following factors:
The cost of your FPGAs
Electricity costs to run the hardware
Cooling costs (FPGAs generate a decent amount of heat)
Your percentage of the total network hashing power
To calculate the expected rewards from mining, we can do the following: First, calculate what percentage of the total hashing power you command. To look up the network mining speed, execute the getmininginfo command in the console of the Bitcoin Core wallet. We will do our calculations with an FPGA that can hash at 1 GH/s. If the Bitcoin network hashes at 400,000 TH/s, then our proportion of the hashing power is 0.001 / 400,000 = 0.0000000025 of the total mining power. A bitcoin block is found, on average, every 10 minutes, which makes six per hour and 144 for a 24-hour period. The current reward per block is 25 BTC; therefore, in a day, we have 144 * 25 = 3600 BTC mined. If we command a certain percentage of the mining power, then on average we should earn that proportion of newly minted bitcoins. Multiplying our portion of the hashing power by the number of bitcoins mined daily, we arrive at the following: 0.0000000025 * 3600 BTC = 0.000009 BTC As one can see, this is roughly $0.0025 USD for a 24-hour period. For up-to-date profitability information, you can look at https://www.multipool.us/, which publishes the average profitability per gigahash of mining power. Summary In this article, we explored FPGA mining. We examined the advantages and disadvantages of mining with FPGAs. It would serve any miner well to ponder them over when deciding to start mining or when thinking about improving current mining operations. We touched upon some best practices that we recommend keeping in mind. We also investigated the profitability of mining, given current conditions. A simple way of calculating your average earnings was also presented. We concluded that mining competition is fierce; therefore, any improvements you can make will serve you well. Resources for Article:  Further resources on this subject:  Bitcoins – Pools and Mining [article] Protecting Your Bitcoins [article] E-commerce with MEAN [article]

Configuring HBase

Packt
25 Jan 2016
14 min read
In this article by Ruchir Choudhry, the author of the book HBase High Performance Cookbook, we will cover the configuration and deployment of HBase. (For more resources related to this topic, see here.) Introduction HBase is an open source, nonrelational, column-oriented distributed database modeled after Google's Cloud BigTable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project, and it runs on top of Hadoop Distributed File System (HDFS), providing BigTable-like capabilities for Hadoop. It's a column-oriented database, which is empowered by a fault-tolerant distributed file structure knows as HDFS. In addition to this, it also provides advanced features, such as auto sharding, load balancing, in-memory caching, replication, compression, near real-time lookups, strong consistency (using multiversions), block caches, and bloom filters for real-time queries and an array of client APIs. Throughout the chapter, we will discuss how to effectively set up mid and large size HBase clusters on top of the Hadoop and HDFS framework. This article will help you set up an HBase on a fully distributed cluster. For the cluster setup, we will consider redhat-6.2 Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 GNU/Linux, which will have six nodes. Configuration and Deployment Before we start HBase in a fully distributed mode, we will first be setting up Hadoop-2.4.0 in a distributed mode, and then, on top of a Hadoop cluster, we will set up HBase because it stores data in Hadoop Distributed File System (HDFS). Check the permissions of the users; HBase must have the ability to create a directory. Let's create two directories in which the data for NameNode and DataNode will reside: drwxrwxr-x 2 app app 4096 Jun 19 22:22 NameNodeData drwxrwxr-x 2 app app 4096 Jun 19 22:22 DataNodeData -bash-4.1$ pwd /u/HbaseB/hadoop-2.4.0 -bash-4.1$ ls -lh total 60K drwxr-xr-x 2 app app 4.0K Mar 31 08:49 bin drwxrwxr-x 2 app app 4.0K Jun 19 22:22 DataNodeData drwxr-xr-x 3 app app 4.0K Mar 31 08:49 etc Getting Ready Following are the steps to install and configure HBase: The first step to start is to choose a Hadoop cluster. Then, get the hardware details required for it. Get the software required to perform the setup. Get the OS required to do the setup. Perform the configuration steps. We will require the following components for NameNode: Components Details Type of systems An operating system redhat-6.2 Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 GNU/Linux, or other standard linux kernel.   Hardware/CPUS 16 to 24 CPUS cores. NameNode /Secondry NameNode. Hardware/RAM 64 to 128 GB. In special cases, 128 GB to 512 GB RAM. NameNode/Secondry NameNodes. Hardware/storage Both NameNode servers should have highly reliable storage for their namespace storage and edit log journaling. Typically, hardware RAID and/or reliable network storage are justifiable options. Note that the previous commands including an onsite disk replacement option in your support contract so that a failed RAID disk can be replaced quickly. NameNode/Secondry Namenodes.   RAID: Raid is nothing but a Random Access Inexpensive Drive or Independent Disk; there are many levels of RAID drives, but for Master or NameNode, RAID-1 will be enough. JBOD: This stands for Just a Bunch of Disk. The design is to have multiple hard drives stacked over each other with no redundancy. The calling software needs to take care of the failure and redundancy. 
In essence, it works as a single logical volume. The following screenshot shows the working mechanism of RAID and JBOD: Before we start for the cluster setup, a quick recap of the Hadoop setup is essential, with brief descriptions. How to do it… Let's create a directory where you will have all the software components to be downloaded: For simplicity, let's take this as /u/HbaseB. Create different users for different purposes. The format will be user/group; this is essentially required to differentiate various roles for specific purposes: HDFS/Hadoop: This is for the handling of Hadoop-related setups Yarn/Hadoop: This is for Yarn-related setups HBase/Hadoop Pig/Hadoop Hive/Hadoop Zookeeper/Hadoop HCat/Hadoop Set up directories for the Hadoop cluster: let's assume /u as a shared mount point; we can create specific directories, which will be used for specific purposes: -bash-4.1$ ls -ltr total 32 drwxr-xr-x 9 app app 4096 Oct 7 2013 hadoop-2.2.0 drwxr-xr-x 10 app app 4096 Feb 20 10:58 zookeeper-3.4.6 drwxr-xr-x 15 app app 4096 Apr 5 08:44 pig-0.12.1 drwxrwxr-x 7 app app 4096 Jun 30 00:57 hbase-0.98.3-hadoop2 drwxrwxr-x 8 app app 4096 Jun 30 00:59 apache-hive-0.13.1-bindrwxrwxr-x 7 app app 4096 Jun 30 01:04 mahout-distribution-0.9 Make sure that you have adequate privileges in the folder to add, edit, and execute a command. Also, you must set up password-less communication between different machines, such as from the name node to DataNode and from HBase Master to all the region server nodes. Refer to this webpage to learn how to do this: http://www.debian-administration.org/article/152/Password-less_logins_with_OpenSSH. Here, we will list the procedure to achieve the end result of the recipe. This section will follow a numbered bullet form. We do not need to explain the reason we are following a procedure. Numbered single sentences will do fine. Let's assume there is a /u directory and you have downloaded the entire stack of software from /u/HbaseB/hadoop-2.2.0/etc/hadoop/; look for the core-site.xml file. Place the following lines in this file: configuration> <property> <name>fs.default.name</name> <value>hdfs://mynamenode-hadoop:9001</value> <description>The name of the default file system. </description> </property> </configuration> You can specify a port that you want to use; it should not clash with the ports that are already in use by the system for various purposes. A quick look at this link can provide more specific details about this; complete detail on this topic is out of the scope of this book. You can refer to http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers. Save the file. This helps us create a master/NameNode directory. Now let's move on to set up secondary nodes. Edit /u/HbaseB/hadoop-2.4.0/etc/hadoop/ and look for the core-site.xml file: <configuration> <property> <name>fs.checkpoint.dir</name> <value>/u/dn001/hadoop/hdf/secdn /u/dn002/hadoop/hdfs/secdn </value> <description>A comma separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. example, /u/dn001/hadoop/hdf/secdn,/u/dn002/hadoop/hdfs/secd n </description> </property> </configuration> The separation of the directory structure is for the purpose of the clean separation of the hdfs block separation and to keep the configurations as simple as possible. This also allows us to do proper maintenance. 
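Because these XML files are edited by hand across several machines, a quick well-formedness check can save debugging time later. This is not part of the original recipe, just a small optional helper (a Python sketch using only the standard library; adjust the path to the file you edited):

import xml.etree.ElementTree as ET

def dump_properties(path):
    """Parse a Hadoop-style configuration file and print its name/value pairs."""
    root = ET.parse(path).getroot()
    for prop in root.findall("property"):
        print("{0} = {1}".format(prop.findtext("name"), prop.findtext("value")))

dump_properties("/u/HbaseB/hadoop-2.2.0/etc/hadoop/core-site.xml")

If the file does not parse or an expected property is missing, fix it before moving on.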
Now let's move toward changing the setup for hdfs; the file location will be /u/HbaseB/hadoop-2.4.0/etc/hadoop/hdfs-site.xmlfor NameNode: <property> <name>dfs.name.dir</name> <value> /u/nn01/hadoop/hdfs/nn/u/nn02/hadoop/hdfs/nn </value> <description> Comma separated list of path, Use the list of directories </description> </property> for DataNode: <property> <name>dfs.data.dir</name> <value>/u/dnn01/hadoop/hdfs/dn,/u/dnn02/hadoop/hdfs/dn </value> <description>Comma separated list of path, Use the list of directories </description> </property> Now let's go for NameNode for the HTTP address or to NameNode using the HTTP protocol: <property> <name>dfs.http.address</name> <value>namenode.full.hostname:50070</value> <description>Enter your NameNode hostname for http access. </description> </property> The HTTP address for the secondary NameNode is as follows: <property> <name>dfs.secondary.http.address</name> <value> secondary.namenode.full.hostname:50090 </value> <description> Enter your Secondary NameNode hostname. </description> </property> We can go for an HTTPS setup for NameNode as well, but let's keep this optional for now: Now let's look for the Yarn setup in the /u/HbaseB/ hadoop-2.2.0/etc/hadoop/ yarn-site.xml file: For the resource tracker that's a part of the Yarn resource manager, execute the following code: <property> <name>yarn.resourcemanager.resourcetracker.address</name> <value>yarnresourcemanager.full.hostname:8025</value> <description>Enter your yarn Resource Manager hostname.</description> </property> For the resource schedule that's part of the Yarn resource scheduler, execute the following code: <property> <name>yarn.resourcemanager.scheduler.address</name> <value>resourcemanager.full.hostname:8030</value> <description>Enter your ResourceManager hostname</description> </property> For scheduler address, execute the following code: <property> <name>yarn.resourcemanager.address</name> <value>resourcemanager.full.hostname:8050</value> <description>Enter your ResourceManager hostname.</description> </property> For scheduler admin address, execute the following code: <property> <name>yarn.resourcemanager.admin.address</name> <value>resourcemanager.full.hostname:8041</value> <description>Enter your ResourceManager hostname.</description> </property> To set up the local directory, execute the following code: <property> <name>yarn.nodemanager.local-dirs</name> <value>/u/dnn01/hadoop/hdfs /yarn,/u/dnn02/hadoop/hdfs/yarn </value> <description>Comma separated list of paths. Use the list of directories from,.</description> </property> To set up the log location, execute the following code: <property> <name>yarn.nodemanager.logdirs</name> <value>/u/var/log/hadoop/yarn</value> <description>Use the list of directories from $YARN_LOG_DIR. <description> </property> This completes the configuration changes required for Yarn Now let's make the changes for MapReduce. Open /u/HbaseB/ hadoop-2.2.0/etc/hadoop/mapred-site.xml. Now let's place this configuration setup in mapred-site.xml and place this between <configuration></configuration>: <property> <name>mapreduce.jobhistory.address</name> <value>jobhistoryserver.full.hostname:10020</value> <description>Enter your JobHistoryServer hostname.</description> </property> Once we have configured MapReduce, we can move on to configuring HBase. Let's go to the /u/HbaseB/hbase-0.98.3-hadoop2/conf path and open the hbase-site.xml file. You will see a template that has <configuration></configurations>. 
We need to add the following lines between the starting and ending tags: <property> <name>hbase.rootdir</name> <value>hdfs://hbase.namenode.full.hostname:8020/apps/hbase/data</value> <description> Enter the HBase NameNode server hostname</description> </property> <property> <!-- this is for the binding address --> <name>hbase.master.info.bindAddress</name> <value>$hbase.master.full.hostname</value> <description>Enter the HBase Master server hostname</description> </property> This completes the HBase changes. ZooKeeper: Now let's focus on the setup of ZooKeeper. In a distributed environment, let's go to the /u/HbaseB/zookeeper-3.4.6/conf location, rename zoo_sample.cfg to zoo.cfg, and place the details as follows: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 If you want to test this setup locally, use different port combinations. Atomic broadcasting is an atomic messaging system that keeps all the servers in sync and provides reliable delivery, total orders, causal orders, and so on. Region servers: Before concluding, let's go to the region server setup process. Go to the /u/HbaseB/hbase-0.98.3-hadoop2/conf folder and edit the regionservers file. Specify the region servers accordingly: RegionServer1 RegionServer2 RegionServer3 RegionServer4 Copy all the configuration files of HBase and ZooKeeper to the respective hosts dedicated to HBase and ZooKeeper. Let's quickly validate the setup that we worked on: sudo su $HDFS_USER /u/HbaseB/hadoop-2.2.0/bin/hadoop namenode -format /u/HbaseB/hadoop-2.4.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode Now let's go to the secondary nodes: sudo su $HDFS_USER /u/HbaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode Now let's perform all the steps for DataNode: sudo su $HDFS_USER /u/HbaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode Test 01> See if you can reach from your browser http://namenode.full.hostname:50070 Test 02> sudo su $HDFS_USER /u/HbaseB/hadoop-2.2.0/sbin/hadoop dfs -copyFromLocal /tmp/hello.txt /u/HbaseB/hadoop-2.2.0/sbin/hadoop dfs -ls you must see hello.txt once the command executes. Test 03> Browse http://datanode.full.hostname:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=$datanode.full.hostname:8020 you should see the details on the datanode.
Validate the Yarn and MapReduce setup by following these steps:

Execute the command from the ResourceManager:
<login as $YARN_USER and source the directories.sh companion script>
/u/HbaseB/hadoop-2.2.0/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Execute the command from the NodeManager:
<login as $YARN_USER and source the directories.sh companion script>
/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

Execute the following commands:
hadoop fs -mkdir /app-logs
hadoop fs -chown $YARN_USER /app-logs
hadoop fs -chmod 1777 /app-logs

Execute MapReduce:
sudo su $HDFS_USER
/u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -mkdir -p /mapred/history/done_intermediate
/u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -chmod -R 1777 /mapred/history/done_intermediate
/u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -mkdir -p /mapred/history/done
/u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -chmod -R 1777 /mapred/history/done
/u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -chown -R mapred /mapred
export HADOOP_LIBEXEC_DIR=/u/HbaseB/hadoop-2.2.0/libexec/
export HADOOP_MAPRED_HOME=/u/HbaseB/hadoop-2.2.0/hadoop-mapreduce
export HADOOP_MAPRED_LOG_DIR=/u/HbaseB/hadoop-2.2.0/mapred

Start the job history server:
<login as $MAPRED_USER and source the directories.sh companion script>
/u/HbaseB/hadoop-2.2.0/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR

Test 01: From the browser, or using curl, browse http://resourcemanager.full.hostname:8088/
Test 02:
sudo su $HDFS_USER
/u/HbaseB/hadoop-2.2.0/bin/hadoop jar /u/HbaseB/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar teragen 100 /test/10gsort/input
/u/HbaseB/hadoop-2.2.0/bin/hadoop jar /u/HbaseB/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar

Validate the HBase setup:

Log in as $HDFS_USER:
/u/HbaseB/hadoop-2.2.0/bin/hadoop fs -mkdir /apps/hbase
/u/HbaseB/hadoop-2.2.0/bin/hadoop fs -chown -R /apps/hbase

Now log in as $HBASE_USER:
/u/HbaseB/hbase-0.98.3-hadoop2/bin/hbase-daemon.sh --config $HBASE_CONF_DIR start master
This will start the master node.

Now let's move to the HBase region server nodes:
/u/HbaseB/hbase-0.98.3-hadoop2/bin/hbase-daemon.sh --config $HBASE_CONF_DIR start regionserver
This will start the region servers.

For a single machine, sudo ./hbase master start can also be used directly. Please check the logs in case of any errors. Now let's log in using sudo su - $HBASE_USER; running ./hbase shell will connect us to the HBase master.

Validate the ZooKeeper setup:
-bash-4.1$ sudo ./zkServer.sh start
JMX enabled by default
Using config: /u/HbaseB/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
You can also pipe the log output to the ZooKeeper log file, /u/HbaseB/zookeeper-3.4.6/zoo.out 2>&1.

Summary

In this article, we learned how to configure and set up HBase. We set up HBase to store data in the Hadoop Distributed File System. We explored the working structure of RAID and JBOD and the differences between both filesystems.

Resources for Article:

Further resources on this subject:

Understanding the HBase Ecosystem [article]
The HBase's Data Storage [article]
HBase Administration, Performance Tuning [article]

Integration with Hadoop

Packt
18 Jan 2016
18 min read
In this article by Cyrus Dasadia, author of the book, MongoDB Cookbook Second Edition, we will cover the following recipes: Executing our first sample MapReduce job using the mongo-hadoop connector Writing our first Hadoop MapReduce job (For more resources related to this topic, see here.) Hadoop is a well-known open source software for the processing of large datasets. It also has an API for the MapReduce programming model, which is widely used. Nearly all the big data solutions have some sort of support to integrate them with Hadoop in order to use its MapReduce framework. MongoDB too has a connector that integrates with Hadoop and lets us write MapReduce jobs using the Hadoop MapReduce API, process the data residing in the MongoDB/MongoDB dumps, and write the result back to the MongoDB/MongoDB dump files. In this article, we will be looking at some recipes around the basic MongoDB and Hadoop integration. Executing our first sample MapReduce job using the mongo-hadoop connector In this recipe, we will see how to build the Mongo-Hadoop connector from the source and set up Hadoop just for the purpose of running the examples in a standalone mode. The connector is the backbone that runs Hadoop MapReduce jobs on Hadoop using the data in Mongo. Getting ready There are various distributions of Hadoop; however, we will use Apache Hadoop (http://hadoop.apache.org/). The installation will be done on Ubuntu Linux. For production, Apache Hadoop always runs on the Linux environment and Windows is not tested for production systems. For development purposes, however, Windows can be used. If you are a Windows user, I would recommend installing a virtualization environment such as VirtualBox (https://www.virtualbox.org/), set up a Linux environment, and then install Hadoop on it. Setting up VirtualBox and Linux on it is not shown in this recipe, but this is not a tedious task. The prerequisite for this recipe is a machine with a Linux operating system on it and an Internet connection. The version that we will set up here is 2.4.0 of Apache Hadoop. The latest version of Apache Hadoop that is supported by the mongo-hadoop connector is 2.4.0. A Git client is needed to clone the repository of the mongo-hadoop connector to the local filesystem. Refer to http://git-scm.com/book/en/Getting-Started-Installing-Git to install Git. You will also need MongoDB to be installed on your operating system. Refer to http://docs.mongodb.org/manual/installation/ and install it accordingly. Start the mongod instance listening to port 27017. It is not expected for you to be an expert in Hadoop but some familiarity with it will be helpful. Knowing the concept of MapReduce is important and knowing about the Hadoop MapReduce API will be an advantage. In this recipe, we will be explaining what is needed to get the work done. You can get more details on Hadoop and its MapReduce API from other sources. The Wikipedia page, http://en.wikipedia.org/wiki/MapReduce, gives good enough information about the MapReduce programming. How to do it… We will first install Java, Hadoop, and the required packages. We will start with installing JDK on the operating system.  
Type the following on the command prompt of the operating system:

$ javac -version

If the program doesn't execute and you are told about the various packages that contain javac, then we need to install it as follows:

$ sudo apt-get install default-jdk

This is all we need to do to install Java.

Visit the URL http://www.apache.org/dyn/closer.cgi/hadoop/common/ and download version 2.4 (or the latest version that the mongo-hadoop connector supports). After the .tar.gz file has been downloaded, execute the following on the command prompt:

$ tar -xvzf <name of the downloaded .tar.gz file>
$ cd <extracted directory>

Open the etc/hadoop/hadoop-env.sh file and replace export JAVA_HOME = ${JAVA_HOME} with export JAVA_HOME = /usr/lib/jvm/default-java.

We will now get the mongo-hadoop connector code from GitHub on our local filesystem. Note that you don't need a GitHub account to clone a repository. Clone the git project from the operating system command prompt as follows:

$ git clone https://github.com/mongodb/mongo-hadoop.git
$ cd mongo-hadoop

Create a soft link; the Hadoop installation directory is the same as the one that we extracted in step 3:

$ ln -s <hadoop installation directory> ~/hadoop-binaries

For example, if Hadoop is extracted/installed in the home directory, then this is the command to be executed:

$ ln -s ~/hadoop-2.4.0 ~/hadoop-binaries

By default, the mongo-hadoop connector will look for a Hadoop distribution in the ~/hadoop-binaries folder. So, even if the Hadoop archive is extracted elsewhere, we can create a soft link to it. Once this link is created, we should have the Hadoop binaries in the ~/hadoop-binaries/hadoop-2.4.0/bin path.

We will now build the mongo-hadoop connector from the source for Apache Hadoop version 2.4.0. The build, by default, builds for the latest version, so as of now the -Phadoop_version parameter can be left out, as 2.4 is the latest anyway:

$ ./gradlew jar -Phadoop_version='2.4'

This build process will take some time to complete.

Once the build is completed successfully, we are ready to execute our first MapReduce job. We will be doing it using a treasuryYield sample provided with the mongo-hadoop connector project. The first activity is to import the data to a collection in Mongo. Assuming that the mongod instance is up and running and listening to port 27017 for connections, and that the current directory is the root of the mongo-hadoop connector code base, execute the following command:

$ mongoimport -c yield_historical.in -d mongo_hadoop --drop examples/treasury_yield/src/main/resources/yield_historical_in.json

Once the import action is successful, we are left with copying two JAR files to the lib directory. Execute the following from the operating system shell:

$ wget http://repo1.maven.org/maven2/org/mongodb/mongo-java-driver/2.12.0/mongo-java-driver-2.12.0.jar
$ cp core/build/libs/mongo-hadoop-core-1.2.1-SNAPSHOT-hadoop_2.4.jar ~/hadoop-binaries/hadoop-2.4.0/lib/
$ mv mongo-java-driver-2.12.0.jar ~/hadoop-binaries/hadoop-2.4.0/lib

The JAR built for the mongo-hadoop core to be copied was named as above for the trunk version of the code and built for hadoop-2.4.0. Change the name of the JAR accordingly when you build it yourself for a different version of the connector and Hadoop. The Mongo driver can be the latest version; 2.12.0 is the latest at the time of writing.
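mongoimport already reports how many documents it inserted, but if you would like to double-check the import from Java before running the job, the following minimal sketch simply counts the documents in the input collection. It assumes the mongo-java-driver 2.12.x JAR we just downloaded is on the classpath and that mongod is listening on localhost:27017, as set up earlier.

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class ImportCheck {
    public static void main(String[] args) throws Exception {
        // Assumption: mongod is running locally on the default port, as described in the Getting ready section
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("mongo_hadoop");
        DBCollection input = db.getCollection("yield_historical.in");
        // A non-zero count confirms that mongoimport populated the input collection for the job
        System.out.println("Documents in yield_historical.in: " + input.count());
        client.close();
    }
}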
Now, execute the following command on the command prompt of the operating system shell:  ~/hadoop-binaries/hadoop-2.4.0/bin/hadoop     jar     examples/treasury_yield/build/libs/treasury_yield-1.2.1-SNAPSHOT-hadoop_2.4.jar  com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig  -Dmongo.input.split_size=8     -Dmongo.job.verbose=true  -Dmongo.input.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.in  -Dmongo.output.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.out The output should print out a lot of things; however, the following line in the output should tell us that the MapReduce job is successful:  14/05/11 21:38:54 INFO mapreduce.Job: Job job_local1226390512_0001 completed successfully Connect the mongod instance running on localhost from the mongo client and execute a find on the following collection: $ mongo > use mongo_hadoop switched to db mongo_hadoop > db.yield_historical.out.find() How it works… Installing Hadoop is not a trivial task and we don't need to get into this to try our samples for the mongo-hadoop connector. To learn about Hadoop and its installation, there are dedicated books and articles available. For the purpose of this article, we will simply be downloading the archive and extracting and running the MapReduce jobs in a standalone mode. This is the quickest way to get going with Hadoop. All the steps up to step 6 are needed to install Hadoop. In the next couple of steps, we will simply clone the mongo-hadoop connector recipe. You can also download a stable, built version for your version of Hadoop from https://github.com/mongodb/mongo-hadoop/releases if you prefer not to build from the source. We will then build the connector for our version of Hadoop (2.4.0) till step 13. From step 14 onwards is what we will do to run the actual MapReduce job in order to work on the data in MongoDB. We imported the data to the yield_historical.in collection, which would be used as an input to the MapReduce job. Go ahead and query the collection from the Mongo shell using the mongo_hadoop database to see a document. Don't worry if you don't understand the contents; we want to see in this example what we intend to do with this data. The next step was to invoke the MapReduce operation on the data. The hadoop command was executed giving one jar's path, (examples/treasury_yield/build/libs/treasury_yield-1.2.1-SNAPSHOT-hadoop_2.4.jar). This is the jar that contains the classes implementing a sample MapReduce operation for the treasury yield. The com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig class in this JAR file is the bootstrap class containing the main method. We will visit this class soon. There are lots of configurations supported by the connector.For now, we will just remember that mongo.input.uri and mongo.output.uri are the collections for the input and output for the MapReduce operations. With the project cloned, you can import it to any Java IDE of your choice. We are particularly interested in the project at /examples/treasury_yield and the core present in the root of the cloned repository. Let's look at the com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig class. This is the entry point to the MapReduce method and has a main method in it. To write MapReduce jobs for mongo using the mongo-hadoop connector, the main class always has to extend from com.mongodb.hadoop.util.MongoTool. This class implements the org.apache.hadoop.Tool interface, which has the run method and is implemented for us by the MongoTool class. 
All that the main method needs to do is execute this class using the org.apache.hadoop.util.ToolRunner class by invoking its static run method, passing an instance of our main class (which is an instance of Tool).

There is a static block that loads the configurations from two XML files, hadoop-local.xml and mongo-defaults.xml. The format of these files (or any XML configuration file) is as follows: the root node of the file is the configuration node, with multiple property nodes under it.

<configuration>
  <property>
    <name>{property name}</name>
    <value>{property value}</value>
  </property>
  ...
</configuration>

The property values that make sense in this context are all those that we mentioned in the URL provided earlier. We instantiate com.mongodb.hadoop.MongoConfig, wrapping an instance of org.apache.hadoop.conf.Configuration, in the constructor of the bootstrap class, TreasuryYieldXmlConfig. The MongoConfig class provides sensible defaults that are enough to satisfy the majority of the use cases. Some of the most important things that we need to set on the MongoConfig instance are the input and output formats, the mapper and reducer classes, the output key and value of the mapper, and the output key and value of the reducer. The input format and output format will always be the com.mongodb.hadoop.MongoInputFormat and com.mongodb.hadoop.MongoOutputFormat classes, which are provided by the mongo-hadoop connector library. For the mapper and reducer output key and value, we have the org.apache.hadoop.io.Writable implementations. Refer to the Hadoop documentation for the different types of writable implementations in the org.apache.hadoop.io package. Apart from these, the mongo-hadoop connector also provides us with some implementations in the com.mongodb.hadoop.io package. For the treasury yield example, we used the BSONWritable instance.

These configurable values can either be provided in the XML files that we saw earlier or be set programmatically (a short sketch of the programmatic approach appears at the end of this recipe, just before the See also section). Finally, we have the option to provide them as VM arguments, as we did for mongo.input.uri and mongo.output.uri. These parameters can be provided either in the XML or invoked directly from the code on the MongoConfig instance; the two methods are setInputURI and setOutputURI, respectively.

We will now look at the mapper and reducer class implementations. We will copy the important portion of the class here to analyze. Refer to the cloned project for the entire implementation.

public class TreasuryYieldMapper
    extends Mapper<Object, BSONObject, IntWritable, DoubleWritable> {

    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
        throws IOException, InterruptedException {
        // The _id of each input document is a date; the year becomes the output key
        final int year = ((Date) pValue.get("_id")).getYear() + 1900;
        // The 10-year bond yield becomes the output value
        double bid10Year = ((Number) pValue.get("bc10Year")).doubleValue();
        pContext.write(new IntWritable(year), new DoubleWritable(bid10Year));
    }
}

Our mapper extends the org.apache.hadoop.mapreduce.Mapper class. The four generic parameters are for the key class, the type of the input value, the type of the output key, and the type of the output value. The body of the map method reads the _id value from the input document, which is a date, and extracts the year out of it. Then, it gets the double value from the document for the bc10Year field and simply writes a key-value pair to the context, where the key is the year and the value is the double.
The implementation here doesn't rely on the value of the pKey parameter passed, which can be used as the key instead of hardcoding the _id value in the implementation. This value is basically the same field that would be set using the mongo.input.key property in XML or using the MongoConfig.setInputKey method. If none is set, _id is anyways the default value. Let's look at the reducer implementation (with the logging statements removed): public class TreasuryYieldReducer     extends Reducer<IntWritable, DoubleWritable, IntWritable, BSONWritable> {       @Override     public void reduce(final IntWritable pKey, final Iterable<DoubleWritable> pValues, final Context pContext)         throws IOException, InterruptedException {         int count = 0;         double sum = 0;         for (final DoubleWritable value : pValues) {             sum += value.get();             count++;         }         final double avg = sum / count;         BasicBSONObject output = new BasicBSONObject();         output.put("count", count);         output.put("avg", avg);         output.put("sum", sum);         pContext.write(pKey, new BSONWritable(output));     } } This class extends from org.apache.hadoop.mapreduce.Reducer and has four generic parameters again for the input key, input value, output key, and output value. The input to the reducer is the output from the mapper, and if you notice carefully, the type of the first two generic parameters are the same as the last two generic parameters of the mapper that we saw earlier. The third and fourth parameters in this case are the type of the key and the value emitted from the reducer. The type of the value is BSONDocument, and thus we have BSONWritable as the type. We now have the reduce method that has two parameters: the first one is the key, which is same as the key emitted from the map function, and the second parameter is java.lang.Iterable of the values emitted for the same key. This is how standard MapReduce functions work. For instance, if the map function gave the following key value pairs, (1950, 10), (1960, 20), (1950, 20), (1950, 30), then reduce will be invoked with two unique keys, 1950 and 1960, and the values for the key 1950 will be an iterable with (10, 20, 30), where the value of 1960 will be an iterable of a single element, (20). The reducer's reduce function simply iterates though this iterable of doubles, finds the sum and count of these numbers, and writes one key value pair, where the key is the same as the incoming key and the output value is BasicBSONObject with the sum, count, and average in it for the computed values. There are some good samples, including the enron dataset, in the examples of the cloned mongo-hadoop connector. If you would like to play around a bit, I would recommend that you to take a look at these example projects too and run them. There's more… What we saw here is a readymade sample that we executed. There is nothing like writing one MapReduce job ourselves for our understanding. In the next recipe, we will write one sample MapReduce job using the Hadoop API in Java and see it in action. 
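As mentioned in the How it works… section, the values supplied through -D options or the XML files can also be set programmatically on the MongoConfig instance. The following is a minimal, hedged sketch of what that could look like for the treasury yield example. The setter names follow the ones discussed above, but the exact signatures and the package of the example's mapper and reducer classes may differ slightly between connector versions, so treat it as an illustration rather than a drop-in replacement.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import com.mongodb.hadoop.MongoConfig;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.examples.treasury.TreasuryYieldMapper;
import com.mongodb.hadoop.examples.treasury.TreasuryYieldReducer;

public class TreasuryYieldProgrammaticConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        MongoConfig config = new MongoConfig(conf);
        // Equivalent to the -Dmongo.input.uri and -Dmongo.output.uri options used on the command line
        config.setInputURI("mongodb://localhost:27017/mongo_hadoop.yield_historical.in");
        config.setOutputURI("mongodb://localhost:27017/mongo_hadoop.yield_historical.out");
        // The same wiring that hadoop-local.xml and mongo-defaults.xml would otherwise provide
        config.setInputFormat(MongoInputFormat.class);
        config.setOutputFormat(MongoOutputFormat.class);
        config.setMapper(TreasuryYieldMapper.class);
        config.setReducer(TreasuryYieldReducer.class);
        config.setMapperOutputKey(IntWritable.class);
        config.setMapperOutputValue(DoubleWritable.class);
        config.setOutputKey(IntWritable.class);
        config.setOutputValue(BSONWritable.class);
        // conf would then be handed to ToolRunner.run(...) from a class extending MongoTool,
        // exactly as the sample's bootstrap class does
    }
}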
See also If you're wondering what the writable interface is all about and why you should not use plain old serialization instead, then refer to this URL, which gives the explanation by the creator of Hadoop himself: http://www.mail-archive.com/[email protected]/msg00378.html Writing our first Hadoop MapReduce job In this recipe, we will write our first MapReduce job using the Hadoop MapReduce API and run it using the mongo-hadoop connector getting the data from MongoDB. Getting ready Refer to the previous recipe, Executing our first sample MapReduce job using mongo-hadoop connector, for the setting up of the mongo-hadoop connector. This is a maven project and thus maven needs to be set up and installed. This project, however, is built on Ubuntu Linux and you need to execute the following command from the operating system shell to get maven: $ sudo apt-get install maven How to do it… We have a Java mongo-hadoop-mapreduce-test project that can be downloaded from the Packt site. We invoked that MapReduce job using the Python and Java client on previous occasions. With the current directory at the root of the project where the pom.xml file is present, execute the following command on the command prompt: $ mvn clean package The JAR file, mongo-hadoop-mapreduce-test-1.0.jar, will be built and kept in the target directory. With the assumption that the CSV file is already imported to the postalCodes collection, execute the following command with the current directory still at the root of the mongo-hadoop-mapreduce-test project that we just built: ~/hadoop-binaries/hadoop-2.4.0/bin/hadoop  jar target/mongo-hadoop-mapreduce-test-1.0.jar  com.packtpub.mongo.cookbook.TopStateMapReduceEntrypoint  -Dmongo.input.split_size=8 -Dmongo.job.verbose=true -Dmongo.input.uri=mongodb://localhost:27017/test.postalCodes -Dmongo.output.uri=mongodb://localhost:27017/test.postalCodesHadoopmrOut Once the MapReduce job is completed, open the Mongo shell by typing the following command on the operating system command prompt and execute the following query from the shell: $ mongo > db.postalCodesHadoopmrOut.find().sort({count:-1}).limit(5) Compare the output to the ones that we got earlier when we executed the MapReduce jobs using Mongo's MapReduce framework. How it works… We have kept the classes very simple and with the bare minimum things that we needed. We just have three classes in our project, TopStateMapReduceEntrypoint, TopStateReducer, and TopStatesMapper, all in the same com.packtpub.mongo.cookbook package. The mapper's map function just writes a key value pair to the context, where the key is the name of the state and value is an integer value, 1. The following is the code snippet from the mapper function: context.write(new Text((String)value.get("state")), new IntWritable(1)); What the reducer gets is the same key that is the list of states and an iterable of an integer value, 1. All we do is write the same name of the state and the sum of the iterables to the context. Now, as there is no size method in the iterable that can give the count in constant time, we are left with adding up all the ones that we get in the linear time. The following is the code in the reducer method: int sum = 0; for(IntWritable value : values) {   sum += value.get(); } BSONObject object = new BasicBSONObject(); object.put("count", sum); context.write(text, new BSONWritable(object)); We will write the text string that is the key and the value that is the JSON document containing the count to the context. 
The mongo-hadoop connector is then responsible for writing to the output collection that we have postalCodesHadoopmrOut, the document with the _id field same as the key emitted. Thus, when we execute the following, we get the top five states with the most number of cities in our database: > db. postalCodesHadoopmrOut.find().sort({count:-1}).limit(5) { "_id" : "Maharashtra", "count" : 6446 } { "_id" : "Kerala", "count" : 4684 } { "_id" : "Tamil Nadu", "count" : 3784 } { "_id" : "Andhra Pradesh", "count" : 3550 } { "_id" : "Karnataka", "count" : 3204 } Finally, the main method of the main entry point class is as follows: Configuration conf = new Configuration(); MongoConfig config = new MongoConfig(conf); config.setInputFormat(MongoInputFormat.class); config.setMapperOutputKey(Text.class); config.setMapperOutputValue(IntWritable.class); config.setMapper(TopStatesMapper.class); config.setOutputFormat(MongoOutputFormat.class); config.setOutputKey(Text.class); config.setOutputValue(BSONWritable.class); config.setReducer(TopStateReducer.class); ToolRunner.run(conf, new TopStateMapReduceEntrypoint(), args); All we do is wrap the org.apache.hadoop.conf.Configuration object with the com.mongodb.hadoop.MongoConfig instance to set the various properties and then submit the MapReduce job for the execution using ToolRunner. See also We executed a simple MapReduce job on Hadoop using the Hadoop API and sourcing the data from MongoDB and writing the data to the MongoDB collection in the recipe. What if we want to write the map and reduce the functions in a different language? Fortunately, this is possible by using a concept called Hadoop streaming, where stdout is used as a means to communicate between the program and the Hadoop MapReduce framework. Summary In this article, you learned about executing our first sample MapReduce job using the mongo-Hadoop connector and writing our first Hadoop MapReduce job. You can also refer to the following books related to MongoDB that are available on our website: MongoDB Cookbook: https://www.packtpub.com/big-data-and-business-intelligence/mongodb-cookbook Instant MongoDB: https://www.packtpub.com/big-data-and-business-intelligence/instant-mongodb-instant MongoDB High Availability: https://www.packtpub.com/big-data-and-business-intelligence/mongodb-high-availability Resources for Article: Further resources on this subject: About MongoDB [article] Ruby with MongoDB for Web Development [article] Sharding in Action [article]

Controlling Relevancy

Packt
18 Jan 2016
19 min read
In this article written by Bharvi Dixit, author of the book Elasticsearch Essentials, we understand that getting a search engine to behave can be very hard. It does not matter whether you are a newbie or have years of experience with Elasticsearch or Solr; you must have struggled with low-quality search results in your application at some point. The default algorithm of Lucene does not come close to meeting your requirements, and there is always a struggle to deliver relevant search results.

We will be covering the following topics:

(For more resources related to this topic, see here.)

Introducing relevant search
The Elasticsearch out-of-the-box tools
Controlling relevancy with custom scoring

Introducing relevant search

Relevancy is the root of a search engine's value proposition and can be defined as the art of ranking content for a user's search based on how much that content satisfies the needs of the user or the business. In an application, it does not matter how beautiful your user interface looks or how many functionalities you provide to the user; search relevancy cannot be avoided at any cost. So, despite the mystical behavior of search engines, you have to find a solution to get relevant results. Relevancy becomes all the more important because a user does not care about the whole bunch of documents that you have. The user enters his keywords, selects filters, and focuses on a very small amount of data: the relevant results. And if your search engine fails to deliver according to expectations, the user might be annoyed, which might be a loss for your business.

A search engine like Elasticsearch comes with built-in intelligence. You enter the keyword, and within the blink of an eye it returns the results that it thinks are relevant according to its intelligence. However, Elasticsearch does not have built-in intelligence for your application domain. Relevancy is not defined by a search engine; rather, it is defined by your users, their business needs, and the domains. Take the example of Google or Twitter: they have put in years of engineering experience, but still fail occasionally while providing relevancy. Don't they? Further, the challenges of search differ with the domain: search on an e-commerce platform is about driving sales and bringing positive customer outcomes, whereas in fields such as medicine, it is a matter of life and death. The lives of search engineers become more complicated because they do not have the domain-specific knowledge that could be used to understand the semantics of user queries. However, despite all the challenges, the implementation of search relevancy is up to you, and it depends on what information you can extract from the users, their queries, and the content they see. We continuously take feedback from the users, create funnels, or enable logging to capture the search behavior of the users so that we can improve our algorithms and provide relevant results.

The Elasticsearch out-of-the-box tools

Elasticsearch primarily works with two models of information retrieval: the Boolean model and the Vector Space model. In addition to these, there are other scoring algorithms available in Elasticsearch as well, such as Okapi BM25, Divergence from Randomness (DFR), and Information Based (IB). Working with these three models requires extensive mathematical knowledge and needs some extra configuration in Elasticsearch. The Boolean model uses the AND, OR, and NOT conditions in a query to find all the matching documents.
This Boolean model can be further combined with the Lucene scoring formula, TF/IDF, to rank documents. The Vector Space model works differently from the Boolean model, as it represents both queries and documents as vectors. In the Vector Space model, each number in the vector is the weight of a term that is calculated using TF/IDF. The queries and documents are compared using cosine similarity, in which the angle between the two vectors is compared to find the similarity, which ultimately leads to finding the relevancy of the documents.

An example: why defaults are not enough

Let's build an index with sample documents to understand the examples in a better way.

First, create an index with the name profiles:

curl -XPUT 'localhost:9200/profiles'

Then, put the mapping with the document type as candidate:

curl -XPUT 'localhost:9200/profiles/_mapping/candidate' -d '{
  "properties": {
    "geo_code": {
      "type": "geo_point",
      "lat_lon": true
    }
  }
}'

Please note that in the preceding mapping, we are defining the mapping only for the geo data type. The rest of the fields will be indexed dynamically.

Now, you can create a data.json file with the following content in it:

{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 1 }}
{ "name" : "Sam", "geo_code" : "12.9545163,77.3500487", "total_experience":5, "skills":["java","python"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 2 }}
{ "name" : "Robert", "geo_code" : "28.6619678,77.225706", "total_experience":2, "skills":["java"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 3 }}
{ "name" : "Lavleen", "geo_code" : "28.6619678,77.225706", "total_experience":4, "skills":["java","Elasticsearch"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 4 }}
{ "name" : "Bharvi", "geo_code" : "28.6619678,77.225706", "total_experience":3, "skills":["java","lucene"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 5 }}
{ "name" : "Nips", "geo_code" : "12.9545163,77.3500487", "total_experience":7, "skills":["grails","python"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 6 }}
{ "name" : "Shikha", "geo_code" : "28.4250666,76.8493508", "total_experience":10, "skills":["c","java"] }

If you are indexing skills that are separated by spaces or that include non-English characters, that is, c++, c#, or core java, you need to create the mapping for the skills field as not_analyzed in advance to have exact term matching.

Once the file is created, execute the following command to put the data inside the index we have just created:

curl -XPOST 'localhost:9200/_bulk' --data-binary @data.json

If you look carefully at the example, the documents contain the data of candidates who might be looking for jobs. For hiring candidates, a recruiter can have the following criteria:

Candidates should know Java
Candidates should have between 3 and 5 years of experience
Candidates should fall within a distance range of 100 kilometers from the office of the recruiter

You can construct a simple bool query in combination with a term query on the skills field along with geo_distance and range filters on the geo_code and total_experience fields, respectively. However, does this give a relevant set of results? The answer would be NO. The problem is that if you are restricting the range of experience and distance, you might even get zero results or no suitable candidate.
For example, you can put a range of 0 to 100 kilometers of distance but your perfect candidate might be at a distance of 101 kilometers. At the same time, if you define a wide range, you might get a huge number of non-relevant results. The other problem is that if you search for candidates who know Java, there are chances that a person who knows only Java and not any other programming language will be at the top, while a person who knows other languages apart from Java will be at the bottom. This happens because during the ranking of documents with TF/IDF, the lengths of the fields are taken into account. If the length of a field is small, the document is more relevant. Elasticsearch is not intelligent enough to understand the semantic meaning of your queries but for these scenarios, it offers you the full power to redefine how scoring and document ranking should be done. Controlling relevancy with custom scoring In most cases, you are good to go with the default scoring algorithms of Elasticsearch to return the most relevant results. However, some cases require you to have more control on the calculation of a score. This is especially required while implementing a domain-specific logic such as finding the relevant candidates for a job, where you need to implement a very specific scoring formula. Elasticsearch provides you with the function_score query to take control of all these things. Here we cover the code examples only in Java because a Python client gives you the flexibility to pass the query inside the body parameter of a search function. Python programmers can simply use the example queries in the same way. There is no extra module required to execute these queries. function_score query Function score query allows you to take the complete control of how a score needs to be calculated for a particular query: Syntax of a function_score query: {   "query": {"function_score": {     "query": {},     "boost": "boost for the whole query",     "functions": [       {}     ],     "max_boost": number,     "score_mode": "(multiply|max|...)",     "boost_mode": "(multiply|replace|...)",     "min_score" : number   }} } The function_score query has two parts: the first is the base query that finds the overall pool of results you want. The second part is the list of functions, which are used to adjust the scoring. These functions can be applied to each document that matches the main query in order to alter or completely replace the original query _score. In a function_score query, each function is composed of an optional filter that tells Elasticsearch which records should have their scores adjusted (defaults to "all records") and a description of how to adjust the score. The other parameters that can be used with a functions_score query are as follows: boost: An optional parameter that defines the boost for the entire query. max_boost: The maximum boost that will be applied by a function score. boost_mode: An optional parameter, which defaults to multiply. Score mode defines how the combined result of the score functions will influence the final score together with the subquery score. This can be replace (only the function score is used, the query score is ignored), max (the maximum of the query score and the function score), min (the minimum of the query score and the function score), sum (the query score and the function score are added), avg, or multiply (the query score and the function score are multiplied). 
score_mode: This parameter specifies how the results of individual score functions will be aggregated. The possible values can be first (the first function that has a matching filter is applied), avg, max, sum, min, and multiply. min_score: The minimum score to be used. Excluding Non-Relevant Documents with min_score To exclude documents that do not meet a certain score threshold, the min_score parameter can be set to the desired score threshold. The following are the built-in functions that are available to be used with the function score query: weight field_value_factor script_score The decay functions—linear, exp, and gauss Let's see them one by one and then you will learn how to combine them in a single query. weight A weight function allows you to apply a simple boost to each document without the boost being normalized: a weight of 2 results in 2 * _score. For example: GET profiles/candidate/_search {   "query": {     "function_score": {       "query": {         "term": {           "skills": {             "value": "java"           }         }       },       "functions": [         {           "filter": {             "term": {               "skills": "python"             }           },           "weight": 2         }       ],       "boost_mode": "replace"     }   } } The preceding query will match all the candidates who know Java, but will give a higher score to the candidates who also know Python. Please note that boost_mode is set to replace, which will cause _score to be calculated by a query that is to be overridden by the weight function for our particular filter clause. The query output will contain the candidates on top with a _score of 2 who know both Java and Python. Java example The previous query can be implemented in Java in the following way: First, you need to import the following classes into your code: import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.client.Client; import org.elasticsearch.index.query.QueryBuilders; import org.elasticsearch.index.query.functionscore.FunctionScoreQueryBuilder; import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders; Then the following code snippets can be used to implement the query: FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))     .add(QueryBuilders.termQuery("skills", "python"),   ScoreFunctionBuilders.weightFactorFunction(2)).boostMode("replace");   SearchResponse response = client.prepareSearch().setIndices(indexName)         .setTypes(docType).setQuery(functionQuery)         .execute().actionGet(); field_value_factor It uses the value of a field in the document to alter the _score: GET profiles/candidate/_search {   "query": {     "function_score": {       "query": {         "term": {           "skills": {             "value": "java"           }         }       },       "functions": [         {           "field_value_factor": {             "field": "total_experience"           }         }       ],       "boost_mode": "multiply"     }   } } The preceding query finds all the candidates with java in their skills, but influences the total score depending on the total experience of the candidate. So, the more experience the candidate will have, the higher ranking he will get. 
Please note that boost_mode is set to multiply, which will yield the following formula for the final scoring: _score = _score * doc['total_experience'].value However, there are two issues with the preceding approach: first are the documents that have the total experience value as 0 and will reset the final score to 0. Second, Lucene _score usually falls between 0 and 10, so a candidate with an experience of more than 10 years will completely swamp the effect of the full text search score. To get rid of this problem, apart from using the field parameter, the field_value_factor function provides you with the following extra parameters to be used: factor: This is an optional factor to multiply the field value with. This defaults to 1. modifier: This is a mathematical modifier to apply to the field value. This can be :none, log, log1p, log2p, ln, ln1p, ln2p, square, sqrt, or reciprocal. It defaults to none. Java example The preceding query can be implemented in Java in the following way: First, you need to import the following classes into your code: import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.client.Client; import org.elasticsearch.index.query.QueryBuilders; import org.elasticsearch.index.query.functionscore*; Then the following code snippets can be used to implement the query: FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))     .add(new FieldValueFactorFunctionBuilder("total_experience")).boostMode("multiply");   SearchResponse response = client.prepareSearch().setIndices("profiles")         .setTypes("candidate").setQuery(functionQuery)         .execute().actionGet(); script_score script_score is the most powerful function available in Elasticsearch. It uses a custom script to take complete control of the scoring logic. You can write a custom script to implement the logic you need. Scripting allows you to write from a simple to very complex logic. Scripts are cached, too, to allow faster executions of repetitive queries. Let's see an example: {   "script_score": {     "script": "doc['total_experience'].value"   } } Look at the special syntax to access the field values inside the script parameter. This is how the value of the fields is accessed using groovy scripting language. Scripting is, by default, disabled in Elasticsearch, so to use script score functions, first you need to add this line in your elasticsearch.yml file: script.inline: on To see some of the power of this function, look at the following example: GET profiles/candidate/_search {   "query": {     "function_score": {       "query": {         "term": {           "skills": {             "value": "java"           }         }       },       "functions": [         {           "script_score": {             "params": {               "skill_array_provided": [                 "java",                 "python"               ]             },             "script": "final_score=0; skill_array = doc['skills'].toArray(); counter=0; while(counter<skill_array.size()){for(skill in skill_array_provided){if(skill_array[counter]==skill){final_score = final_score+doc['total_experience'].value};};counter=counter+1;};return final_score"           }         }       ],       "boost_mode": "replace"     }   } } Let's understand the preceding query: params is the placeholder where you can pass the parameters to your function, similar to how you use parameters inside a method signature in other languages. 
Inside the script parameter, you write your complete logic. This script iterates through each document that has Java mentioned in the skills, and for each document, it fetches all the skills and stores them inside the skill_array variable. Finally, each skill that we have passed inside the params section is compared with the skills inside skill_array. If this matches, the value of the final_score variable is incremented with the value of the total_experience field of that document. The score calculated by the script score will be used to rank the documents because boost_mode is set to replace the original _score value. Do not try to work with the analyzed fields while writing the scripts. You might get weird results. This is because, had our skills field contained a value such as "core java", you could not have got the exact matching for it inside the script section. So, the fields with space-separated values need to be set as not_analyzed or the keyword has to be analyzed in advance. To write these script functions, you need to have some command over groovy scripting. However, if you find it complex, you can write these scripts in other languages, such as python, using the language plugin of Elasticsearch. More on this can be found here: https://github.com/elastic/elasticsearch-lang-python For a fast performance, use Groovy or Java functions. Python and JavaScript code requires the marshalling and unmarshalling of values that kill performances due to more CPU/memory usage. Java example The previous query can be implemented in Java in the following way: First, you need to import the following classes into your code: import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.client.Client; import org.elasticsearch.index.query.QueryBuilders; import org.elasticsearch.index.query.functionscore.*; import org.elasticsearch.script.Script; Then, the following code snippets can be used to implement the query: String script = "final_score=0; skill_array =            doc['skills'].toArray(); "         + "counter=0; while(counter<skill_array.size())"         + "{for(skill in skill_array_provided)"         + "{if(skill_array[counter]==skill)"         + "{final_score =     final_score+doc['total_experience'].value};};"         + "counter=counter+1;};return final_score";   ArrayList<String> skills = new ArrayList<String>();   skills.add("java");   skills.add("python");   Map<String, Object> params = new HashMap<String, Object>();   params.put("skill_array_provided",skills);   FunctionScoreQueryBuilder functionQuery = new   FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))     .add(new ScriptScoreFunctionBuilder(new Script(script,   ScriptType.INLINE, "groovy", params))).boostMode("replace");   SearchResponse response =   client.prepareSearch().setIndices(indexName)         .setTypes(docType).setQuery(functionQuery)         .execute().actionGet(); As you can see, the script logic is a simple string that is used to instantiate the Script class constructor inside ScriptScoreFunctionBuilder. Decay functions - linear, exp, gauss We have seen the problems of restricting the range of experience and distance that could result in getting zero results or no suitable candidates. May be a recruiter would like to hire a candidate from a different province because of a good candidate profile. 
So, instead of completely restricting with the range filters, we can incorporate sliding-scale values such as geo_location or dates into _score to prefer documents near a latitude/longitude point or recently published documents. The function_score query provides support for this sliding scale with the help of three decay functions: linear, exp (that is, exponential), and gauss (that is, Gaussian). All three functions take the same parameters, as shown in the following code, which are required to control the shape of the curve created for the decay function: origin, scale, decay, and offset.

The point of origin is used to calculate distance. For date fields, the default is the current timestamp. The scale parameter defines the distance from the origin at which the computed score will be equal to the decay parameter. The origin and scale parameters can be thought of as your min and max, which define a bounding box within which the curve will be defined. If we want to give more of a boost to documents that have been published in the past 10 days, it would be best to define the origin as the current timestamp and the scale as 10d. The offset specifies that the decay function will only be computed for documents with a distance greater than the defined offset. The default is 0. Finally, the decay option alters how severely the document is demoted based on its position. The default decay value is 0.5.

All three decay functions work only on numeric, date, and geo-point fields.

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "exp": {
            "geo_code": {
              "origin": {
                "lat": 28.66,
                "lon": 77.22
              },
              "scale": "100km"
            }
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}

In the preceding query, we have used the exponential decay function, which tells Elasticsearch to start decaying the score after a distance of 100 km from the given origin. So, the candidates who are at a distance of greater than 100 km from the given origin will be ranked low, but not discarded. These candidates can still get a higher rank if we combine other function score queries, such as weight or field_value_factor, with the decay function and combine the results of all the functions together.
Java example: The preceding query can be implemented in Java in the following way: First, you need to import the following classes into your code: import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.client.Client; import org.elasticsearch.index.query.QueryBuilders; import org.elasticsearch.index.query.functionscore.*; Then, the following code snippets can be used to implement the query: Map<String, Object> origin = new HashMap<String, Object>();     String scale = "100km";     origin.put("lat", "28.66");     origin.put("lon", "77.22"); FunctionScoreQueryBuilder functionQuery = new     FunctionScoreQueryBuilder()     .add(new ExponentialDecayFunctionBuilder("geo_code",origin,     scale)).boostMode("multiply"); //For Linear Decay Function use below syntax //.add(new LinearDecayFunctionBuilder("geo_code",origin,   scale)).boostMode("multiply"); //For Gauss Decay Function use below syntax //.add(new GaussDecayFunctionBuilder("geo_code",origin,   scale)).boostMode("multiply");     SearchResponse response = client.prepareSearch().setIndices(indexName)         .setTypes(docType).setQuery(functionQuery)         .execute().actionGet(); In the preceding example, we have used the exp decay function but, the commented lines show examples of how other decay functions can be used. At last, as always, remember that Elasticsearch lets  you use multiple functions in a single function_score query to calculate a score that combines the results of each function. Summary Overall we covered the most important aspects of search engines, that is, relevancy. We discussed the powerful scoring capabilities available in Elasticsearch and the practical examples to show how you can control the scoring process according to your needs. Despite the relevancy challenges faced while working with search engines, the out–of-the-box features such as functions scores and custom scoring always allow us to tackle challenges with ease. Resources for Article:   Further resources on this subject: An Introduction to Kibana [article] Extending Chef [article] Introduction to Hadoop [article]

Practical Applications of Deep Learning

Packt
14 Jan 2016
20 min read
In this article by Yusuke Sugomori, the author of the book Deep Learning with Java, we'll first see how deep learning is actually applied. Here, you will see that the actual cases where deep learning is utilized are still very few. But why aren't there many cases even though it is such an innovative method? What is the problem? Later on, we'll think about the reasons. Furthermore, going forward, we will also consider which fields we can apply deep learning to, and where we will have the chance to apply deep learning and all the related areas of artificial intelligence.

The topics covered in this article include:

The difficulties of turning deep learning models into practical applications
The possible fields where deep learning can be applied, and ideas on how to approach these fields

We'll explore the potential of this big AI boom, which will lead to ideas and hints that you can utilize in deep learning for your research, business, and many other sorts of activities.

(For more resources related to this topic, see here.)

The difficulties of deep learning

Deep learning has already achieved higher precision than humans in the image recognition field and has been applied to quite a lot of practical applications. Similarly, in the NLP field, many models have been researched. Then, how much is deep learning utilized in other fields? Surprisingly, there are still few fields where deep learning is successfully utilized. This is because deep learning is indeed innovative compared to past algorithms and definitely lets us take a big step towards materializing AI; however, it has some problems when it comes to practical applications.

The first problem is that there are too many model parameters in deep learning algorithms. We didn't look into this in detail when you learned about the theory and implementation of the algorithms, but deep neural networks actually have many hyperparameters that need to be decided compared to past neural networks or other machine learning algorithms. This means we have to go through more trial and error to get high precision. Combinations of parameters that define the structure of a neural network, such as how many hidden layers are to be set or how many units each hidden layer should have, need lots of experiments. Also, the parameters for the training and test configurations, such as the learning rate, need to be determined. Furthermore, parameters peculiar to each algorithm, such as the corruption level in SDA and the size of kernels in CNN, need additional trial and error. Thus, the great performance that deep learning provides is supported by steady parameter tuning. However, people only look at one side of deep learning, that it can get great precision, and they tend to forget the hard process required to reach that point. Deep learning is not magic.

In addition, deep learning often fails to train and classify data even for simple problems. The shape of deep neural networks is so deep and complicated that the weights can't be optimized well. In terms of optimization, data quantities are also important. This also means that deep neural networks require a significant amount of time for each training run.
To sum up, deep learning shows its worth when: It solves complicated and hard problems when people have no idea what feature they can be classified as There is sufficient training data to properly optimize deep neural networks Compared to applications that constantly update a model using continuously updated data, once a model is built using a large-scale data set that doesn't change drastically, applications that use the model universally are rather well suited for deep learning. Therefore, when you look at business fields, you can say that there are more cases where the existing machine learning can get better results than using deep learning. For example, let's assume we would like to recommend appropriate products to users in an EC. In this EC, many users buy a lot of products daily, so purchase data is largely updated daily. In this case, do you use deep learning to get high-precision classification and recommendations to increase the conversion rates of users' purchases using this data? Probably not, because using the existing machine learning algorithms such as Naive Bayes, collaborative filtering, SVM, and so on, we can get sufficient precision from a practical perspective and can update the model and calculate quicker, which is usually more appreciated. This is why deep learning is not applied much in business fields. Of course, getting higher precision is better in any field, but in reality, higher precision and the necessary calculation time are in a trade-off relationship. Although deep learning is significant in the research field, it has many hurdles yet to clear considering practical applications. Besides, deep learning algorithms are not perfect, and they still need many improvements to their model itself. For example, RNN, as mentioned earlier, can only satisfy either how past information can be reflected to a network or how precision can be obtained, although it's contrived with techniques such as LSTM. Also, deep learning is still far from the true AI, although it's definitely a great technique compared to the past algorithms. Research on algorithms is progressing actively, but in the meantime, we need one more breakthrough to spread out and infiltrate deep learning into broader society. Maybe this is not just the problem of a model. Deep learning is suddenly booming because it is reinforced by huge developments in hardware and software. Deep learning is closely related to development of the surrounding technology. As mentioned earlier, there are still many hurdles to clear before deep learning can be applied more practically in the real world, but this is not impossible to achieve. It isn't possible to suddenly invent AI to achieve technological singularity, but there are some fields and methods where deep learning can be applied right away. In the next section, we’ll think about what kinds of industries deep learning can be utilized in. Hopefully, it will sew the seeds for new ideas in your business or research fields. The approaches to maximize deep learning possibilities and abilities There are several approaches on how we can apply deep learning to various industries. 
While it is true that the right approach can differ depending on the task or purpose, we can briefly categorize the approaches into the following three:

Field-oriented approach: This utilizes deep learning algorithms or models that are already thoroughly researched and can lead to great performance
Breakdown-oriented approach: This recasts a problem that deep learning apparently can't be applied to directly as a different problem that deep learning can handle well
Output-oriented approach: This explores new ways of expressing the outputs we get from deep learning

These approaches are all explained in detail in the following subsections. Each approach has industries it is better or worse suited to, but any of them could be a big hint for your activities going forward. There are still very few use cases of deep learning, and they are biased towards particular fields, but this means there should be many chances to create innovative, new things. Start-ups that utilize deep learning have been emerging recently, and some of them have already achieved success to some extent. You can have a significant impact on the world depending on your ideas.

Field-oriented approach

This approach doesn't require new techniques or algorithms. There are obviously fields that are well suited to the current deep learning techniques, and the concept here is to dive into these fields. As explained previously, since the deep learning algorithms that have been practically studied and developed are mostly in image recognition and NLP, we'll explore some fields that can work in great harmony with them.

Medicine

Medical fields should see great development thanks to deep learning. Tumors and cancers are detected on scanned images, which means nothing other than being able to utilize one of the strongest features of deep learning: image recognition. It is possible to dramatically increase precision using deep learning to help with the early detection of an illness and the identification of its kind. Since CNN can be applied to 3D images, 3D scanned images should be relatively easy to analyze. By being adopted more widely in the current medical field, deep learning should contribute greatly.

We can also say that deep learning can be significantly useful for the future medical field. The medical field has been under strict regulations; however, a movement to ease these regulations is progressing in some countries, probably because of the recent development of IT and its potential. Therefore, there will be business opportunities where the medical field and IT can have a synergistic effect. For example, if telemedicine becomes more widespread, there is the possibility that diagnosing or identifying a disease can be done not only from a scanned image, but also from an image shown in real time on a display. Also, if electronic charts become widespread, it will be easier to analyze medical data using deep learning, because medical records, as datasets of text and images, are well matched to it. Then, the symptoms of as yet unknown diseases could be discovered.

Automobiles

We can say that the surroundings of a moving car are image sequences and text: other cars and the scenery are images, and road signs carry text. This means we can utilize deep learning techniques here too, and it is possible to reduce the risk of accidents by improving driving-assistance functions. The ultimate form of driving assistance is the self-driving car, which is being tackled mainly by Google and Tesla.
A famous and fascinating example is when George Hotz, the first person to hack the iPhone, built a self-driving car in his garage. The car was introduced in an article by Bloomberg Business (http://www.bloomberg.com/features/2015-george-hotz-self-driving-car/), which included an image of the vehicle.

Self-driving cars have already been tested in the U.S., but since other countries have different traffic rules and road conditions, the idea requires further study and development before self-driving cars are commonly used worldwide. The key to success in this field lies in learning and recognizing surrounding cars, people, views, and traffic signs, and properly judging how to act on them. In the meantime, we don't have to focus only on applying deep learning techniques to the actual body of a car. Let's assume we could develop a smartphone app with the same function we just described, that is, recognizing and classifying surrounding images and text. Then, if you just set the smartphone up in your car, you could use it as a car-navigation app. It could also be used, for example, as a navigation app for blind people, providing them with good, reliable directions.

Advert technologies

Advert (ad) technologies could expand their coverage with deep learning. When we say ad technologies, this currently means recommendation engines or ad networks that optimize which ad banners or products are shown. On the other hand, advertising doesn't only mean banners or ad networks. There are various kinds of ads in the world depending on the type of media, such as TV ads, radio ads, newspaper ads, posters, flyers, and so on. We also have digital ad campaigns on YouTube, Vine, Facebook, Twitter, Snapchat, and so on. Advertising itself has changed its definition and content, but all ads have one thing in common: they consist of images and/or language. This means they lie in territory that deep learning is good at. Until now, we could only use user-behavior-based indicators, such as page views (PV), click-through rate (CTR), and conversion rate (CVR), to estimate the effect of an ad, but if we apply deep learning technologies, we might be able to analyze the actual content of an ad and even autogenerate ads going forward. In particular, since movies and videos can only be analyzed by building on image recognition and NLP, video recognition, not just image recognition, will gather momentum, and not only in ad technologies.

Profession or practice

Professions such as doctor, lawyer, patent attorney, and accountant are considered to be roles that deep learning can replace. For example, if NLP's precision and accuracy get higher, any perusal that requires expertise can be left to a machine. As a machine covers these time-consuming reading tasks, people can focus more on high-value tasks. In addition, if a machine classifies past judicial cases or medical cases, recording which disease caused which symptoms and so on, we would be able to build an app like Apple's Siri that answers simple questions that usually require professional knowledge. The machine could then handle such professional cases to some extent when a doctor or a lawyer is too busy to help in a timely manner. It's often said that AI takes away humans' jobs, but personally, this seems incorrect. Rather, a machine takes away menial work, which should support humans. A software engineer who works on AI programming can be described as having a professional job, but this work will also change in the future.
For example, think about a car-related job: the current work is building standard automobiles, but in the future engineers may be in a position much like the pit crews for Formula 1 cars.

Sports

Deep learning can certainly contribute to sports as well. In the field of study known as sports science, it has become increasingly important to analyze and examine data from sports. As an example, you may know the book or movie Moneyball, in which a baseball team hugely increased its win percentage by adopting a regression model. Watching sports is itself very exciting, but on the other hand, a sport can be seen as a chunk of image sequences and numerical data. Since deep learning is good at identifying features that humans can't find, it will become easier to find out why certain players get good results while others don't.

The fields we have mentioned are only a small part of the many fields where deep learning is capable of contributing significantly to development. We have looked into these fields from the perspective of whether a field has images or text, but of course deep learning should also show great performance for straightforward analysis of general numerical data. It should be possible to apply deep learning to various other fields such as bioinformatics, finance, agriculture, chemistry, astronomy, economics, and more.

Breakdown-oriented approach

This approach might be similar to the approach taken with traditional machine learning algorithms. We already talked about how feature engineering is the key to improving precision in machine learning. We can say that this feature engineering divides into the following two parts:

Engineering under the constraints of a machine learning model. The typical case is making inputs discrete or continuous.
Feature engineering to increase the precision obtained by machine learning. This tends to rely on the intuition of a researcher.

In the narrower sense, feature engineering means the second one, and this is the part that deep learning doesn't have to focus on, whereas the first one is definitely important even for deep learning. For example, it's difficult to predict stock prices using deep learning. Stock prices are volatile, and it's difficult to define the inputs. Besides, how to handle the output value is also a difficult problem. Enabling deep learning to handle these inputs and outputs can also be called feature engineering in the wider sense.

If there is no limitation on the values of the original data and/or the data you would like to predict, it's difficult to feed these datasets into machine learning and deep learning algorithms, including neural networks. However, we can still apply a model to such problems by breaking down the inputs and/or outputs. In terms of NLP, as explained earlier, you might have thought that it would be impossible to turn countless words into features in the first place, but as you already know, we can train feed-forward neural networks on words by representing them as sparse vectors and combining N-grams into them (a small sketch of building such sparse n-gram features appears below). Of course, we can use not only neural networks but also other machine learning algorithms such as SVM here. Thus, by engineering features to fit well into deep learning models, we can cultivate new fields where deep learning hasn't yet been applied. In the meantime, when we focus on NLP, we can see that RNN and LSTM were developed to properly resolve the difficulties and tasks encountered in NLP.
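Here is a minimal, self-contained Java sketch of the sparse unigram-plus-bigram representation mentioned above. The class and method names are hypothetical, and the toy corpus is invented purely for illustration; a real pipeline would add proper tokenization, feature hashing, or pruning on top of this.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SparseNGramFeatures {

    // Extract unigrams and bigrams from a whitespace-tokenized sentence.
    static List<String> ngrams(String sentence) {
        String[] tokens = sentence.toLowerCase().split("\\s+");
        List<String> grams = new ArrayList<>(Arrays.asList(tokens));
        for (int i = 0; i + 1 < tokens.length; i++) {
            grams.add(tokens[i] + "_" + tokens[i + 1]);   // bigram
        }
        return grams;
    }

    public static void main(String[] args) {
        // A toy corpus, invented purely for illustration.
        List<String> corpus = List.of(
                "deep learning needs much data",
                "classic machine learning updates quickly");

        // Build the vocabulary: every observed n-gram gets a fixed dimension index.
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (String sentence : corpus) {
            for (String gram : ngrams(sentence)) {
                vocabulary.putIfAbsent(gram, vocabulary.size());
            }
        }

        // Encode a new sentence as a sparse vector: dimension index -> count.
        Map<Integer, Integer> sparseVector = new TreeMap<>();
        for (String gram : ngrams("deep learning updates quickly")) {
            Integer index = vocabulary.get(gram);
            if (index != null) {
                sparseVector.merge(index, 1, Integer::sum);
            }
        }

        System.out.println("Vocabulary size: " + vocabulary.size());
        System.out.println("Sparse feature vector (index=count): " + sparseVector);
    }
}

Each n-gram gets a fixed dimension index, so a sentence becomes a handful of (index, count) pairs, which is exactly the kind of sparse input a feed-forward network or an SVM can consume.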
Devising RNN and LSTM for NLP can be considered the opposite approach to this feature engineering, because in that case the problem is solved by breaking down the model to fit the features. Then, how do we do the engineering for the stock prediction problem we just mentioned? It's actually not difficult to think of the inputs, that is, the features. For example, if you predict stock prices daily, the raw daily prices are hard to work with as features, but if you use the rate of price change between one day and the day before, it becomes much easier to process, as the values stay within a certain range and the gradients won't explode easily.

Meanwhile, what is difficult is how to deal with the outputs. Stock prices are of course continuous values, so the output could take any value. This means that a neural network model, where the number of units in the output layer is fixed, can't handle the problem as it stands. What should we do here? Should we give up? No, wait a minute. Unfortunately, we can't predict the stock price itself, but there is an alternative way to make a prediction. Here, the problem is that the stock prices to be predicted fall into an infinite number of patterns. Then, can we reduce them to a limited number of patterns? Yes, we can, if we force the issue. Think about the most extreme but easy-to-understand case: predicting whether tomorrow's stock price, strictly speaking the close price, is up or down, using the stock price data up to today. For this case, we can show it with a deep learning model as follows:

In the preceding image, the inputs are each day's open price, close price, high price, and the actual price. The features used here are mere examples, and need to be fine-tuned when applied to real applications. The point here is that replacing the original task with this type of problem enables deep neural networks, at least in theory, to classify the data. Furthermore, if you classify the data by how much it will go up or down, you can make more detailed predictions. For example, you could classify the data as shown in the following table:

Class     Description
Class 1   Up more than 3 percent from the closing price
Class 2   Up 1 to 3 percent from the closing price
Class 3   Up 0 to 1 percent from the closing price
Class 4   Down 0 to 1 percent from the closing price
Class 5   Down 1 to 3 percent from the closing price
Class 6   Down more than 3 percent from the closing price

Whether the prediction actually works, in other words whether the classification works, is unknown until we examine it, but by dividing the outputs into multiple classes, the fluctuation of stock prices can be predicted within quite a narrow range. Once we can adapt the task to neural networks, what remains is simply to examine which model gets better results. In this example, we might apply RNN, because stock prices are time-sequential data. If we treat charts showing the price as image data, we can also use CNN to predict the future price.

So we've now thought about the approach by referring to examples, but to sum up in general, we can say that:

Feature engineering for models: This is designing inputs or adjusting values to fit deep learning models, or enabling classification by setting a limit on the outputs
Model engineering for features: This is devising new neural network models or algorithms to solve problems in a focused field

The first one needs ideas for designing inputs and outputs that fit a model (a small sketch using the stock example appears below), whereas the second one needs a mathematical approach.
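As a concrete illustration of that first approach, here is a minimal, self-contained Java sketch, with hypothetical names and toy prices invented for the example, that converts raw close prices into bounded percent-change features and labels each following day with one of the six classes from the table above.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StockClassLabels {

    // Map a percent change onto one of the six classes from the table above.
    static int toClass(double percentChange) {
        if (percentChange > 3.0)  return 1;  // up more than 3 percent
        if (percentChange > 1.0)  return 2;  // up 1 to 3 percent
        if (percentChange > 0.0)  return 3;  // up 0 to 1 percent
        if (percentChange > -1.0) return 4;  // down 0 to 1 percent
        if (percentChange > -3.0) return 5;  // down 1 to 3 percent
        return 6;                            // down more than 3 percent
    }

    public static void main(String[] args) {
        // Toy close prices, invented for illustration; real data would come
        // from a market feed.
        double[] close = {100.0, 101.2, 99.8, 103.1, 102.9, 99.5};
        int window = 3; // how many past percent changes feed one prediction

        // Percent change between consecutive days keeps the inputs bounded.
        double[] change = new double[close.length - 1];
        for (int i = 1; i < close.length; i++) {
            change[i - 1] = (close[i] - close[i - 1]) / close[i - 1] * 100.0;
        }

        // Build (features, class label) pairs: the past `window` changes are
        // the inputs, and the next day's change is mapped to a class.
        List<double[]> features = new ArrayList<>();
        List<Integer> labels = new ArrayList<>();
        for (int i = window; i < change.length; i++) {
            features.add(Arrays.copyOfRange(change, i - window, i));
            labels.add(toClass(change[i]));
        }

        for (int i = 0; i < labels.size(); i++) {
            System.out.printf("features=%s -> class %d%n",
                    Arrays.toString(features.get(i)), labels.get(i));
        }
    }
}

The resulting (features, class) pairs are what you would then feed to whichever model you decide to examine, whether an RNN over the sequence or a CNN over chart images.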
Feature engineering might be the easier one to start with, as long as you stay conscious of limiting the item you want to predict.

Output-oriented approach

The two approaches mentioned previously are about increasing the percentage of correct answers for a certain field's task or problem using deep learning. Of course, this is essential, and it is the part where deep learning proves its worth; however, pushing precision to the ultimate level may not be the only way of utilizing deep learning. Another approach is to devise the outputs of deep learning by slightly changing the point of view. Let's see what this means.

Deep learning is applauded as an innovative approach among researchers and technical experts in AI, but the world in general doesn't know much about its greatness yet. Rather, people pay attention to what a machine can't do. For example, people don't really focus on image recognition results on MNIST using CNN, which achieve a lower error rate than humans; instead, they criticize the fact that a machine can't recognize images perfectly. This is probably because people expect a lot when they hear about and imagine AI. We might need to change this mindset.

Let's consider DORAEMON, a Japanese national cartoon character who is also famous worldwide: a robot with high intelligence and AI who often makes silly mistakes. Do we criticize him? No, we just laugh it off, take it as a joke, and don't get serious. Also, think about DUMMY / DUM-E, the robot arm in the movie Iron Man. It has AI as well, but makes silly mistakes. See, they make mistakes, but we still like them. In this way, it might be better to emphasize the point that machines make mistakes. Changing how the output is expressed in the user interface, rather than only polishing the algorithm, could be the trigger for people to adopt AI. Who knows? It's highly likely that you can gain the world's interest by thinking of ideas in creative fields, not just from the perspective of precision. Deep Dream by Google is one good example. We can do more exciting things when art or design and deep learning collaborate.

Summary

And… congratulations! You've just accomplished the learning part of deep learning with Java. Although there are still some models that have not been mentioned, you can be sure there will be no problem in acquiring and utilizing them.