
How-To Tutorials - Data


6 Popular Regression Techniques you must know

Amey Varangaonkar
15 Feb 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Statistics for Data Science, authored by IBM expert James D. Miller. This book gives a statistical view of building smart data models to help you get unique insights from the data.[/box] In this article, we introduce you to the concept of regression analysis, one of the most popular machine learning algorithms. -You will learn what is regression analysis, the different types of regression, and how to choose the right regression technique to build your data model. What is Regression Analysis? For starters, regression analysis or statistical regression is a process for estimating the relationships among variables. This process encompasses numerous techniques for modeling and analyzing variables, focusing on the relationship between a dependent variable and one (or more) independent variable (or predictors). Regression analysis is the work done to identify and understand how the (best representative) value of a dependent variable (a variable that depends on other factors) changes when any one of the independent variables (a variable that stands alone and isn't changed by the other variables) is changed while the other independent variables stay the same. A simple example might be how the total dollars spent on marketing (an independent variable example) impacts the total sales dollars (a dependent variable example) over a period of time (is it really as simple as more marketing equates to higher sales?), or perhaps there is a correlation between the total marketing dollars spent (independent variable), discounting a products price (another independent variable), and the amount of sales (a dependent variable)? [box type="info" align="" class="" width=""]Keep in mind this key point that regression analysis is used to understand which among the independent variables are related to the dependent variable(s), not just the relationship of these variables. Also, the inference of causal relationships (between the independent and dependent variables) is an important objective. However, this can lead to illusions or false relationships, so caution is recommended![/box] Overall, regression analysis can be thought of as estimating the conditional expectations of the value of the dependent variable, given the independent variables being observed, that is, endeavoring to predict the average value of the dependent variable when the independent variables are set to certain values. I call this the lever affect—meaning when one increases or decreases a value of one component, it directly affects the value at least one other (variable). An alternate objective of the process of regression analysis is the establishment of location parameters or the quantile of a distribution. In other words, this idea is to determine values that may be a cutoff, dividing a range of a probability distribution values. You'll find that regression analysis can be a great tool for prediction and forecasting (not just complex machine learning applications). We'll explore some real-world examples later, but for now, let's us look at some techniques for the process. Popular regression techniques and approaches You'll find that various techniques for carrying out regression analysis have been developed and accepted.These are: Linear Logistic Polynomial Stepwise Ridge Lasso Linear regression Linear regression is the most basic type of regression and is commonly used for predictive analysis projects. 
In fact, when you are working with a single predictor (variable), we call it simple linear regression, and if there are multiple predictor variables, we call it multiple linear regression. Simply put, linear regression uses linear predictor functions whose values are estimated from the data in the model. Logistic regression Logistic regression is a regression model where the dependent variable is a categorical variable. This means that the variable only has two possible values, for example, pass/fail, win/lose, alive/dead, or healthy/sick. If the dependent variable has more than two possible values, one can use various modified logistic regression techniques, such as multinomial logistic regression, ordinal logistic regression, and so on. Polynomial regression When we speak of polynomial regression, the focus of this technique is on modeling the relationship between the independent variable and the dependent variable as an nth degree polynomial. Polynomial regression is considered to be a special case of multiple linear regressions. The predictors resulting from the polynomial expansion of the baseline predictors are known as interactive features. Stepwise regression Stepwise regression is a technique that uses some kind of automated procedure to continually execute a step of logic, that is, during each step, a variable is considered for addition to or subtraction from the set of independent variables based on some prespecified criterion. Ridge regression Often predictor variables are identified as being interrelated. When this occurs, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which ones are left out. Ridge regression is a technique where a small bias factor is added to the selected variables in order to improve this situation. Therefore, ridge regression is actually considered a remedial measure to alleviate multicollinearity amongst predictor variables. Lasso regression Lasso (Least Absolute Shrinkage Selector Operator) regression is a technique where both predictor variable selection and regularization are performed in order to improve the prediction accuracy and interpretability of the result it produces. Which technique should I choose? In addition to the aforementioned regression techniques, there are numerous others to consider with, most likely, more to come. With so many options, it's important to choose the technique that is right for your data and your project. Rather than selecting the right regression approach, it is more about selecting the most effective regression approach. Typically, you use the data to identify the regression approach you'll use. You start by establishing statistics or a profile for your data. With this effort, you need to identify and understand the importance of the different variables, their relationships, coefficient signs, and their effect. Overall, here's some generally good advice for choosing the right regression approach from your project: Copy what others have done and had success with. Do the research. Incorporate the results of other projects into yours. Don't reinvent the wheel. Also, even if an observed approach doesn't quite fit as it was used, perhaps some simple adjustments would make it a good choice. Keep your approach as simple as possible. Many studies show that simpler models generally produce better predictions. Start simple, and only make the model more complex as needed. 
The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers. Check your work. As you evaluate methods, check the residual plots (more on this in the next section of this chapter) because they can help you avoid inadequate models and adjust your model for better results. Use your subject matter expertise. No statistical method can understand the underlying process or subject area the way you do. Your knowledge is a crucial part and, most likely, the most reliable way of determining the best regression approach for your project. Does it fit? After selecting a model that you feel is appropriate for use with your data (also known as determining that the approach is the best fit), you need to validate your selection, that is, determine its fit. A well-fitting regression model results in predicted values close to the observed data values. The mean model (which uses the mean for every predicted value) would generally be used if there were no informative predictor variables. The fit of a proposed regression model should, therefore, be better than the fit of the mean model. As a data scientist, you will need to scrutinize the coefficients of determination, measure the standard error of estimate, analyze the significance of regression parameters and confidence intervals. [box type="info" align="" class="" width=""]Remember that the better the fit of a regression model, most likely the better the precision in, or just better, the results.[/box] Finally, it has been proven that simple models produce more accurate results! Keep this in mind always when selecting an approach or a technique, and even when the problem might be complex, it is not always obligatory to adopt a complex regression approach. Choosing the right technique, though, goes a long way in developing an accurate model. If you found this excerpt useful, make sure to check out this book Statistics for Data Science for tips on building effective data models by leveraging the power of the statistical tools and techniques.
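As a quick hands-on coda (not from the book), here is a minimal scikit-learn sketch that fits linear, ridge, and lasso regression to a small synthetic marketing-and-discount example like the one described earlier and reports the coefficient of determination on held-out data; the variable names and alpha values are illustrative assumptions only.

# Illustrative comparison of linear, ridge, and lasso regression on synthetic
# data with two correlated predictors (marketing spend and discounting).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 500
marketing = rng.uniform(10, 100, n)                               # independent variable
discount = 0.5 * marketing + rng.normal(0, 5, n)                  # second, correlated predictor
sales = 3.0 * marketing + 1.5 * discount + rng.normal(0, 20, n)   # dependent variable

X = np.column_stack([marketing, discount])
X_train, X_test, y_train, y_test = train_test_split(X, sales, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),    # small bias factor to tame multicollinearity
                    ("lasso", Lasso(alpha=0.1))]:   # shrinkage plus variable selection
    model.fit(X_train, y_train)
    print(name, model.coef_.round(2), "R^2 =", round(model.score(X_test, y_test), 3))

Comparing the printed coefficients shows how ridge tends to shrink the weights of the correlated predictors, while lasso can push some coefficients toward zero, and the held-out R^2 is the kind of fit check discussed in the "Does it fit?" section above.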


How to use R to boost your Data Model

Amey Varangaonkar
15 Feb 2018
8 min read
[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Statistics for Data Science, written by James D. Miller. This book is a comprehensive primer on the basic concepts of statistics and their application in different data science tasks.[/box] In this article, we explain the implementation of boosting - a popular technique used to improve the performance of a data model - using the popular R programming language. We will take a high-level look at a thought-provoking prediction problem drawn from Mastering Predictive Analytics with R, Second Edition, by James D. Miller and Rui Miguel Forte. Here, an original example of patterns made by radiation on a telescope camera are analyzed in an attempt to predict whether a certain pattern came from gamma rays leaking into the atmosphere or from regular background radiation. Gamma rays leave distinctive elliptical patterns and so we can create a set of features to describe these. The dataset used is the MAGIC Gamma Telescope Data Set, hosted by the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope This data consists of 19,020 observations, holding the following list of attributes: Prepping the data First, various steps need to be performed on our example data. The data is first loaded into an R data frame object named magic, recoding the CLASS output variable to use classes 1 and -1 for gamma rays and background radiation respectively: > magic <- read.csv("magic04.data", header = FALSE) > names(magic) <- c("FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1",  "FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST", "CLASS") > magic$CLASS <- as.factor(ifelse(magic$CLASS =='g', 1, -1)) Next, the data is split into two files: a training data and a test data frame using an 80-20 split: > library(caret) > set.seed(33711209) > magic_sampling_vector <- createDataPartition(magic$CLASS,  p = 0.80, list = FALSE) > magic_train <- magic[magic_sampling_vector, 1:10] > magic_train_output <- magic[magic_sampling_vector, 11] > magic_test <- magic[-magic_sampling_vector, 1:10] > magic_test_output <- magic[-magic_sampling_vector, 11] The model used for boosting is a simple multilayer perceptron with a single hidden layer leveraging R's nnet package. Neural networks, often produce higher accuracy when inputs are normalized, so, in this example, before training any models, this preprocessing is performed: > magic_pp <- preProcess(magic_train, method = c("center",  "scale")) > magic_train_pp <- predict(magic_pp, magic_train) > magic_train_df_pp <- cbind(magic_train_pp,  CLASS = magic_train_output) > magic_test_pp <- predict(magic_pp, magic_test) Training Boosting is designed to work best with weak learners, so a very small number of hidden neurons in the model's hidden layer are used. Concretely, we will begin with the simplest possible multilayer perceptron that uses a single hidden neuron. To understand the effect of using boosting, a baseline performance is established by training a single neural network (and measuring its performance). This is to accomplish the following: > library(nnet) > n_model <- nnet(CLASS ~ ., data = magic_train_df_pp, size = 1) > n_test_predictions <- predict(n_model, magic_test_pp,  type = "class") > (n_test_accuracy <- mean(n_test_predictions ==  magic_test_output))  [1] 0.7948988 This establishes that we have a baseline accuracy of around 79.5 percent. Not too bad, but can boost to improve upon this score? 
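If you want to reproduce the same preparation and baseline outside R, the following is a rough Python sketch, not the book's code, using pandas and scikit-learn; the file name, column names, and 80/20 split mirror the R listing above, while the MLPClassifier settings are illustrative assumptions rather than an exact match for nnet.

# Illustrative Python analogue of the R preprocessing and single-network baseline.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

cols = ["FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1",
        "FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST", "CLASS"]
magic = pd.read_csv("magic04.data", header=None, names=cols)
magic["CLASS"] = [1 if c == "g" else -1 for c in magic["CLASS"]]   # gamma = 1, background = -1

X_train, X_test, y_train, y_test = train_test_split(
    magic[cols[:-1]], magic["CLASS"], train_size=0.80,
    stratify=magic["CLASS"], random_state=33711209)

scaler = StandardScaler().fit(X_train)                 # centre and scale, as preProcess() does
clf = MLPClassifier(hidden_layer_sizes=(1,), max_iter=500, random_state=0)   # one hidden neuron
clf.fit(scaler.transform(X_train), y_train)
print("baseline accuracy:", round(clf.score(scaler.transform(X_test), y_test), 4))

With a comparable baseline in place, we can return to the question of whether boosting improves on it.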
To that end, the function AdaBoostNN(), which is shown as follows, is used. This function will take input from a data frame, the name of the output variable, the number of single hidden layer neural network models to be built, and finally, the number of hidden units these neural networks will have. The function will then implement the AdaBoost algorithm and return a list of models with their corresponding weights. Here is the function: AdaBoostNN <- function(training_data, output_column, M, hidden_units) { require("nnet") models <- list() alphas <- list() n <- nrow(training_data) model_formula <- as.formula(paste(output_column, '~ .', sep = '')) w <- rep((1/n), n) for (m in 1:M) { model <- nnet(model_formula, data = training_data, size = hidden_units, weights = w) models[[m]] <- model predictions <- as.numeric(predict(model, training_data[, -which(names(training_data) == output_column)], type = "class")) errors <- predictions != training_data[, output_column] error_rate <- sum(w * as.numeric(errors)) / sum(w) alpha <- 0.5 * log((1 - error_rate) / error_rate) alphas[[m]] <- alpha temp_w <- mapply(function(x, y) if (y) { x * exp(alpha) } else { x * exp(-alpha)}, w, errors) w <- temp_w / sum(temp_w) } return(list(models = models, alphas = unlist(alphas))) } The preceding function uses the following logic: First, initialize empty lists of models and model weights (alphas). Compute the number of observations in the training data, storing this in the variable n. The name of the output column provided is then used to create a formula that describes the neural network that will be built. In the dataset used, this formula will be CLASS ~ ., meaning that the neural network will compute CLASS as a function of all the other columns as input features. Next, initialize the weights vector and define a loop that will run for M iterations in order to build M models. In every iteration, the first step is to use the current setting of the weights vector to train a neural network using as many hidden units as specified in the input, hidden_units. Then, compute a vector of predictions that the model generates on the training data using the predict() function. By comparing these predictions to the output column of the training data, calculate the errors that the current model makes on the training data. This then allows the computation of the error rate. This error rate is set as the weight of the current model and, finally, the observation weights to be used in the next iteration of the loop are updated according to whether each observation was correctly classified. The weight vector is then normalized and we are ready to begin the next iteration! After completing M iterations, output a list of models and their corresponding model weights. Ready for boosting There is now a function able to train our ensemble classifier using AdaBoost, but we also need a function to make the actual predictions. This function will take in the output list produced by our training function, AdaBoostNN(), along with a test dataset. 
This function is AdaBoostNN.predict() and it is shown as follows:

AdaBoostNN.predict <- function(ada_model, test_data) {
  models <- ada_model$models
  alphas <- ada_model$alphas
  prediction_matrix <- sapply(models, function(x)
    as.numeric(predict(x, test_data, type = "class")))
  weighted_predictions <- t(apply(prediction_matrix, 1, function(x)
    mapply(function(y, z) y * z, x, alphas)))
  final_predictions <- apply(weighted_predictions, 1, function(x)
    sign(sum(x)))
  return(final_predictions)
}

This function first extracts the models and the model weights from the list produced by the previous function. A matrix of predictions is created, where each column corresponds to the vector of predictions made by a particular model. Thus, there will be as many columns in this matrix as models used for boosting. We then multiply the predictions produced by each model by their corresponding model weight. For example, every prediction from the first model is in the first column of the prediction matrix and will have its value multiplied by the first model weight α1. Lastly, the matrix of weighted predictions is reduced into a single vector of predictions by summing the weighted predictions for each observation and taking the sign of the result. This vector of predictions is then returned by the function. As an experiment, we will train ten neural network models with a single hidden unit and see if boosting improves accuracy:

> ada_model <- AdaBoostNN(magic_train_df_pp, 'CLASS', 10, 1)
> predictions <- AdaBoostNN.predict(ada_model, magic_test_pp)
> mean(predictions == magic_test_output)
[1] 0.804365

In this example, boosting ten models shows a marginal improvement in accuracy, but training more models might make more of a difference.

What we learned

From the preceding example, you may conclude that, for the neural networks with one hidden unit, as the number of boosting models increases, we see an improvement in accuracy, but after 100 models this tapers off and is actually slightly lower at 200 models. The improvement over the baseline of a single model is substantial for these networks. When we increase the complexity of our learner to a hidden layer with three hidden neurons, we get a much smaller improvement in performance. At 200 models, both ensembles perform at a similar level, indicating that, at this point, our accuracy is being limited by the type of model trained. If you found this article useful, make sure to check out the book Statistics for Data Science for interesting statistical techniques and their implementation in R.
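As a language-agnostic recap of the algorithm described above, here is a compact numpy sketch of the same AdaBoost loop and weighted vote; fit_weak is a hypothetical placeholder for any weak learner (such as the one-hidden-neuron network used here) that accepts observation weights, and none of this is the book's code.

import numpy as np

def adaboost(X, y, M, fit_weak):
    # AdaBoost skeleton: y must contain +1/-1 labels; fit_weak(X, y, w) is a
    # hypothetical callable that returns a fitted model exposing .predict(X).
    n = len(y)
    w = np.full(n, 1.0 / n)                         # start with uniform observation weights
    models, alphas = [], []
    for _ in range(M):
        model = fit_weak(X, y, w)                   # weak learner trained on weighted data
        pred = model.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)   # weighted error rate
        alpha = 0.5 * np.log((1 - err) / err)       # model weight
        w = w * np.exp(np.where(pred != y, alpha, -alpha))   # up-weight the mistakes
        w = w / w.sum()                             # renormalize
        models.append(model)
        alphas.append(alpha)
    return models, np.array(alphas)

def adaboost_predict(models, alphas, X):
    preds = np.column_stack([m.predict(X) for m in models])   # one column per model
    return np.sign(preds @ alphas)                             # weighted majority vote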


How to make efficient data-driven decisions

Aaron Lazar
15 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an extract from the book Predictive Analytics with TensorFlow, authored by Md. Rezaul Karim. The book will help you build, tune, and deploy predictive data models with TensorFlow.[/box] Today we’ll learn to take decisions driven by data with the help of few examples. The growing demand for data is a key challenge. Decision support teams such as institutional research and business intelligence often cannot take the right decisions on how to expand their business and research outcomes from a huge collection of data. Although data plays an important role in driving the decision, however, in reality, taking the right decision at right time is the goal. In other words, the goal is the decision support, not the data support. This can be achieved through an advanced use of data management and analytics. Data value chain for making decisions The following diagram in figure 1 (source: H. Gilbert Miller and Peter Mork, From Data to Decisions: A Value Chain for Big Data, Proc. Of IT Professional, Volume: 15, Issue: 1, Jan.-Feb. 2013, DOI: 10.1109/MITP.2013.11) shows the data chain towards taking actual decisions–that is, the goal. The value chains start through the data discovery stage consisting of several steps such as data collection and annotating data preparation, and then organizing them in a logical order having the desired flow. Then comes the data integration for establishing a common data representation of the data. Since the target is to take the right decision, for future reference having the appropriate provenance of the data–that is, where it comes from, is important: Well, now your data is somehow integrated into a presentable format, it's time for the data exploration stage, which consists of several steps such as analyzing the integrated data and visualization before taking the actions to take on the basis of the interpreted results. However, is this enough before taking the right decision? Probably not! The reason is that it lacks enough analytics, which eventually helps to take the decision with an actionable insight. Predictive analytics comes in here to fill the gap between. Now let's see an example of how in the following section. From disaster to decision – Titanic survival example Here is the challenge, Titanic–Machine Learning from Disaster from Kaggle (https://www.kaggle.com/c/titanic): "The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy" But going into this deeper, we need to know about the data of passengers travelling in the Titanic during the disaster so that we can develop a predictive model that can be used for survival analysis. The dataset can be downloaded from the preceding URL. 
Table 1 here shows the metadata about the Titanic survival dataset: A snapshot of the dataset can be seen as follows: The ultimate target of using this dataset is to predict what kind of people survived the Titanic disaster. However, a bit of exploratory analysis of the dataset is a mandate. At first, we need to import necessary packages and libraries: import pandas as pd import matplotlib.pyplot as plt import numpy as np Now read the dataset and create a panda's DataFrame: df = pd.read_csv('/home/asif/titanic_data.csv') Before drawing the distribution of the dataset, let's specify the parameters for the graph: fig = plt.figure(figsize=(18,6), dpi=1600) alpha=alpha_scatterplot = 0.2 alpha_bar_chart = 0.55 fig = plt.figure() ax = fig.add_subplot(111) Draw a bar diagram for showing who survived versus who did not: ax1 = plt.subplot2grid((2,3),(0,0)) ax1.set_xlim(-1, 2) df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart) plt.title("Survival distribution: 1 = survived") Plot a graph showing survival by Age: plt.subplot2grid((2,3),(0,1)) plt.scatter(df.Survived, df.Age, alpha=alpha_scatterplot) plt.ylabel("Age") plt.grid(b=True, which='major', axis='y') plt.title("Survival by Age: 1 = survived") Plot a graph showing distribution of the passengers classes: ax3 = plt.subplot2grid((2,3),(0,2)) df.Pclass.value_counts().plot(kind="barh", alpha=alpha_bar_chart) ax3.set_ylim(-1, len(df.Pclass.value_counts())) plt.title("Class dist. of the passengers") Plot a kernel density estimate of the subset of the 1st class passengers' age: plt.subplot2grid((2,3),(1,0), colspan=2) df.Age[df.Pclass == 1].plot(kind='kde') df.Age[df.Pclass == 2].plot(kind='kde') df.Age[df.Pclass == 3].plot(kind='kde') plt.xlabel("Age") plt.title("Age dist. within class") plt.legend(('1st Class', '2nd Class','3rd Class'),loc='best') Plot a graph showing passengers per boarding location: ax5 = plt.subplot2grid((2,3),(1,2)) df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart) ax5.set_xlim(-1, len(df.Embarked.value_counts())) plt.title("Passengers per boarding location") Finally, we show all the subplots together: plt.show() >>> The figure shows the survival distribution, survival by age, age distribution, and the passengers per boarding location: However, to execute the preceding code, you need to install several packages such as matplotlib, pandas, and scipy. They are listed as follows: Installing pandas: Pandas is a Python package for data manipulation. It can be installed as follows: $ sudo pip3 install pandas #For Python 2.7, use the following: $ sudo pip install pandas Installing matplotlib: In the preceding code, matplotlib is a plotting library for mathematical objects. It can be installed as follows: $ sudo apt-get install python-matplotlib # for Python 2.7 $ sudo apt-get install python3-matplotlib # for Python 3.x Installing scipy: Scipy is a Python package for scientific computing. Installing blas and lapack and gfortran are a prerequisite for this one. Now just execute the following command on your terminal: $ sudo apt-get install libblas-dev liblapack-dev $ sudo apt-get install gfortran $ sudo pip3 install scipy # for Python 3.x $ sudo pip install scipy # for Python 2.7 For Mac, use the following command to install the above modules: $ sudo easy_install pip $ sudo pip install matplotlib $ sudo pip install libblas-dev liblapack-dev $ sudo pip install gfortran $ sudo pip install scipy For windows, I am assuming that Python 2.7 is already installed at C:Python27. 
Then open the command prompt and type the following commands:

C:\Users\admin-karim> cd C:\Python27
C:\Python27> python -m pip install <package_name> # provide the package name accordingly

For Python 3, issue the following commands:

C:\Users\admin-karim> cd C:\Users\admin-karim\AppData\Local\Programs\Python\Python35\Scripts
C:\Users\admin-karim\AppData\Local\Programs\Python\Python35\Scripts> python3 -m pip install <package_name>

Well, we have seen the data. Now it's your turn to do some analytics on top of it, say, predicting what kinds of people survived the disaster. We certainly have enough information about the passengers, but how can we do the predictive modeling needed to draw some fairly straightforward conclusions from this data? For example, being a woman, being in 1st class, and being a child were all factors that could boost a passenger's chances of survival during this disaster. In a brute-force approach, for example using if/else statements with some sort of weighted scoring system, you could write a program to predict whether a given passenger would survive the disaster. However, does writing such a program in Python make much sense? Naturally, it would be very tedious to write, difficult to generalize, and would require extensive fine-tuning for each variable and sample (that is, each passenger). This is where predictive analytics with machine learning algorithms and emerging tools comes in, so that you can build a program that learns from sample data to predict whether a given passenger would survive; a minimal sketch of such a model follows below. If you found this post useful and would like to explore more, head over to grab the book Predictive Analytics with TensorFlow, written by Md. Rezaul Karim.
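To show what that learning-based alternative to hand-written if/else rules might look like, here is a small scikit-learn sketch that is not part of the book; it assumes the standard Kaggle column names in titanic_data.csv and uses logistic regression purely for illustration.

# Illustrative only: a model that learns survival patterns from the data
# instead of hand-written if/else rules. Assumes the Kaggle Titanic columns.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('titanic_data.csv')
df['Sex'] = (df['Sex'] == 'female').astype(int)      # simple numeric encoding
df['Age'] = df['Age'].fillna(df['Age'].median())     # fill missing ages

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['Survived'], random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('held-out accuracy:', round(clf.score(X_test, y_test), 3))
print('coefficients:', dict(zip(features, clf.coef_[0].round(2))))

The learned coefficients typically recover the same intuition, higher survival odds for women, children, and 1st-class passengers, without any hand-tuned thresholds.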


Build a generative chatbot using recurrent neural networks (LSTM RNNs)

Savia Lobo
15 Feb 2018
8 min read
In today’s tutorial we will learn to build generative chatbot using recurrent neural networks. The RNN used here is Long Short Term Memory(LSTM). Generative chatbots are very difficult to build and operate. Even today, most workable chatbots are retrieving in nature; they retrieve the best response for the given question based on semantic similarity, intent, and so on. For further reading, refer to the paper Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation by Kyunghyun Cho et. al. (https://arxiv.org/pdf/1406.1078.pdf). [box type="note" align="" class="" width=""]This article is an excerpt from a book written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled Natural Language Processing with Python Cookbook. In this book you will come across various recipes covering natural language understanding, Natural Language Processing, and syntactic analysis.[/box] Getting ready... The A.L.I.C.E Artificial Intelligence Foundation dataset bot.aiml Artificial Intelligence Markup Language (AIML), which is customized syntax such as XML file has been used to train the model. In this file, questions and answers are mapped. For each question, there is a particular answer. Complete .aiml files are available at aiml-en-us-foundation-alice.v1-9 from https://code.google.com/archive/p/aiml-en-us-foundation-alice/downloads. Unzip the folder to see the bot.aiml file and open it using Notepad. Save as bot.txt to read in Python: >>> import os """ First change the following directory link to where all input files do exist """ >>> os.chdir("C:UsersprataDocumentsbook_codesNLP_DL") >>> import numpy as np >>> import pandas as pd # File reading >>> with open('bot.txt', 'r') as content_file: ... botdata = content_file.read() >>> Questions = [] >>> Answers = [] AIML files have unique syntax, similar to XML. The pattern word is used to represent the question and the template word for the answer. Hence, we are extracting respectively: >>> for line in botdata.split("</pattern>"): ... if "<pattern>" in line: ... Quesn = line[line.find("<pattern>")+len("<pattern>"):] ... Questions.append(Quesn.lower()) >>> for line in botdata.split("</template>"): ... if "<template>" in line: ... Ans = line[line.find("<template>")+len("<template>"):] ... Ans = Ans.lower() ... Answers.append(Ans.lower()) >>> QnAdata = pd.DataFrame(np.column_stack([Questions,Answers]),columns = ["Questions","Answers"]) >>> QnAdata["QnAcomb"] = QnAdata["Questions"]+" "+QnAdata["Answers"] >>> print(QnAdata.head()) The question and answers are joined to extract the total vocabulary used in the modeling, as we need to convert all words/characters into numeric representation. The reason is the same as mentioned before—deep learning models can't read English and everything is in numbers for the model. How to do it... After extracting the question-and-answer pairs, the following steps are needed to process the data and produce the results: Preprocessing: Convert the question-and-answer pairs into vectorized format, which will be utilized in model training. Model building and validation: Develop deep learning models and validate the data. Prediction of answers from trained model: The trained model will be used to predict answers for given questions. How it works... 
The question and answers are utilized to create the vocabulary of words to index mapping, which will be utilized for converting words into vector mappings: # Creating Vocabulary >>> import nltk >>> import collections >>> counter = collections.Counter() >>> for i in range(len(QnAdata)): ... for word in nltk.word_tokenize(QnAdata.iloc[i][2]): ... counter[word]+=1 >>> word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())} >>> idx2word = {v:k for k,v in word2idx.items()} >>> idx2word[0] = "PAD" >>> vocab_size = len(word2idx)+1 >>> print (vocab_size) Encoding and decoding functions are used to convert text to indices and indices to text respectively. As we know, Deep learning models work on numeric values rather than text or character data: >>> def encode(sentence, maxlen,vocab_size): ... indices = np.zeros((maxlen, vocab_size)) ... for i, w in enumerate(nltk.word_tokenize(sentence)): ... if i == maxlen: break ... indices[i, word2idx[w]] = 1 ... return indices >>> def decode(indices, calc_argmax=True): ... if calc_argmax: ... indices = np.argmax(indices, axis=-1) ... return ' '.join(idx2word[x] for x in indices) The following code is used to vectorize the question and answers with the given maximum length for both questions and answers. Both might be different lengths. In some pieces of data, the question length is greater than answer length, and in a few cases, it's length is less than answer length. Ideally, the question length is good to catch the right answers. Unfortunately in this case, question length is much less than the answer length, which is a very bad example to develop generative models: >>> question_maxlen = 10 >>> answer_maxlen = 20 >>> def create_questions(question_maxlen,vocab_size): ... question_idx = np.zeros(shape=(len(Questions),question_maxlen, vocab_size)) ... for q in range(len(Questions)): ... question = encode(Questions[q],question_maxlen,vocab_size) ... question_idx[q] = question ... return question_idx >>> quesns_train = create_questions(question_maxlen=question_maxlen, vocab_size=vocab_size) >>> def create_answers(answer_maxlen,vocab_size): ... answer_idx = np.zeros(shape=(len(Answers),answer_maxlen, vocab_size)) ... for q in range(len(Answers)): ... answer = encode(Answers[q],answer_maxlen,vocab_size) ... answer_idx[q] = answer ... return answer_idx >>> answs_train = create_answers(answer_maxlen=answer_maxlen,vocab_size= vocab_size) >>> from keras.layers import Input,Dense,Dropout,Activation >>> from keras.models import Model >>> from keras.layers.recurrent import LSTM >>> from keras.layers.wrappers import Bidirectional >>> from keras.layers import RepeatVector, TimeDistributed, ActivityRegularization The following code is an important part of the chatbot. Here we have used recurrent networks, repeat vector, and time-distributed networks. The repeat vector used to match dimensions of input to output values. 
Whereas time-distributed networks are used to change the column vector to the output dimension's vocabulary size: >>> n_hidden = 128 >>> question_layer = Input(shape=(question_maxlen,vocab_size)) >>> encoder_rnn = LSTM(n_hidden,dropout=0.2,recurrent_dropout=0.2) (question_layer) >>> repeat_encode = RepeatVector(answer_maxlen)(encoder_rnn) >>> dense_layer = TimeDistributed(Dense(vocab_size))(repeat_encode) >>> regularized_layer = ActivityRegularization(l2=1)(dense_layer) >>> softmax_layer = Activation('softmax')(regularized_layer) >>> model = Model([question_layer],[softmax_layer]) >>> model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) >>> print (model.summary()) The following model summary describes the change in flow of model size across the model. The input layer matches the question's dimension and the output matches the answer's dimension: # Model Training >>> quesns_train_2 = quesns_train.astype('float32') >>> answs_train_2 = answs_train.astype('float32') >>> model.fit(quesns_train_2, answs_train_2,batch_size=32,epochs=30, validation_split=0.05) The results are a bit tricky in the following screenshot even though the accuracy is significantly higher. The chatbot model might produce complete nonsense, as most of the words are padding here. The reason? The number of words in this data is less: # Model prediction >>> ans_pred = model.predict(quesns_train_2[0:3]) >>> print (decode(ans_pred[0])) >>> print (decode(ans_pred[1])) The following screenshot depicts the sample output on test data. The output does not seem to make sense, which is an issue with generative models: Our model did not work well in this case, but still some areas of improvement are possible going forward with generative chatbot models. Readers can give it a try: Have a dataset with lengthy questions and answers to catch signals well Create a larger architecture of deep learning models and train over longer iterations Make question-and-answer pairs more generic rather than factoid-based, such as retrieving knowledge and so on, where generative models fail miserably. Here, you saw how to build chatbots using LSTM. You can go ahead and try building one of your own generative chatbots using the example above. If you found this post useful, do check out this book Natural Language Processing with Python Cookbook to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more.  
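As a small follow-up, not part of the recipe itself, the trained model can be wrapped in a helper that encodes a new question and decodes the predicted answer, reusing the encode() and decode() functions defined above; bear in mind that encode() assumes every word is already in the training vocabulary.

# Assumes the encode(), decode(), model, question_maxlen, and vocab_size
# objects defined in the recipe above are already in scope.
import numpy as np

def reply(question):
    # Vectorize the question exactly as the training questions were vectorized.
    # Note: encode() looks words up in word2idx, so unseen words would need to
    # be filtered out or mapped to the PAD index first.
    q = encode(question.lower(), question_maxlen, vocab_size)
    q = np.expand_dims(q, axis=0).astype('float32')   # shape: (1, question_maxlen, vocab_size)
    pred = model.predict(q)                           # shape: (1, answer_maxlen, vocab_size)
    return decode(pred[0])                            # argmax + index-to-word lookup

print(reply("what is your name"))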


How to deploy RethinkDB using Docker

Vijin Boricha
14 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will help you develop efficient and real-time applications in RethinkDB with ease.[/box] In today's tutorial, we will learn how to install Docker, create a Docker image, and deploy RethinkDB using Docker.

Your code is not working in production, but it works on the QA (quality analysis) server? I am sure you have heard statements like these in your team during the deployment phase. Well, no more of that: Dockerize everything and forget about the infrastructure differences between environments, say QA, staging, and production, because your code is going to run in a Docker container rather than directly on those machines; hence, write once, run everywhere. In this section, we will learn how to use Docker to deploy a RethinkDB server or PaaS services. I am going to cover a few Docker basics too; if you are already aware of them, please skip to the next section.

Installing Docker

Docker is available for all major platforms, such as Linux-based distributions, Mac, and Windows. Visit the official website at https://www.docker.com/ and download the package suitable for your platform. We are installing Docker on our machine to create a new Docker image. Docker images are independent of the platform and should not be confused with Docker for Mac or Docker for Windows; the installed package is also referred to as the Docker client. Once you have installed Docker, you need to start the daemon process first; I am using a Mac, so I can view this in the Launchpad, as shown here:

Upon clicking that, it will open up a console showing the Docker official logo and an indication that Docker has booted successfully, as shown in the following screenshot:

Now we can begin creating our Docker image that will, in turn, run RethinkDB.

Creating a Docker image

To install RethinkDB on Ubuntu inside Docker, we first need the Ubuntu operating system. Run the following command to pull a Ubuntu image from the official Docker Hub repository:

docker pull ubuntu

This will download the Ubuntu image to our system. We will later use this Ubuntu image and install our RethinkDB instance on it; you can choose a different operating system as well. Before going to the Docker configuration code, I would like to point out the steps we require to install RethinkDB on a fresh Ubuntu installation:

Update the system
Add the RethinkDB repository to the known repository list
Install RethinkDB
Set the data folder
Expose the ports

We are going to do this using Docker. To create a Docker image, we require a Dockerfile. Create a file called Dockerfile, with no extension, and add the code shown here:

FROM ubuntu:latest

# Install RethinkDB.
RUN apt-get update && \
    echo "deb http://download.rethinkdb.com/apt `lsb_release -cs` main" > /etc/apt/sources.list.d/rethinkdb.list && \
    apt-get install -y wget && \
    wget -O- http://download.rethinkdb.com/apt/pubkey.gpg | apt-key add - && \
    apt-get update && \
    apt-get install -y rethinkdb python-pip && \
    rm -rf /var/lib/apt/lists/*

# Install the Python driver for RethinkDB.
RUN pip install rethinkdb

# Define mountable directories.
VOLUME ["/data"]

# Define working directory.
WORKDIR /data

# Define default command.
CMD ["rethinkdb", "--bind", "all"]

# Expose ports.
# - 8080: web UI
# - 28015: client driver
# - 29015: cluster
EXPOSE 8080
EXPOSE 28015
EXPOSE 29015

The first line is our entry point into the Ubuntu operating system. We then update the system and use the installation commands recommended by RethinkDB at https://www.rethinkdb.com/docs/install/ubuntu/. Once the installation is complete, we install the rethinkdb Python driver so that we can perform import/export operations. The next two instructions mount a new volume in Ubuntu and tell RethinkDB to use that volume. The CMD instruction runs rethinkdb bound to all interfaces, and the EXPOSE instructions expose the ports used by the client driver and the web console. To turn this into a Docker image, save the file and run the following command within the project directory:

docker build -t docker-rethinkdb .

Here, we are building our Docker image and giving it the name docker-rethinkdb; upon running this command, Docker will execute the Dockerfile. The representation of the previous steps is shown here:

Once everything works, and I am sure it will, you will see a success message in the console, as shown here:

Congratulations! You have successfully created a Docker image for RethinkDB. If you want to see your image and its properties, run the following command:

docker images

This will list all the Docker images, as shown in the following screenshot:

Awesome! Now let's run it. To access the web portal, we need to run our Docker image and bind port 8080 of the container to a port on our machine; here is the command to do so:

docker run -p 3000:8080 -d docker-rethinkdb

In the command above, -p specifies the port binding: the first port (3000) is the port on our machine and the second (8080) is the container port, while -d runs the container in the background as a daemon. To extract more information about the running container, we can run the following command:

docker ps

This will list all the running containers, along with their information, as shown in the following screenshot:

You can also check the logs of a specific container using the following command:

docker logs <container id>

Now, in order to access the RethinkDB web console from our machine, we need to find out the IP address on which the Docker machine is running. To get that, we run the following command:

docker-machine ip default

This will print out the IP. Copy the IP and open IP:3000 in the browser to view the RethinkDB web console, as shown here:

So we have Docker running and accessible from the browser. In order to import and export data, we need to log in to our running container. To do that, run the following command:

docker exec -i -t <container-id> /bin/bash

This will log us into the container running Ubuntu; refer to the following screenshot:

You can now run the rethinkdb command to perform a data import into the existing RethinkDB cluster.

Deploying the Docker image

Almost every PaaS service we have covered in earlier sections provides support for Docker. You can commit your Dockerfile to Git and clone it anywhere you want to build the Docker image, or you can push the whole Docker image (not just the Dockerfile) to Docker Hub and pull it directly using the docker pull command, which is no doubt the easier way because you will be working directly with the image that runs on the server. We covered RethinkDB deployment using Docker and learned how to create our own RethinkDB image.
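Because the image already ships with the RethinkDB Python driver, a quick way to sanity-check the container from the host is a short script like the one below. This is a hedged sketch rather than part of the book: it assumes the container is started with the driver port published as well (for example, docker run -p 3000:8080 -p 28015:28015 -d docker-rethinkdb), uses the classic import rethinkdb as r driver API, and the demo database and table names are made up for illustration.

# Hedged sketch: verify the Dockerized RethinkDB instance from the host.
import rethinkdb as r

conn = r.connect(host='localhost', port=28015)   # or the docker-machine IP shown earlier
if 'demo' not in r.db_list().run(conn):
    r.db_create('demo').run(conn)
if 'heartbeat' not in r.db('demo').table_list().run(conn):
    r.db('demo').table_create('heartbeat').run(conn)

r.db('demo').table('heartbeat').insert({'status': 'ok'}).run(conn)
print(list(r.db('demo').table('heartbeat').run(conn)))   # should show the inserted document
conn.close()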
You can learn more about RethinkDB Query Language and Performance Tuning in RethinkDB from this book Mastering RethinkDB.  


Hands on Table Calculation Techniques with Tableau

Amarabha Banerjee
14 Feb 2018
4 min read
[box type="note" align="" class="" width=""]This article is a book excerpt from the title Mastering Tableau written by David Baldwin. This book will help you master the intricacies of Tableau to create effective data visualizations.[/box] Today, we shall explore Table Calculation Techniques with Tableau and explore a real world example of using these techniques. In this article, we provide a simple schema for understanding table calculations. This schema is communicated via two questions: What is the function? How is the function applied? These two questions are inexorably connected. You cannot reliably apply something until you know what it is. And you cannot get useful results from something until you correctly apply it. The sections below will help you to get a head start into Table calculation techniques and how to use Tableau functions effectively for implementing Table calculation. Basics of Table Calculation Calculated fields can be categorized as: row level, aggregate level, and table level. For row- and aggregate-level calculations, the underlying data source engine does most (if not all) of the computational work, and Tableau merely visualizes the results. For table calculations, Tableau also relies on the underlying data source engine to execute computational tasks; however, after that work is completed and a dataset is returned, Tableau performs additional processing before rendering the results. This can be seen within the following process flow diagram, in the circled part titled Tableau performs additional processing. This is where table calculations are processed. We will continue with a definition: A table calculation is a function performed on a dataset in cache that has been generated as a result of a query from Tableau to the data source. Let's consider a couple of points regarding the dataset in cache mentioned in the preceding definition. First, it is important to understand that this cache is not simply the returned results of a query. Tableau may adjust the returned results. For example,, Tableau may expand the cache via data densification. Secondly, it's important to consider how the cache is structured. Basically, the dataset in cache is a table and, like all tables, is made up of rows and columns. This is particularly important for table calculations since a table calculation may be computed as it moves along the cache. Such a table calculation is directional. Alternatively, a table calculation may be computed based on the entire cache with no directional consideration. Table calculations such as this are non-directional. Directional and nondirectional table calculations will be considered more fully in the following section. Directional and non-directional table calculation functions As of Tableau 10, there are 32 table calculation functions in Tableau. However, many of these are simply variations of a theme; for example, there are five Running functions, including RUNNING_SUM and RUNNING_AVG. If we narrow our consideration to unique groups of table calculations functions, we will discover that there are only 11. The following table shows these 11 functions organized in two categories: As mentioned previously, non-directional table calculation functions operate on the entire cache and thus are not computed based on movement through the cache. For example, the SIZE function doesn't change based on the value of a previous row in the cache. On the other hand, RUNNING_SUM does change based on previous rows in the cache and is therefore considered directional. 
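Tableau itself is not scripted here, but the directional/non-directional distinction maps onto a familiar dataframe idiom; the short pandas sketch below is only an analogy (not Tableau code), contrasting a SIZE-style calculation, which ignores row order, with a RUNNING_SUM-style calculation, which depends on it.

# Analogy only: treat the cache as a small table partitioned by Category and
# compare a non-directional calculation with a directional one.
import pandas as pd

cache = pd.DataFrame({
    'Category': ['Furniture', 'Furniture', 'Technology', 'Technology', 'Technology'],
    'Sales':    [100, 250, 80, 120, 300],
})

grouped = cache.groupby('Category')['Sales']
cache['SIZE_like']        = grouped.transform('size')   # non-directional: same value for every row of the partition
cache['RUNNING_SUM_like'] = grouped.cumsum()            # directional: depends on the rows already visited
print(cache)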
Practical Example

In this example, we'll see directional and non-directional table calculation functions in action:

1. Navigate to https://public.tableau.com/profile/david.baldwin#!/ to locate and download the workbook associated with this chapter.
2. Navigate to the worksheet entitled Directional/Non-Directional.
3. Create the following calculated fields:
4. Place Category and Ship Mode on the Rows shelf.
5. Double-click on Sales, Lookup, Size, Window Sum, Window Sum w/Start&End, and Running Sum to populate the view.

Compare the following screenshot with the notes in step 3 of this exercise for better understanding:

We discussed techniques for implementing table calculations with Tableau. If you liked our post, be sure to check out Mastering Tableau, which covers other useful data visualization and data analysis techniques.

Sentiment Analysis of the 2017 US elections on Twitter

Vijin Boricha
14 Feb 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Andrew Morgan and Antoine Amend titled Mastering Spark for Data Science. In this book, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.[/box] In this article, we will learn to extract and analyse large number of tweets related to the 2017 US elections on Twitter. Following the US elections on Twitter On November 8, 2016, American citizens went in millions to polling stations to cast their votes for the next President of the United States. Counting began almost immediately and, although not officially confirmed until sometime later, the forecasted result was well known by the next morning. Let's start our investigation a couple of days before the major event itself, on November 6, 2016, so that we can preserve some context in the run-up. Although we do not exactly know what we will find in advance, we know that Twitter will play an oversized role in the political commentary given its influence in the build-up, and it makes sense to start collecting data as soon as possible. In fact, data scientists may sometimes experience this as a gut feeling - a strange and often exciting notion that compels us to commence working on something without a clear plan or absolute justification, just a sense that it will pay off. And actually, this approach can be vital since, given the normal time required to formulate and realize such a plan and the transient nature of events, a major news event may occur , a new product may have been released, or the stock market may be trending differently; by this time, the original dataset may no longer be available. Acquiring data in stream The first action is to start acquiring Twitter data. As we plan to download more than 48 hours worth of tweets, the code should be robust enough to not fail somewhere in the middle of the process; there is nothing more frustrating than a fatal NullPointerException occurring after many hours of intense processing. We know we will be working on sentiment analysis at some point down the line, but for now we do not wish to over-complicate our code with large dependencies as this can decrease stability and lead to more unchecked exceptions. Instead, we will start by collecting and storing the data and subsequent processing will be done offline on the collected data, rather than applying this logic to the live stream. We create a new Streaming context reading from Twitter 1% firehose using the utility methods. We also use the excellent GSON library to serialize Java class Status (Java class embedding Twitter4J records) to JSON objects. <dependency> <groupId>com.google.code.gson</groupId> <artifactId>gson</artifactId> <version>2.3.1</version> </dependency> We read Twitter data every 5 minutes and have a choice to optionally supply Twitter filters as command line arguments. Filters can be keywords such as Trump, Clinton or #MAGA, #StrongerTogether. However, we must bear in mind that by doing this we may not capture all relevant tweets as we can never be fully up to date with the latest hashtag trends (such as #DumpTrump, #DrainTheSwamp, #LockHerUp, or #LoveTrumpsHate) and many tweets will be overlooked with an inadequate filter, so we will use an empty filter list to ensure that we catch everything. 
val sparkConf = new SparkConf().setAppName("Twitter Extractor") val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Minutes(5)) val filter = args val twitterStream = createTwitterStream(ssc, filter) .mapPartitions { it => val gson = new GsonBuilder().create() it.map { s: Status => Try(gson.toJson(s)).toOption } } We serialize our Status class using the GSON library and persist our JSON objects in HDFS. Note that the serialization occurs within a Try clause to ensure that unwanted exceptions are not thrown. Instead, we return JSON as an optional String: twitterStream .filter(_.isSuccess) .map(_.get) .saveAsTextFiles("/path/to/twitter") Finally, we run our Spark Streaming context and keep it alive until a new president has been elected, no matter what happens! ssc.start() ssc.awaitTermination() Acquiring data in batch Only 1% of tweets are retrieved through the Spark Streaming API, meaning that 99% of records will be discarded. Although able to download around 10 million tweets, we can potentially download more data, but this time only for a selected hashtag and within a small period of time. For example, we can download all tweets related to the #LockHerUp or #BuildTheWall hashtags. The search API For that purpose, we consume Twitter historical data through the twitter4j Java API. This library comes as a transitive dependency of spark-streaming-twitter_2.11. To use it outside of a Spark project, the following maven dependency should be used: <dependency> <groupId>org.twitter4j</groupId> <artifactId>twitter4j-core</artifactId> <version>4.0.4</version> </dependency> We create a Twitter4J client as follows: ConfigurationBuilder builder = new ConfigurationBuilder(); builder.setOAuthConsumerKey(apiKey); builder.setOAuthConsumerSecret(apiSecret); Configuration configuration = builder.build(); AccessToken token = new AccessToken( accessToken, accessTokenSecret ); Twitter twitter = new TwitterFactory(configuration) .getInstance(token); Then, we consume the /search/tweets service through the Query object: Query q = new Query(filter); q.setSince(fromDate); q.setUntil(toDate); q.setCount(400); QueryResult r = twitter.search(q); List<Status> tweets = r.getTweets(); Finally, we get a list of Status objects that can easily be serialized using the GSON library introduced earlier. Rate limit Twitter is a fantastic resource for data science, but it is far from a non-profit organization, and as such, they know how to value and price data. Without any special agreement, the search API is limited to a few days retrospective, a maximum of 180 queries per 15 minute window and 450 records per query. This limit can be confirmed on both the Twitter DEV website (https://dev.twitter.com/rest/public/rate-limits) and from the API itself using the RateLimitStatus class: Map<String, RateLimitStatus> rls = twitter.getRateLimitStatus("search"); System.out.println(rls.get("/search/tweets")); /* RateLimitStatusJSONImpl{remaining=179, limit=180, resetTimeInSeconds=1482102697, secondsUntilReset=873} */ Unsurprisingly, any queries on popular terms, such as #MAGA on November 9, 2016, hit this threshold. To avoid a rate limit exception, we have to page and throttle our download requests by keeping track of the maximum number of tweet IDs processed and monitor our status limit after each search request. 
RateLimitStatus strl = rls.get("/search/tweets"); int totalTweets = 0; long maxID = -1; for (int i = 0; i < 400; i++) { // throttling if (strl.getRemaining() == 0) Thread.sleep(strl.getSecondsUntilReset() * 1000L); Query q = new Query(filter); q.setSince(fromDate); q.setUntil(toDate); q.setCount(100); // paging if (maxID != -1) q.setMaxId(maxID - 1); QueryResult r = twitter.search(q); for (Status s: r.getTweets()) { totalTweets++; if (maxID == -1 || s.getId() < maxID) maxID = s.getId(); writer.println(gson.toJson(s)); } strl = r.getRateLimitStatus(); } With around half a billion tweets a day, it will be optimistic, if not Naive, to gather all USrelated data. Instead, the simple ingest process detailed earlier should be used to intercept tweets matching specific queries only. Packaged as main class in an assembly jar, it can be executed as follows: java -Dtwitter.properties=twitter.properties / -jar trump-1.0.jar #maga 2016-11-08 2016-11-09 / /path/to/twitter-maga.json Here, the twitter.properties file contains your Twitter API keys: twitter.token = XXXXXXXXXXXXXX twitter.token.secret = XXXXXXXXXXXXXX twitter.api.key = XXXXXXXXXXXXXX twitter.api.secret = XXXXXXXXXXXXXX Analysing sentiment After 4 days of intense processing, we extracted around 10 million tweets;representing approximately 30 GB worth of JSON data. Massaging Twitter data One of the key reasons Twitter became so popular is that any message has to fit into a maximum of 140 characters. The drawback is also that every message has to fit into a maximum of 140 characters! Hence, the result is massive increase in the use of abbreviations, acronyms, slang words, emoticons, and hashtags. In this case, the main emotion may no longer come from the text itself, but rather from the emoticons used (http://dl.acm.org/citation.cfm?id=1628969), though some studies showed that the emoticons may sometimes lead to inadequate predictions in sentiments (https://arxiv.org/pdf/1511.02556.pdf). Emojis are even broader than emoticons as they include pictures of animals, transportation, business icons, and so on. Also, while emoticons can easily be retrieved through simple regular expressions, emojis are usually encoded in Unicode and are more difficult to extract without a dedicated library. <dependency> <groupId>com.kcthota</groupId> <artifactId>emoji4j</artifactId> <version>5.0</version> </dependency> The Emoji4J library is easy to use (although computationally expensive) and given some text with emojis/emoticons, we can either codify – replace Unicode values with actual code names – or clean – simply remove any emojis. So firstly, let's clean our text from any junk (special characters, emojis, accents, URLs, and so on) to access plain English content: import emoji4j.EmojiUtils def clean = { var text = tweet.toLowerCase() text = text.replaceAll("https?://S+", "") text = StringUtils.stripAccents(text) EmojiUtils.removeAllEmojis(text) .trim .toLowerCase() .replaceAll("rts+", "") .replaceAll("@[wd-_]+", "") .replaceAll("[^w#[]:'.!?,]+", " ") .replaceAll("s+([:'.!?,])1", "$1") .replaceAll("[st]+", " ") .replaceAll("[rn]+", ". 
") .replaceAll("(w)1{2,}", "$1$1") // avoid looooool .replaceAll("#W", "") .replaceAll("[#':,;.]$", "") .trim } Let's also codify and extract all emojis and emoticons and keep them aside as a list: val eR = "(:w+:)".r def emojis = { var text = tweet.toLowerCase() text = text.replaceAll("https?://S+", "") eR.findAllMatchIn(EmojiUtils.shortCodify(text)) .map(_.group(1)) .filter { emoji => EmojiUtils.isEmoji(emoji) }.map(_.replaceAll("W", "")) .toArray } Writing these methods inside an implicit class means that they can be applied directly a String through a simple import statement. Using the Stanford NLP Our next step is to pass our cleaned text through a Sentiment Annotator. We use the Stanford NLP library for that purpose: <dependency> <groupId>edu.stanford.nlp</groupId> <artifactId>stanford-corenlp</artifactId> <version>3.5.0</version> <classifier>models</classifier> </dependency> <dependency> <groupId>edu.stanford.nlp</groupId> <artifactId>stanford-corenlp</artifactId> <version>3.5.0</version> </dependency> We create a Stanford annotator that tokenizes content into sentences (tokenize), splits sentences (ssplit), tags elements (pos), and lemmatizes each word (lemma) before analyzing the overall sentiment: def getAnnotator: StanfordCoreNLP = { val p = new Properties() p.setProperty( "annotators", "tokenize, ssplit, pos, lemma, parse, sentiment" ) new StanfordCoreNLP(pipelineProps) } def lemmatize(text: String, annotator: StanfordCoreNLP = getAnnotator) = { val annotation = annotator.process(text.clean) val sentences = annotation.get(classOf[SentencesAnnotation]) sentences.flatMap { sentence => sentence.get(classOf[TokensAnnotation]) .map { token => token.get(classOf[LemmaAnnotation]) } .mkString(" ") } val text = "If you're bashing Trump and his voters and calling them a variety of hateful names, aren't you doing exactly what you accuse them?" println(lemmatize(text)) /* if you be bash trump and he voter and call they a variety of hateful name, be not you do exactly what you accuse they */ Any word is replaced by its most basic form, that is, you're is replaced with you be and aren't you doing replaced with be not you do. def sentiment(coreMap: CoreMap) = { coreMap.get(classOf[SentimentCoreAnnotations.ClassName].match { case "Very negative" => 0 case "Negative" => 1 case "Neutral" => 2 case "Positive" => 3 case "Very positive" => 4 case _ => throw new IllegalArgumentException( s"Could not get sentiment for [${coreMap.toString}]" ) } } def extractSentiment(text: String, annotator: StanfordCoreNLP = getSentimentAnnotator) = { val annotation = annotator.process(text) val sentences = annotation.get(classOf[SentencesAnnotation]) val totalScore = sentences map sentiment if (sentences.nonEmpty) { totalScore.sum / sentences.size() } else { 2.0f } } extractSentiment("God bless America. Thank you Donald Trump!") // 2.5 extractSentiment("This is the most horrible day ever") // 1.0 A sentiment spans from Very Negative (0.0) to Very Positive (4.0) and is averaged per sentence. As we do not get more than 1 or 2 sentences per tweet, we expect a very small variance; most of the tweets should be Neutral (around 2.0), with only extremes to be scored (below ~1.5 or above ~2.5). 
Building the Pipeline
For each of our Twitter records (stored as JSON objects), we do the following things:
Parse the JSON object using the json4s library
Extract the date
Extract the text
Extract the location and map it to a US state
Clean the text
Extract emojis
Lemmatize text
Analyze sentiment
We then wrap all these values into the following Tweet case class:
case class Tweet(
  date: Long,
  body: String,
  sentiment: Float,
  state: Option[String],
  geoHash: Option[String],
  emojis: Array[String]
)
As mentioned in previous chapters, creating a new NLP instance for each record out of our dataset of 10 million records wouldn't scale. Instead, we create only one annotator per Iterator (which means one per partition):
val analyzeJson = (it: Iterator[String]) => {
  implicit val format = DefaultFormats
  val annotator = getAnnotator
  val sdf = new SimpleDateFormat("MMM d, yyyy hh:mm:ss a")
  it.map { tweet =>
    val json = parse(tweet)
    val dateStr = (json \ "createdAt").extract[String]
    val date = Try(
      sdf.parse(dateStr).getTime
    ).getOrElse(0L)
    val text = (json \ "text").extract[String]
    val location = Try(
      (json \ "user" \ "location").extract[String]
    ).getOrElse("")
     .toLowerCase()
    val state = Try {
      location.split("\\s")
        .map(_.toUpperCase())
        .filter { s => states.contains(s) } // states: set of US state codes defined earlier in the chapter
        .head
    }.toOption
    val cleaned = text.clean
    Tweet(
      date,
      cleaned.lemmatize(annotator),
      cleaned.sentiment(annotator),
      state,
      None, // geoHash is not derived in this excerpt
      text.emojis
    )
  }
}

val tweetJsonRDD = sc.textFile("/path/to/twitter")
val tweetRDD = tweetJsonRDD mapPartitions analyzeJson
tweetRDD.toDF().show(5)
/*
+-------------+---------------+---------+--------+----------+
|         date|           body|sentiment|   state|    emojis|
+-------------+---------------+---------+--------+----------+
|1478557859000|happy halloween|      2.0|    None|   [ghost]|
|1478557860000|slave to the gr|      2.5|    None|        []|
|1478557862000|why be he so pe|      3.0|Some(MD)|        []|
|1478557862000|marcador sentim|      2.0|    None|        []|
|1478557868000|you mindset tow|      2.0|    None|[sparkles]|
+-------------+---------------+---------+--------+----------+
*/
We have learnt about gathering sentiment data from Twitter. To go further into data acquisition and scraping link-based external data, check out Mastering Spark for Data Science.
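As a small, purely illustrative follow-up (not part of the original chapter), the resulting DataFrame can be aggregated to rank states by average sentiment. The column names come from the Tweet case class above; everything else in this sketch is an assumption:
import org.apache.spark.sql.functions.{avg, col, desc}

// Illustrative only: average sentiment per mapped US state, highest first.
val sentimentByState = tweetRDD.toDF()
  .filter(col("state").isNotNull)
  .groupBy("state")
  .agg(avg("sentiment").as("avgSentiment"))
  .orderBy(desc("avgSentiment"))

sentimentByState.show(10)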

Using Logistic regression to predict market direction in algorithmic trading

Richa Tripathi
14 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Dr. Param Jeet and Prashant Vats titled Learning Quantitative Finance with R. This book will help you learn about various algorithmic trading techniques and ways to optimize them using the tools available in R.[/box] In this tutorial we will learn how logistic regression is used to forecast market direction. Market direction is very important for investors or traders. Predicting market direction is quite a challenging task as market data involves lots of noise. The market moves either upward or downward and the nature of market movement is binary. A logistic regression model help us to fit a model using binary behavior and forecast market direction. Logistic regression is one of the probabilistic models which assigns probability to each event. We are going to use the quantmod package. The next three commands are used for loading the package into the workspace, importing data into R from the yahoo repository and extracting only the closing price from the data: >library("quantmod") >getSymbols("^DJI",src="yahoo") >dji<- DJI[,"DJI.Close"] The input data to the logistic regression is constructed using different indicators, such as moving average, standard deviation, RSI, MACD, Bollinger Bands, and so on, which has some predictive power in market direction, that is, Up or Down. These indicators can be constructed using the following commands: >avg10<- rollapply(dji,10,mean) >avg20<- rollapply(dji,20,mean) >std10<- rollapply(dji,10,sd) >std20<- rollapply(dji,20,sd) >rsi5<- RSI(dji,5,"SMA") >rsi14<- RSI(dji,14,"SMA") >macd12269<- MACD(dji,12,26,9,"SMA") >macd7205<- MACD(dji,7,20,5,"SMA") >bbands<- BBands(dji,20,"SMA",2) The following commands are to create variable direction with either Up direction (1) or Down direction (0). Up direction is created when the current price is greater than the 20 days previous price and Down direction is created when the current price is less than the 20 days previous price: >direction<- NULL >direction[dji> Lag(dji,20)] <- 1 >direction[dji< Lag(dji,20)] <- 0 Now we have to bind all columns consisting of price and indicators, which is shown in the following command: >dji<- cbind(dji,avg10,avg20,std10,std20,rsi5,rsi14,macd12269,macd7205,bbands,dire ction) The dimension of the dji object can be calculated using dim(). I used dim() over dji and saved the output in dm(). dm() has two values stored: the first value is the number of rows and the second value is the number of columns in dji. Column names can be extracted using colnames(). The third command is used to extract the name for the last column. Next I replaced the column name with a particular name, Direction: >dm<- dim(dji) >dm [1] 2493   16 >colnames(dji)[dm[2]] [1] "..11" >colnames(dji)[dm[2]] <- "Direction" >colnames(dji)[dm[2]] [1] "Direction" We have extracted the Dow Jones Index (DJI) data into the R workspace. Now, to implement logistic regression, we should divide the data into two parts. The first part is in- sample data and the second part is out-sample data. In-sample data is used for the model building process and out-sample data is used for evaluation purposes. This process also helps to control the variance and bias in the model. 
The next four lines are for in-sample start, in-sample end, out-sample start, and out-sample end dates:
>issd<- "2010-01-01"
>ised<- "2014-12-31"
>ossd<- "2015-01-01"
>osed<- "2015-12-31"
The following two commands are to get the row numbers for the dates, that is, the variable isrow extracts row numbers for the in-sample date range and osrow extracts the row numbers for the out-sample date range:
>isrow<- which(index(dji) >= issd & index(dji) <= ised)
>osrow<- which(index(dji) >= ossd & index(dji) <= osed)
The variables isdji and osdji are the in-sample and out-sample datasets respectively:
>isdji<- dji[isrow,]
>osdji<- dji[osrow,]
If you look at the in-sample data, that is, isdji, you will realize that the scaling of each column is different: a few columns are in the scale of 100, a few others are in the scale of 10,000, and a few others are in the scale of 1. Differences in scaling can put your results in trouble as higher weights are being assigned to higher scaled variables. So before moving ahead, you should consider standardizing the dataset. I will use the following formula (6.1):
standardized data = (data - mean) / standard deviation
The mean and standard deviation of each column can be computed using apply(), as seen here:
>isme<- apply(isdji,2,mean)
>isstd<- apply(isdji,2,sd)
A matrix of ones with dimensions equal to the in-sample data is generated using the following command, which is going to be used for normalization:
>isidn<- matrix(1,dim(isdji)[1],dim(isdji)[2])
Use formula 6.1 to standardize the data:
>norm_isdji<-  (isdji - t(isme*t(isidn))) / t(isstd*t(isidn))
The preceding line also standardizes the direction column, that is, the last column. We don't want direction to be standardized, so I replace the last column again with the variable direction for the in-sample data range:
>dm<- dim(isdji)
>norm_isdji[,dm[2]] <- direction[isrow]
Now we have created all the data required for model building. You should build a logistic regression model and it will help you to predict market direction based on in-sample data. First, in this step, I created a formula which has direction as dependent and all other columns as independent variables. Then I used a generalized linear model, that is, glm(), to fit a model which has formula, family, and dataset:
>formula<- paste("Direction ~ .",sep="")
>model<- glm(formula,family="binomial",norm_isdji)
A summary of the model can be viewed using the following command:
>summary(model)
Next use predict() to fit values on the same dataset to estimate the best fitted value:
>pred<- predict(model,norm_isdji)
Once you have fitted the values, you should try to convert them to probabilities using the following command. This will convert the output into probabilistic form and the output will be in the range [0,1]:
>prob<- 1 / (1+exp(-(pred)))
The figure shown below is plotted using the following commands.
The first line of the code shows that we divide the figure into two rows and one column, where the first figure is for prediction of the model and the second figure is for probability: >par(mfrow=c(2,1)) >plot(pred,type="l") >plot(prob,type="l") head() can be used to look at the first few values of the variable: >head(prob) 2010-01-042010-01-05 2010-01-06 2010-01-07 0.8019197  0.4610468  0.7397603  0.9821293 The following figure shows the above-defined variable pred, which is a real number, and its conversion between 0 and 1, which represents probability, that is, prob, using the preceding transformation: Figure 6.1: Prediction and probability distribution of DJI As probabilities are in the range of (0,1) so is our vector prob. Now, to classify them as one of the two classes, I considered Up direction (1) when prob is greater than 0.5 and Down direction (0) when prob is less than 0.5. This assignment can be done using the following commands. prob> 0.5 generate true for points where it is greater and pred_direction[prob> 0.5] assigns 1 to all such points. Similarly, the next statement shows assignment 0 when probability is less than or equal to 0.5: >pred_direction<- NULL >pred_direction[prob> 0.5] <- 1 >pred_direction[prob<= 0.5] <- 0 Once we have figured out the predicted direction, we should check model accuracy: how much our model has predicted Up direction as Up direction and Down as Down. There might be some scenarios where it predicted the opposite of what it is, such as predicting down when it is actually Up and vice versa. We can use the caret package to calculate confusionMatrix(), which gives a matrix as an output. All diagonal elements are correctly predicted and off-diagonal elements are errors or wrongly predicted. One should aim to reduce the off-diagonal elements in a confusion matrix: >install.packages('caret') >library(caret) >matrix<- confusionMatrix(pred_direction,norm_isdji$Direction) >matrix Confusion Matrix and Statistics Reference Prediction               0                     1 0            362                    35 1             42                   819 Accuracy : 0.9388        95% CI : (0.9241, 0.9514) No Information Rate : 0.6789   P-Value [Acc>NIR] : <2e-16 Kappa : 0.859                        Mcnemar's Test P-Value : 0.4941 Sensitivity : 0.8960                        Specificity : 0.9590 PosPredValue : 0.9118   NegPred Value : 0.9512 Prevalence : 0.3211                          Detection Rate : 0.2878 Detection Prevalence : 0.3156 Balanced Accuracy : 0.9275 The preceding table shows we have got 94% correct prediction, as 362+819 = 1181 are correct predictions out of 1258 (sum of all four values). Prediction above 80% over in-sample data is generally assumed good prediction; however, 80% is not fixed, one has to figure out this value based on the dataset and industry. Now you have implemented the logistic regression model, which has predicted 94% correctly, and need to test it for generalization power. One should test this model using out-sample data and test its accuracy. The first step is to standardize the out-sample data using formula (6.1). 
Here mean and standard deviations should be the same as those used for in-sample normalization: >osidn<- matrix(1,dim(osdji)[1],dim(osdji)[2]) >norm_osdji<-  (osdji - t(isme*t(osidn))) / t(isstd*t(osidn)) >norm_osdji[,dm[2]] <- direction[osrow] Next we use predict() on the out-sample data and use this value to calculate probability: >ospred<- predict(model,norm_osdji) >osprob<- 1 / (1+exp(-(ospred))) Once probabilities are determined for the out-sample data, you should put it into either Up or Down classes using the following commands. ConfusionMatrix() here will generate a matrix for the out-sample data: >ospred_direction<- NULL >ospred_direction[osprob> 0.5] <- 1 >ospred_direction[osprob<= 0.5] <- 0 >osmatrix<- confusionMatrix(ospred_direction,norm_osdji$Direction) >osmatrix Confusion Matrix and Statistics Reference Prediction            0                       1 0          115                     26 1           12                     99 Accuracy : 0.8492       95% CI : (0.7989, 0.891) This shows 85% accuracy on the out-sample data. A realistic trading model also accounts for trading cost and market slippage, which decrease the winning odds significantly. We presented advanced techniques implemented in capital markets and also learned logistic regression model using binary behavior to forecast market direction. If you enjoyed this excerpt, check out the book  Learning Quantitative Finance with R to deep dive into the vast world of algorithmic and machine-learning based trading.    
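As a quick cross-check of the out-of-sample figure above (this snippet is not part of the original excerpt), the same accuracy can be recomputed without caret, assuming ospred_direction and norm_osdji are still in the workspace:
>actual<- as.numeric(norm_osdji$Direction)
>tab<- table(Predicted = ospred_direction, Actual = actual)
>sum(diag(tab)) / sum(tab)
This returns approximately 0.85, matching the confusionMatrix() output.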

How to perform Data exploration with RethinkDB

Vijin Boricha
14 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will let you master the capabilities of RethinkDB and implement them to develop efficient real-time web applications.[/box] In this article, we will learn to do data exploration in RethinkDB with the help of few use case. Executing data exploration use cases We have imported our database i.e. our mock data into our RethinkDB instance. Now it's time to run a use case query and make use of it. But before we do so, we need to figure out one data alteration. We have made a mistake while generating mock data (on purpose actually) we have a $ sign before ctc. Hence, it becomes tough to perform salary-level queries. Before we move ahead, we need to figure out this problem, and basically get rid of the $ sign and update the ctc value to an integer instead of a string. In order to do this, we need to perform the following operation: Traverse through each document in the database Split the ctc string into two parts, containing $ and the other value Update the ctc value in the document with a new data type and value Since we require the chaining of queries, I have written a small snippet in Node.js to achieve the previous scenario as follows: var rethinkdb = require('rethinkdb'); var connection = null; rethinkdb.connect({host : 'localhost', port : 28015},function(err,conn) { if(err) { throw new Error('Connection error'); } connection = conn; rethinkdb.db("company").table("employees") .run(connection,function(err,cursor) { if(err) { throw new Error(err); } cursor.each(function(err,data) { data.ctc = parseInt(data.ctc.split("$")[1]); rethinkdb.db("company").table("employees") .get(data.id) .update({ctc : data.ctc}) .run(connection,function(err,response) { if(err) { throw new Error(err); } console.log(response); }); }); }); }); As you can see in the preceding code, we first fetch all the documents and traverse them using cursor, one document at a time. We use the split() method as a $ separator and convert the outcome, which is salary, into an integer using the parseInt() method. We update each document at a time using the id value of the document: After selecting all the documents again, we can see an updated ctc value as an integer, as shown in the following figure: This is one of the practical examples where we perform some data manipulation before moving ahead with complex queries. Similarly, you can look for errors such as blank spaces in a specific field or duplicate elements in your record. Finding duplicate elements We can use distinct() to find out whether there is any duplicate element present in the table. Say you have 1,000 rows and there are 10 duplicates. In order to determine that, we just need to find out the unique rows (of course excluding the ID key, as that's unique by nature). Here is the query for the same: r.db("company").table('employees').without('id').distinct().count() As shown in the following screenshot, this query returns the count of unique rows, which should be 1,000 if there are no duplicates: This implies that our records contain no duplicate documents. Finding the list of countries We can write a query to find all the countries we have in our record and also use distinct again by just selecting the country field. 
Here is the query:
r.db("company").table('employees')("country").distinct()
As shown in this image, we have 124 countries in our records:
Finding the top 10 employees with the highest salary
In this use case, we need to evaluate all the records and find the top 10 employees, from highest to lowest pay. Here is the query for the same:
r.db("company").table("employees").orderBy(r.desc("ctc")).limit(10)
Here we are using orderBy, which by default orders the records in ascending order. To get the highest pay in the first document, we need to use descending ordering; we did it using the desc() ReQL command. As shown in the following image, the query returns 10 rows:
You can modify the same query, just by limiting the number of users to one, to get the highest-paid employee.
Displaying employee records with a specific name and location
To extract such records from our table, we need to again perform a filter on the "first_name" and "country" fields. Here is the query to return those records:
r.db("company").table('employees').filter({"first_name" : "John","country" : "Sweden"})
We are just performing a basic filter and comparing both fields. Such queries are easy to express in ReQL thanks to its chaining feature. After executing the preceding query, we get the following output:
To summarize, we looked over a few use cases where we had to perform alteration and filtering of records in order to meet an exploration task, like stripping the $ sign from ctc, or converting base-256 IP addresses into base-10 values and then performing a query on them. We also covered a general use case in order to get a practical feel of ReQL. If you are interested in learning about the RethinkDB Query Language, extending RethinkDB, and more, you may check out the book Mastering RethinkDB.
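The base-256 IP conversion mentioned in the summary is not shown in this excerpt, so here is a small illustrative helper in Node.js; the function name, table name, and index below are assumptions, not code from the book:
// Convert a dotted IPv4 string to its base-10 integer form before storing or querying it.
function ipToInt(ip) {
  return ip.split('.').reduce(function (acc, octet) {
    return acc * 256 + parseInt(octet, 10);
  }, 0);
}

console.log(ipToInt('192.168.1.10')); // 3232235786
Once the addresses are stored as integers, range queries such as r.table('logs').between(lowIp, highIp, {index: 'ip'}) become straightforward.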

Getting started with Data Visualization in Tableau

Amarabha Banerjee
13 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an book extract from Mastering Tableau, written by David Baldwin. Tableau has emerged as one of the most popular Business Intelligence solutions in recent times, thanks to its powerful and interactive data visualization capabilities. This book will empower you to become a master in Tableau by exploiting the many new features introduced in Tableau 10.0.[/box] In today’s post, we shall explore data visualization basics with Tableau and explore a real world example using these techniques. Tableau Software has a focused vision resulting in a small product line. The main product (and hence the center of the Tableau universe) is Tableau Desktop. Assuming you are a Tableau author, that's where almost all your time will be spent when working with Tableau. But of course you must be able to connect to data and output the results. Thus, as shown in the following figure, the Tableau universe encompasses data sources, Tableau Desktop, and output channels, which include the Tableau Server family and Tableau Reader: Worksheet and dashboard creation At the heart of Tableau are worksheets and dashboards. Worksheets contain individual visualizations and dashboards contain one or more worksheets. Additionally, worksheets and dashboards may be combined into stories to communicate specific insights to the end user via a presentation environment. Lastly, all worksheets, dashboards, and stories are organized in workbooks that can be accessed via Tableau Desktop, Server, or Reader. In this section, we will look at worksheet and dashboard creation with the intent of not only communicating the basics, but also providing some insight that may prove helpful to even more seasoned Tableau authors. Worksheet creation At the most fundamental level, a visualization in Tableau is created by placing one or more fields on one or more shelves. To state this as a pseudo-equation: Field(s) + shelf(s) = Viz As an example, note that the visualization created in the following screenshot is generated by placing the Sales field on the Text shelf. Although the results are quite simple – a single number – this does qualify as a view. In other words, a Field (Sales) placed on a shelf (Text) has generated a Viz: Exercise – fundamentals of visualizations Let's explore the basics of creating a visualization via an exercise: Navigate to h t t p s ://p u b l i c . t a b l e a u . c o m /p r o f i l e /d a v i d 1. . b a l d w i n #!/ to locate and download the workbook associated with this chapter. In the workbook, find the tab labeled Fundamentals of Visualizations: Locate Region within the Dimensions portion of the Data pane: Drag Region to the Color shelf; that is, Region + Color shelf = what is shown in the following screenshot: Click on the Color shelf and then on Edit Colors… to adjust colors as desired: Next, move Region to the Size, Label/Text, Detail, Columns, and Rows shelves. After placing Region on each shelf, click on the shelf to access additional options. Lastly, choose other fields to drop on various shelves to continue exploring Tableau's behavior. As you continue exploring Tableau's behavior by dragging and dropping different fields onto different shelves, you will notice that Tableau responds with default behaviors. These defaults, however, can be overridden, which we will explore in the following section. Dashboard Creation Although, as stated previously, a dashboard contains one or more worksheets, dashboards are much more than static presentations. 
They are an essential part of Tableau's interactivity. In this section, we will populate a dashboard with worksheets and then deploy actions for interactivity. Exercise – building a dashboard In the workbook associated with this chapter, navigate to the tab entitled Building a Dashboard. Within the Dashboard pane located on the left-hand portion of the screen, double-click on each of the following worksheets (in the order in which they are listed) to add them to the dashboard: US Sales Customer Segment Scatter Plot Customers In the lower right-hand corner of the dashboard, click in the blank area below Profit Ratio to select the vertical container: After clicking in the blank area, you should see a blue border around the filter and the legends. This indicates that the vertical container is selected. As shown in the following screenshot, select the vertical container handle and drag it to the left-hand side of the Customers worksheet. Note the gray shading, which communicates where the container will be placed: The gray shading (provided by Tableau when dragging elements such as worksheets and containers onto a dashboard) helpfully communicates where the element will be placed. Take your time and observe carefully when placing an element on a dashboard or the results may be Unexpected. 5. Format the dashboard as desired. The following tips may prove helpful: Adjust the sizes of the elements on the screen by hovering over the edges between each element and then clicking and dragging as Desired. 2. Note that the Sales and Profit legends in the following screenshot are floating elements. Make an element float by right-clicking on the element handle and selecting Floating. (See the previous screenshot and note that the handle is located immediately above Region, in the upper-right-hand corner). 3. Create Horizontal and Vertical containers by dragging those objects from the bottom portion of the Dashboard pane. 4. Drag the edges of containers to adjust the size of each worksheet. 5. Display the dashboard title via Dashboard | Show Title…: If you enjoyed our post, be sure to check out Mastering Tableau which consists of many useful data visualization and data analysis techniques.  

What Tableau Data Handling Engine has to offer

Amarabha Banerjee
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is taken from the book Mastering Tableau, written by David Baldwin. This book will equip you with all the information needed to create effective dashboards and data visualization solutions using Tableau.[/box] In today’s tutorial, we shall explore the Tableau data handling engine and a real world example of how to use it. Tableau's data-handling engine is usually not well comprehended by even advanced authors because it's not an overt part of day-to-day activities; however, for the author who wants to truly grasp how to ready data for Tableau, this understanding is indispensable. In this section, we will explore Tableau's data-handling engine and how it enables structured yet organic data mining processes in the enterprise. To begin, let's clarify a term. The phrase Data-Handling Engine (DHE) in this context references how Tableau interfaces with and processes data. This interfacing and processing is comprised of three major parts: Connection, Metadata, and VizQL. Each part is described in detail in the following section. In other publications, Tableau's DHE may be referred to as a metadata model or the Tableau infrastructure. I've elected not to use either term because each is frequently defined differently in different contexts, which can be quite confusing. Tableau's DHE (that is, the engine for interfacing with and processing data) differs from other broadly considered solutions in the marketplace. Legacy business intelligence solutions often start with structuring the data for an entire enterprise. Data sources are identified, connections are established, metadata is defined, a model is created, and more. The upfront challenges this approach presents are obvious: highly skilled professionals, time-intensive rollout, and associated high startup costs. The payoff is a scalable, structured solution with detailed documentation and process control. Many next generation business intelligence platforms claim to minimize or completely do away with the need for structuring data. The upfront challenges are minimized: specialized skillsets are not required and the rollout time and associated startup costs are low. However, the initial honeymoon is short-lived, since the total cost of ownership advances significantly when difficulties are encountered trying to maintain and scale the solution. Tableau's infrastructure represents a hybrid approach, which attempts to combine the advantages of legacy business intelligence solutions with those of next-generation platforms, while minimizing the shortcomings of both. The philosophical underpinnings of Tableau's hybrid approach include the following: Infrastructure present in current systems should be utilized when advantageous Data models should be accessible by Tableau but not required DHE components as represented in Tableau should be easy to modify DHE components should be adjustable by business users The Tableau Data-Handling Engine The preceding diagram shows that the DHE consists of a run time module (VizQL) and two layers of abstraction (Metadata and Connection). Let's begin at the bottom of the graphic by considering the first layer of abstraction, Connection. The most fundamental aspect of the Connection is a path to the data source. The path should include attributes for the database, tables, and views as applicable. The Connection may also include joins, custom SQL, data-source filters, and more. 
In keeping with Tableau's philosophy of easy to modify and adjustable by business users (see the previous section), each of these aspects of the Connection is easily modifiable. For example, an author may choose to add an additional table to a join or modify a data-source filter. Note that the Connection does not contain any of the actual data. Although an author may choose to create a data extract based on data accessed by the Connection, that extract is separate from the connection. The next layer of abstraction is the metadata. The most fundamental aspect of the Metadata layer is the determination of each field as a measure or dimension. When connecting to relational data, Tableau makes the measure/dimension determination based on heuristics that consider the data itself as well as the data source's data types. Other aspects of the metadata include aliases, data types, defaults, roles, and more. Additionally, the Metadata layer encompasses author-generated fields such as calculations, sets, groups, hierarchies, bins, and so on. Because the Metadata layer is completely separate from the Connection layer, it can be used with other Connection layers; that is, the same metadata definitions can be used with different data sources. VizQL is generated when a user places a field on a shelf. The VizQL is then translated into Structured Query Language (SQL), Multidimensional Expressions (MDX), or Tableau Query Language (TQL) and passed to the backend data source via a driver. The following two aspects of the VizQL module are of primary importance:
VizQL allows the author to change field attributions on the fly
VizQL enables table calculations
Let's consider each of these aspects of VizQL via examples:
Changing field attribution example
An analyst is considering infant mortality rates around the world. Using data from http://data.worldbank.org/, they create the following worksheet by placing AVG(Infant Mortality Rate) and Country on the Columns and Rows shelves, respectively. AVG(Infant Mortality Rate) is, of course, treated as a measure in this case: Next they create a second worksheet to analyze the relationship between Infant Mortality Rate and Health Exp/Capita (that is, health expenditure per capita). In order to accomplish this, they define Infant Mortality Rate as a dimension, as shown in the following screenshot: Studying the SQL generated by VizQL to create the preceding visualization is particularly insightful:
SELECT ['World Indicators$'].[Infant Mortality Rate] AS [Infant Mortality Rate],
  AVG(['World Indicators$'].[Health Exp/Capita]) AS [avg:Health Exp/Capita:ok]
FROM [dbo].['World Indicators$'] ['World Indicators$']
GROUP BY ['World Indicators$'].[Infant Mortality Rate]
The Group By clause clearly communicates that Infant Mortality Rate is treated as a dimension. The takeaway is to note that VizQL enabled the analyst to change the field usage from measure to dimension without adjusting the source metadata. This on-the-fly ability enables creative exploration of the data not possible with other tools and avoids lengthy exercises attempting to define all possible uses for each field. If you liked our article, be sure to check out Mastering Tableau which consists of more useful data visualization and data analysis techniques.
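For contrast, and purely as an illustration (the query below is hypothetical, not actual Tableau output), leaving Infant Mortality Rate as a measure would produce an aggregate-only statement along these lines, with no Group By clause:
SELECT AVG(['World Indicators$'].[Infant Mortality Rate]) AS [avg:Infant Mortality Rate:ok],
  AVG(['World Indicators$'].[Health Exp/Capita]) AS [avg:Health Exp/Capita:ok]
FROM [dbo].['World Indicators$'] ['World Indicators$']
The absence of a Group By on Infant Mortality Rate is what marks it as a measure rather than a dimension.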

How to Build a music recommendation system with PageRank Algorithm

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering Spark for Data Science written by Andrew Morgan and Antoine Amend. In this book, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms to scale them linearly.[/box] In today’s tutorial, we will learn to build a recommender with PageRank algorithm. The PageRank algorithm Instead of recommending a specific song, we will recommend playlists. A playlist would consist of a list of all our songs ranked by relevance, most to least relevant. Let's begin with the assumption that people listen to music in a similar way to how they browse articles on the web, that is, following a logical path from link to link, but occasionally switching direction, or teleporting, and browsing to a totally different website. Continuing with the analogy, while listening to music one can either carry on listening to music of a similar style (and hence follow their most expected journey), or skip to a random song in a totally different genre. It turns out that this is exactly how Google ranks websites by popularity using a PageRank algorithm. For more details on the PageRank algorithm visit: h t t p ://i l p u b s . s t a n f o r d . e d u :8090/422/1/1999- 66. p d f . The popularity of a website is measured by the number of links it points to (and is referred from). In our music use case, the popularity is built as the number hashes a given song shares with all its neighbors. Instead of popularity, we introduce the concept of song commonality. Building a Graph of Frequency Co-occurrence We start by reading our hash values back from Cassandra and re-establishing the list of song IDs for each distinct hash. Once we have this, we can count the number of hashes for each song using a simple reduceByKey function, and because the audio library is relatively small, we collect and broadcast it to our Spark executors: val hashSongsRDD = sc.cassandraTable[HashSongsPair]("gzet", "hashes") val songHashRDD = hashSongsRDD flatMap { hash => hash.songs map { song => ((hash, song), 1) } } val songTfRDD = songHashRDD map { case ((hash,songId),count) => (songId, count) } reduceByKey(_+_) val songTf = sc.broadcast(songTfRDD.collectAsMap()) Next, we build a co-occurrence matrix by getting the cross product of every song sharing a same hash value, and count how many times the same tuple is observed. Finally, we wrap the song IDs and the normalized (using the term frequency we just broadcast) frequency count inside of an Edge class from GraphX: implicit class Crossable[X](xs: Traversable[X]) { def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y) val crossSongRDD = songHashRDD.keys .groupByKey() .values .flatMap { songIds => songIds cross songIds filter { case (from, to) => from != to }.map(_ -> 1) }.reduceByKey(_+_) .map { case ((from, to), count) => val weight = count.toDouble / songTfB.value.getOrElse(from, 1) Edge(from, to, weight) }.filter { edge => edge.attr > minSimilarityB.value } val graph = Graph.fromEdges(crossSongRDD, 0L) We are only keeping edges with a weight (meaning a hash co-occurrence) greater than a predefined threshold in order to build our hash frequency graph. Running PageRank Contrary to what one would normally expect when running a PageRank, our graph is undirected. It turns out that for our recommender, the lack of direction does not matter, since we are simply trying to find similarities between Led Zeppelin and Spirit. 
A possible way of introducing direction could be to look at the song publishing date. In order to find musical influences, we could certainly introduce a chronology from the oldest to newest songs, giving directionality to our edges. In the following pageRank call, we define a probability of 15% to skip, or teleport as it is known, to any random song, but this can obviously be tuned for different needs:
val prGraph = graph.pageRank(0.001, 0.15)
Finally, we extract the page ranked vertices and save them as a playlist in Cassandra via an RDD of the Song case class:
case class Song(id: Long, name: String, commonality: Double)

val vertices = prGraph.vertices.mapPartitions { vertices =>
  val songIds = songIdsB.value
  vertices.map { case (songId, pr) =>
    val songName = songIds.get(songId).get
    Song(songId, songName, pr)
  }
}

vertices.saveAsCassandraTable("gzet", "playlist")
The reader may be pondering the exact purpose of PageRank here, and how it could be used as a recommender. In fact, our use of PageRank means that the highest ranking songs would be the ones that share many frequencies with other songs. This could be due to a common arrangement, key theme, or melody; or maybe because a particular artist was a major influence on a musical trend. However, these songs should be, at least in theory, more popular (by virtue of the fact they occur more often), meaning that they are more likely to have mass appeal. On the other end of the spectrum, low ranking songs are ones where we did not find any similarity with anything we know. Either these songs are so avant-garde that no one has explored these musical ideas before, or alternatively they are so bad that no one ever wanted to copy them! Maybe they were even composed by that up-and-coming artist you were listening to in your rebellious teenage years. Either way, the chance of a random user liking these songs is treated as negligible. Surprisingly, whether it is a pure coincidence or whether this assumption really makes sense, the lowest ranked song from this particular audio library is Daft Punk's Motherboard; it is a title that is quite original (a brilliant one though) with a definite unique sound. To summarize, we have learnt how to build a complete recommendation system for a song playlist. You can check out the book Mastering Spark for Data Science to deep dive into Spark and deliver other production-grade data science solutions. Read our post on how deep learning is revolutionizing the music industry. And here is how you can analyze big data using the pagerank algorithm.

Hypothesis testing with R

Richa Tripathi
13 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Learning Quantitative Finance with R written by Dr. Param Jeet and Prashant Vats. This book will help you understand the basics of R and how they can be applied in various Quantitative Finance scenarios.[/box] Hypothesis testing is used to reject or retain a hypothesis based upon the measurement of an observed sample. So in today’s tutorial we will discuss how to implement the various scenarios of hypothesis testing in R. Lower tail test of population mean with known variance The null hypothesis is given by where is the hypothesized lower bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $10. The average of 30 days' daily return sample is $9.9. Assume the population standard deviation is 0.011. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics z which can be computed by the following code in R: > xbar= 9.9 > mu0 = 10 > sig = 1.1 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z Here: xbar: Sample mean mu: Hypothesized value sig: Standard deviation of population n: Sample size z: Test statistics This gives the value of z the test statistics: [1] -0.4979296 Now let us find out the critical value at 0.05 significance level. It can be computed by the following code: > alpha = .05 > z.alpha = qnorm(1-alpha) > -z.alpha This gives the following output: [1] -1.644854 Since the value of the test statistics is greater than the critical value, we fail to reject the null hypothesis claim that the return is greater than $10. In place of using the critical value test, we can use the pnorm function to compute the lower tail of Pvalue test statistics. This can be computed by the following code: > pnorm(z) This gives the following output: [1] 0.3092668 Since the Pvalue is greater than 0.05, we fail to reject the null hypothesis. Upper tail test of population mean with known variance The null hypothesis is given by  where  is the hypothesized upper bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $5. The average of 30 days' daily return sample is $5.1. Assume the population standard deviation is 0.25. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics z, which can be computed by the following code in R: > xbar= 5.1 > mu0 = 5 > sig = .25 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z Here: xbar: Sample mean mu0: Hypothesized value sig: Standard deviation of population n: Sample size z: Test statistics It gives 2.19089 as the value of test statistics. Now let us calculate the critical value at .05 significance level, which is given by the following code: > alpha = .05 > z.alpha = qnorm(1-alpha) > z.alpha This gives 1.644854, which is less than the value computed for the test statistics. Hence we reject the null hypothesis claim. Also, the Pvalue of the test statistics is given as follows: >pnorm(z, lower.tail=FALSE) This gives 0.01422987, which is less than 0.05 and hence we reject the null hypothesis. Two-tailed test of population mean with known variance The null hypothesis is given by  where  is the hypothesized value of the population mean. Let us assume a scenario where the mean of daily returns of a stock last year is $2. The average of 30 days' daily return sample is $1.5 this year. 
Assume the population standard deviation is .1. Can we reject the null hypothesis that there is not much significant difference in returns this year from last year at .05 significance level? Now let us calculate the test statistic z, which can be computed by the following code in R:
> xbar= 1.5
> mu0 = 2
> sig = .1
> n = 30
> z = (xbar-mu0)/(sig/sqrt(n))
> z
This gives the value of the test statistic as -27.38613. Now let us try to find the critical values for comparing the test statistic at .05 significance level. This is given by the following code:
>alpha = .05
>z.half.alpha = qnorm(1-alpha/2)
>c(-z.half.alpha, z.half.alpha)
This gives the values -1.959964, 1.959964. Since the value of the test statistic is not within the range (-1.959964, 1.959964), we reject the claim of the null hypothesis that there is not much significant difference in returns this year from last year at .05 significance level. The two-tailed p-value of the test statistic is given as follows:
>2*pnorm(z)
This gives a value less than .05, so we reject the null hypothesis.
In all the preceding scenarios, the population variance is known and we use the normal distribution for hypothesis testing. However, in the next scenarios, we will not be given the variance of the population, so we will be using the t distribution for testing the hypothesis.
Lower tail test of population mean with unknown variance
The null hypothesis is given by μ ≥ μ0, where μ0 is the hypothesized lower bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $1. The average of 30 days' daily return sample is $.9. Assume the sample standard deviation is .1. Can we reject the null hypothesis at .05 significance level? In this scenario, we can compute the test statistic by executing the following code:
> xbar= .9
> mu0 = 1
> sig = .1
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t
Here:
xbar: Sample mean
mu0: Hypothesized value
sig: Standard deviation of sample
n: Sample size
t: Test statistic
This gives the value of the test statistic as -5.477226. Now let us compute the critical value at .05 significance level. This is given by the following code:
> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> -t.alpha
We get the value as -1.699127. Since the value of the test statistic is less than the critical value, we reject the null hypothesis claim. Now, instead of the value of the test statistic, we can use the p-value associated with the test statistic, which is given as follows:
>pt(t, df=n-1)
This results in a value less than .05, so we can reject the null hypothesis claim.
Upper tail test of population mean with unknown variance
The null hypothesis is given by μ ≤ μ0, where μ0 is the hypothesized upper bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $3. The average of 30 days' daily return sample is $3.1. Assume the sample standard deviation is .2. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistic t, which can be computed by the following code in R:
> xbar= 3.1
> mu0 = 3
> sig = .2
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t
Here:
xbar: Sample mean
mu0: Hypothesized value
sig: Standard deviation of sample
n: Sample size
t: Test statistic
This gives 2.738613 as the value of the test statistic. Now let us find the critical value associated with the .05 significance level for the test statistic.
It is given by the following code:
> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> t.alpha
Since the critical value 1.699127 is less than the value of the test statistic, we reject the null hypothesis claim. Also, the p-value associated with the test statistic is given as follows:
>pt(t, df=n-1, lower.tail=FALSE)
This is less than .05. Hence the null hypothesis claim gets rejected.
Two-tailed test of population mean with unknown variance
The null hypothesis is given by μ = μ0, where μ0 is the hypothesized value of the population mean. Let us assume a scenario where the mean of daily returns of a stock last year is $2. The average of 30 days' daily return sample is $1.9 this year. Assume the sample standard deviation is .1. Can we reject the null hypothesis that there is not much significant difference in returns this year from last year at .05 significance level? Now let us calculate the test statistic t, which can be computed by the following code in R:
> xbar= 1.9
> mu0 = 2
> sig = .1
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t
This gives -5.477226 as the value of the test statistic. Now let us try to find the critical value range for comparison, which is given by the following code:
> alpha = .05
> t.half.alpha = qt(1-alpha/2, df=n-1)
> c(-t.half.alpha, t.half.alpha)
This gives the range (-2.04523, 2.04523). Since the value of the test statistic does not lie within this range, we reject the claim of the null hypothesis. We learned how to practically perform one-tailed/two-tailed hypothesis testing with known as well as unknown variance using R. If you enjoyed this excerpt, check out the book Learning Quantitative Finance with R to explore different methods to manage risks and trading using Machine Learning with R.
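For completeness (the following snippet is not part of the original excerpt), the two-tailed p-value for this last example can also be computed directly; it assumes t and n are still set from the code above:
> 2*pt(-abs(t), df=n-1)
The result is far below .05, which agrees with the rejection based on the critical value range.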

How to maintain Apache Mesos

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by David Blomquist and Tomasz Janiszewski, titled Apache Mesos Cookbook. Throughout the course of the book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box] In this article, we will learn about configuring logging options, setting up monitoring ecosystem, and upgrading your Mesos cluster. Logging and debugging Here we will configure logging options that will allow us to debug the state of Mesos. Getting ready We will assume Mesos is available on localhost port 5050. The steps provided here will work for either master or agents. How to do it... When Mesos is installed from pre-built packages, the logs are by default stored in /var/log/mesos/. When installing from a source build, storing logs is disabled by default. To change the log store location, we need to edit /etc/default/mesos and set the LOGS variable to the desired destination. For some reason, mesos-init-wrapper does not transfer the contents of /etc/mesos/log_dir to the --log_dir flag. That's why we need to set the log's destination in the environment variable. Remember that only Mesos logs will be stored there. Logs from third-party applications (for example, ZooKeeper) will still be sent to STDERR. Changing the default logging level can be done in one of two ways: by specifying the -- logging_level flag or by sending a request and changing the logging level at runtime for a specific period of time. For example, to change the logging level to INFO, just put it in the following code: /etc/mesos/logging_level echo INFO > /etc/mesos/logging_level The possible levels are INFO, WARNING, and ERROR. For example, to change the logging level to the most verbose for 15 minutes for debug purposes, we need to send the following request to the logging/toggle endpoint: curl -v -X POST localhost:5050/logging/toggle?level=3&duration=15mins How it works... Mesos uses the Google-glog library for debugging, but third-party dependencies such as ZooKeeper have their own logging solution. All configuration options are backed by glog and apply only to Mesos core code. Monitoring Now, we will set up monitoring for Mesos. Getting ready We must have a running monitoring ecosystem. Metrics storage could be a simple time- series database such as graphite, influxdb, or prometheus. In the following example, we are using graphite and our metrics are published with http://diamond.readthedocs.io/en/latest/. How to do it... Monitoring is enabled by default. Mesos does not provide any way to automatically push metrics to the registry. However, it exposes them as a JSON that can be periodically pulled and saved into the metrics registry:  Install Diamond using following command: pip install diamond  If additional packages are required to install them, run: sudo apt-get install python-pip python-dev build-essential. pip (Pip Installs Packages) is a Python package manager used to install software written in Python. Configure the metrics handler and interval. Open /etc/diamond/diamond.conf and ensure that there is a section for graphite configuration: [handler_graphite] class = handlers.GraphiteHandler host = <graphite.host> port = <graphite.port> Remember to replace graphite.host and graphite.port with real graphite details. Enable the default Mesos Collector. Create configuration files diamond-setup  - C MesosCollector. Check whether the configuration has proper values and edit them if needed. 
The configuration can be found in /etc/diamond/collectors/MesosCollector.conf. On master, this file should look like this: enabled = True host = localhost port = 5050 While on agent, the port could be different (5051), as follows: enabled = True host = localhost port = 5051 How it works... Mesos exposes metrics via the HTTP API. Diamond is a small process that periodically pulls metrics, parses them, and sends them to the metrics registry, in this case, graphite. The default implementation of Mesos Collector does not store all the available metrics so it's recommended to write a custom handler that will collect all the interesting information. See also... Metrics could be read from the following endpoints: http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/ http://mesos.apache.org/documentation/latest/endpoints/slave/monitor/statistics/  http://mesos.apache.org/documentation/latest/endpoints/slave/state/ Upgrading Mesos In this recipe, you will learn how to upgrade your Mesos cluster. How to do it... Mesos release cadence is at least one release per quarter. Minor releases are backward compatible, although there could be some small incompatibilities or the dropping of deprecated methods. The recommended method of upgrading is to apply all intermediate versions. For example, to upgrade from 0.27.2 to 1.0.0, we should apply 0.28.0, 0.28.1, 0.28.2, and finally 1.0.0. If the agent's configuration changes, clearing the metadata directory is required. You can do this with the following code: rm -rv {MESOS_DIR}/metadata Here, {MESOS_DIR} should be replaced with the configured Mesos directory. Rolling upgrades is the preferred method of upgrading clusters, starting with masters and then agents. To minimize the impact on running tasks, if an agent's configuration changes and it becomes inaccessible, then it should be switched to maintenance mode. How it works... Configuration changes may require clearing the metadata because the changes may not be backward compatible. For example, when an agent runs with different isolators, it shouldn't attach to the already running processes without this isolator. The Mesos architecture will guarantee that the executors that were not attached to the Mesos agent will commit suicide after a configurable amount of time (--executor_registration_timeout). Maintenance mode allows you to declare the time window during which the agent will be inaccessible. When this occurs, Mesos will send a reverse offer to all the frameworks to drain that particular agent. The frameworks are responsible for shutting down its task and spawning it on another agent. The Maintenance mode is applied, even if the framework does not implement the HTTP API or is explicitly declined. Using maintenance mode can prevent restarting tasks multiple times. Consider the following example with five agents and one task, X. We schedule the rolling upgrade of all the agents. Task X is deployed on agent 1. When it goes down, it's moved to 2, then to 3, and so on. This approach is extremely inefficient because the task is restarted five times, but it only needs to be restarted twice. Maintenance mode enables the framework to optimally schedule the task to run on agent 5 when 1 goes down, and then return to 1 when 5 goes down: Worst case scenario of rolling upgrade without maintenance mode legend optimal solution of rolling upgrade with maintenance mode. We have learnt about running and maintaining Mesos. 
To know more about managing containers and understanding the scheduler API, you may check out this book, Apache Mesos Cookbook.

Estimating population statistics with Point Estimation

Aaron Lazar
12 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an extract from the book Principles of Data Science, written by Sinan Ozdemir. The book is a great way to get into the field of data science. It takes a unique approach that bridges the gap between mathematics and computer science, taking you through the entire data science pipeline.[/box] In this extract, we’ll learn how to estimate population means, variances and other statistics using the Point Estimation method. For the code samples, we’ve used Python 2.7. A point estimate is an estimate of a population parameter based on sample data. To obtain these estimates, we simply apply the function that we wish to measure for our population to a sample of the data. For example, suppose there is a company of 9,000 employees and we are interested in ascertaining the average length of breaks taken by employees in a single day. As we probably cannot ask every single person, we will take a sample of the 9,000 people and take a mean of the sample. This sample mean will be our point estimate. The following code is broken into three parts: We will use the probability distribution, known as the Poisson distribution, to randomly generate 9,000 answers to the question: for how many minutes in a day do you usually take breaks? This will represent our "population". We will take a sample of 100 employees (using the Python random sample method) and find a point estimate of a mean (called a sample mean). Compare our sample mean (the mean of the sample of 100 employees) to our population mean. Let's take a look at the following code: np.random.seed(1234) long_breaks = stats.poisson.rvs(loc=10, mu=60, size=3000) # represents 3000 people who take about a 60 minute break The long_breaks variable represents 3000 answers to the question: how many minutes on an average do you take breaks for?, and these answers will be on the longer side. Let's see a visualization of this distribution, shown as follows: pd.Series(long_breaks).hist() We see that our average of 60 minutes is to the left of the distribution. Also, because we only sampled 3000 people, our bins are at their highest around 700-800 people. Now, let's model 6000 people who take, on an average, about 15 minutes' worth of breaks. Let's again use the Poisson distribution to simulate 6000 people, as shown: short_breaks = stats.poisson.rvs(loc=10, mu=15, size=6000) # represents 6000 people who take about a 15 minute break pd.Series(short_breaks).hist() Okay, so we have a distribution for the people who take longer breaks and a distribution for the people who take shorter breaks. Again, note how our average break length of 15 minutes falls to the left-hand side of the distribution, and note that the tallest bar is about 1600 people. breaks = np.concatenate((long_breaks, short_breaks)) # put the two arrays together to get our "population" of 9000 people The breaks variable is the amalgamation of all the 9000 employees, both long and short break takers. Let's see the entire distribution of people in a single visualization: pd.Series(breaks).hist() We see how we have two humps. On the left, we have our larger hump of people who take about a 15 minute break, and on the right, we have a smaller hump of people who take longer breaks. Later on, we will investigate this graph further. We can find the total average break length by running the following code: breaks.mean() # 39.99 minutes is our parameter Our average company break length is about 40 minutes. 
Remember that our population is the entire company of 9,000 employees, and our parameter is 40 minutes. In the real world, our goal would be to estimate the population parameter, because we would not have the resources to survey every single employee about their average break length. Instead, we will use a point estimate.

So, to make our point, we want to simulate a world where we ask 100 random people about the length of their breaks. To do this, let's take a random sample of 100 employees out of the 9,000 employees we simulated, as shown:

sample_breaks = np.random.choice(a=breaks, size=100)
# taking a sample of 100 employees

Now, let's take the mean of the sample, subtract it from the population mean, and see how far off we were:

breaks.mean() - sample_breaks.mean()
# difference between means is 4.09 minutes, not bad!

This is extremely interesting, because with only about 1% of our population (100 out of 9,000), we were able to get within about 4 minutes of our population parameter and obtain a very accurate estimate of our population mean. Not bad!

Here, we calculated a point estimate for the mean, but we can also do this for proportion parameters. By proportion, I am referring to a ratio of two quantitative values. Let's suppose that in a company of 10,000 people, our employees are 20% white, 10% black, 10% Hispanic, 30% Asian, and 30% identifying as other. We will take a sample of 1,000 employees and see whether their race proportions are similar.

import random

employee_races = (["white"]*2000) + (["black"]*1000) + \
                 (["hispanic"]*1000) + (["asian"]*3000) + \
                 (["other"]*3000)

employee_races represents our employee population. For example, in our company of 10,000 people, 2,000 people are white (20%) and 3,000 people are Asian (30%). Let's take a random sample of 1,000 people, as shown:

demo_sample = random.sample(employee_races, 1000) # Sample 1000 values

for race in set(demo_sample):
    print( race + " proportion estimate:" )
    print( demo_sample.count(race)/1000. )

The output obtained would be as follows:

hispanic proportion estimate:
0.103
white proportion estimate:
0.192
other proportion estimate:
0.288
black proportion estimate:
0.1
asian proportion estimate:
0.317

We can see that the race proportion estimates are very close to the underlying population's proportions. For example, we got 10.3% for Hispanic in our sample, while the population proportion for Hispanic was 10%.

To summarize, you should now be familiar with the point estimation method for estimating population means, variances, and other statistics, and with how to implement it in Python. If you found our post useful, you can check out Principles of Data Science for more interesting data science tips and techniques.
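As a closing aside (our own extension of the example above, not part of the book extract), you can see how the quality of the point estimate of the mean depends on the sample size by repeating the sampling for a few different sizes:

for n in [10, 100, 1000, 5000]:
    size_sample = np.random.choice(a=breaks, size=n)
    print( "sample size %d: off by %.2f minutes" % (n, abs(breaks.mean() - size_sample.mean())) )

Larger samples tend to land closer to the 40-minute parameter, which is what makes point estimation practical: you decide how much accuracy you need and sample accordingly.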