How-To Tutorials

article-image-perform-regression-analysis-using-sas

27 Feb 2018

7 min read

How to perform regression analysis using SAS

27 Feb 2018

[box type="note" align="" class="" width=""]This article is an excerpt from the book, Big Data Analysis with SAS written by David Pope. This book will help you leverage the power of SAS for data management, analysis and reporting. It contains practical use-cases and real-world examples on predictive modelling, forecasting, optimizing, and reporting your Big Data analysis using SAS.[/box] Today, we will perform regression analysis using SAS in a step-by-step manner with a practical use-case. Regression analysis is one of the earliest predictive techniques most people learn because it can be applied across a wide variety of problems dealing with data that is related in linear and non-linear ways. Linear data is one of the easier use cases, and as such PROC REG is a well-known and often-used procedure to help predict likely outcomes before they happen. The REG procedure provides extensive capabilities for fitting linear regression models that involve individual numeric independent variables. Many other procedures can also fit regression models, but they focus on more specialized forms of regression, such as robust regression, generalized linear regression, nonlinear regression, nonparametric regression, quantile regression, regression modeling of survey data, regression modeling of survival data, and regression modeling of transformed variables. The SAS/STAT procedures that can fit regression models include the ADAPTIVEREG, CATMOD, GAM, GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOESS, LOGISTIC, MIXED, NLIN, NLMIXED, ORTHOREG, PHREG, PLS, PROBIT, QUANTREG, QUANTSELECT, REG, ROBUSTREG, RSREG, SURVEYLOGISTIC, SURVEYPHREG, SURVEYREG, TPSPLINE, and TRANSREG procedures. Several procedures in SAS/ETS software also fit regression models. SAS/STAT14.2 / SAS/STAT User's Guide - Introduction to Regression Procedures - Overview: Regression Procedures (http://documentation.sas.com/?cdcId=statcdccdcVersion=14.2 docsetId=statugdocsetTarget=statug_introreg_sect001.htmlocale=enshowBanner=yes). Regression analysis attempts to model the relationship between a response or output variable and a set of input variables. The response is considered the target variable or the variable that one is trying to predict, while the rest of the input variables make up parameters used as input into the algorithm. They are used to derive the predicted value for the response variable. PROC REG One of the easiest ways to determine if regression analysis is applicable to helping you answer a question is if the type of question being asked has only two answers. For example, should a bank lend an applicant money? Yes or no? This is known as a binary response, and as such, regression analysis can be applied to help determine the answer. In the following example, the reader will use the SASHELP.BASEBALL dataset to create a regression model to predict the value of a baseball player's salary. The SASHELP.BASEBALL dataset contains salary and performance information for Major League. Baseball players who played at least one game in both the 1986 and 1987 seasons, excluding pitchers. The salaries (Sports Illustrated, April 20, 1987) are for the 1987 season and the performance measures are from 1986 (Collier Books, The 1987 Baseball Encyclopedia Update). SAS/STAT® 14.2 / SAS/STAT User's Guide - Example 99: Modeling Salaries of Major League Baseball Players (http://documentation.sas.com/ ?cdcId= statcdc cdcVersion= 14.2 docsetId=statugdocsetTarget= statug_ reg_ examples01.htmlocale= en showBanner= yes). Let's first use PROC UNIVARIATE to learn something about this baseball data by submitting the following code: proc univariate data=sashelp.baseball; quit; While reviewing the results of the output, the reader will notice that the variance associated with logSalary, 0.79066, is much less than the variance associated with the actual target variable Salary, 203508. In this case, it makes better sense to attempt to predict the logSalary value of a player instead of Salary. Write the following code in a SAS Studio program section and submit it: proc reg data=sashelp.baseball; id name team league; model logSalary = nAtBat nHits nHome nRuns nRBI YrMajor CrAtBat CrHits CrHome CrRuns CrRbi; Quit; Notice that there are 59 observations as specified in the first output table with at least one of the input variables with missing values; as such those are not used in the development of the regression model. The Root Mean Squared Error (RMSE) and R-square are statistics that typically inform the analyst how good the model is in predicting the target. These range from 0 to 1.0 with higher values typically indicating a better model. The higher the Rsquared values typically indicate a better performing model but sometimes conditions or the data used to train the model over-fit and don't represent the true value of the prediction power of that particular model. Over-fitting can happen when an analyst doesn't have enough real-life data and chooses data or a sample of data that over-presents the target event, and therefore it will produce a poor performing model when using real-world data as input. Since several of the input values appear to have little predictive power on the target, an analyst may decide to drop these variables, thereby reducing the need for that information to make a decent prediction. In this case, it appears we only need to use four input variables. YrMajor, nHits, nRuns, and nAtBat. Modify the code as follows and submit it again: proc reg data=sashelp.baseball; id name team league; model logSalary = YrMajor nHits nRuns nAtBat; Quit; The p-value associated with each of the input variables provides the analyst with an insight into which variables have the biggest impact on helping to predict the target variable. In this case, the smaller the value, the higher the predictive value of the input variable. Both the RMSE and R-square values for this second model are slightly lower than the original. However, the adjusted R-square value is slightly higher. In this case, an analyst may chose to use the second model since it requires much less data and provides basically the same predictive power. Prior to accepting any model, an analyst should determine whether there are a few observations that may be over-influencing the results by investigating the influence and fit diagnostics. The default output from PROC REG provides this type of visual insight: The top-right corner plot, showing the externally studentized residuals (RStudent) by leverage values, shows that there are a few observations with high leverage that may be overly influencing the fit produced. In order to investigate this further, we will add a plots statement to our PROC REG to produce a labeled version of this plot. Type the following code in a SAS Studio program section and submit: proc reg data=sashelp.baseball plots(only label)=(RStudentByLeverage); id name team league; model logSalary = YrMajor nHits nRuns nAtBat; Quit; Sure enough, there are three to five individuals whose input variables may have excessive influence on fitting this model. Let's remove those points and see if the model improves. Type this code in a SAS Studio program section and submit it: proc reg data=sashelp.baseball plots=(residuals(smooth)); where name NOT IN ("Mattingly, Don", "Henderson, Rickey", "Boggs, Wade", "Davis, Eric", "Rose, Pete"); id name team league; model logSalary = YrMajor nHits nRuns nAtBat; Quit; This change, in itself, has not improved the model but actually made the model worse as can be seen by the R-square, 0.5592. However, the plots residuals(smooth) option gives some insights as it pertains to YrMajor; players at the beginning and the end of their careers tend to be paid less compared to others, as can be seen in Figure 4.12: In order to address this lack of fit, an analyst can use polynomials of degree two for this variable, YrMajor. Type the following code in a SAS Studio program section and submit it: data work.baseball; set sashelp.baseball; where name NOT IN ("Mattingly, Don", "Henderson, Rickey", "Boggs, Wade", "Davis, Eric", "Rose, Pete"); YrMajor2 = YrMajor*YrMajor; run; proc reg data=work.baseball; id name team league; model logSalary = YrMajor YrMajor2 nHits nRuns nAtBat; Quit; After removing some outliers and adjusting for the YrMajor variable, the model's predictive power has improved significantly as can be seen in the much improved R-square value of 0.7149. We saw an effective way of performing regression analysis using SAS platform. If you found our post useful, do check out this book Big Data Analysis with SAS to understand other data analysis models and perform them practically using SAS.

0
0
6482

article-image-how-to-win-kaggle-competition-with-apache-sparkml

Savia Lobo

27 Feb 2018

11 min read

How to win Kaggle competition with Apache SparkML

Savia Lobo

27 Feb 2018

11 min read

[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Mastering Apache Spark 2.x - Second Edition written by Romeo Kienzler. The book will introduce you to Project Tungsten and Catalyst, two of the major advancements of Apache Spark 2.x.[/box] In today’s tutorial we will show how to take advantage of Apache SparkML to win a Kaggle competition. We'll use an archived competition offered by BOSCH, a German multinational engineering and electronics company, on production line performance data. The data for this competition represents measurement of parts as they move through Bosch's production line. Each part has a unique Id. The goal is to predict which part will fail quality control (represented by a 'Response' = 1). For more details on the competition data you may visit the website: https://www.kaggle.com/c/bosch-production-line-p erformance/data. Data preparation The challenge data comes in three ZIP packages but we only use two of them. One contains categorical data, one contains continuous data, and the last one contains timestamps of measurements, which we will ignore for now. If you extract the data, you'll get three large CSV files. So the first thing that we want to do is re-encode them into parquet in order to be more space-efficient: def convert(filePrefix : String) = { val basePath = "yourBasePath" var df = spark .read .option("header",true) .option("inferSchema", "true") .csv("basePath+filePrefix+".csv") df = df.repartition(1) df.write.parquet(basePath+filePrefix+".parquet") } convert("train_numeric") convert("train_date") convert("train_categorical") First, we define a function convert that just reads the .csv file and rewrites it as a .parquet file. As you can see, this saves a lot of space: Now we read the files in again as DataFrames from the parquet files : var df_numeric = spark.read.parquet(basePath+"train_numeric.parquet") var df_categorical = spark.read.parquet(basePath+"train_categorical.parquet") Here is the output of the same: This is very high-dimensional data; therefore, we will take only a subset of the columns for this illustration: df_categorical.createOrReplaceTempView("dfcat") var dfcat = spark.sql("select Id, L0_S22_F545 from dfcat") In the following picture, you can see the unique categorical values of that column: Now let's do the same with the numerical dataset: df_numeric.createOrReplaceTempView("dfnum") var dfnum = spark.sql("select Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,Response from dfnum") Here is the output of the same: Finally, we rejoin these two relations: var df = dfcat.join(dfnum,"Id") df.createOrReplaceTempView("df") Then we have to do some NA treatment: var df_notnull = spark.sql(""" select Response as label, case when L0_S22_F545 is null then 'NA' else L0_S22_F545 end as L0_S22_F545, case when L0_S0_F0 is null then 0.0 else L0_S0_F0 end as L0_S0_F0, case when L0_S0_F2 is null then 0.0 else L0_S0_F2 end as L0_S0_F2, case when L0_S0_F4 is null then 0.0 else L0_S0_F4 end as L0_S0_F4 from df """) Feature engineering Now it is time to run the first transformer (which is actually an estimator). It is StringIndexer and needs to keep track of an internal mapping table between strings and indexes. Therefore, it is not a transformer but an estimator: import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer} var indexer = new StringIndexer() .setHandleInvalid("skip") .setInputCol("L0_S22_F545") .setOutputCol("L0_S22_F545Index") var indexed = indexer.fit(df_notnull).transform(df_notnull) indexed.printSchema As we can see clearly in the following image, an additional column called L0_S22_F545Index has been created: Finally, let's examine some content of the newly created column and compare it with the source column. We can clearly see how the category string gets transformed into a float index: Now we want to apply OneHotEncoder, which is a transformer, in order to generate better features for our machine learning model: var encoder = new OneHotEncoder() .setInputCol("L0_S22_F545Index") .setOutputCol("L0_S22_F545Vec") var encoded = encoder.transform(indexed) As you can see in the following figure, the newly created column L0_S22_F545Vec contains org.apache.spark.ml.linalg.SparseVector objects, which is a compressed representation of a sparse vector: Note: Sparse vector representations: The OneHotEncoder, as many other algorithms, returns a sparse vector of the org.apache.spark.ml.linalg.SparseVector type as, according to the definition, only one element of the vector can be one, the rest has to remain zero. This gives a lot of opportunity for compression as only the position of the elements that are non-zero has to be known. Apache Spark uses a sparse vector representation in the following format: (l,[p],[v]), where l stands for length of the vector, p for position (this can also be an array of positions), and v for the actual values (this can be an array of values). So if we get (13,[10],[1.0]), as in our earlier example, the actual sparse vector looks like this: (0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0). So now that we are done with our feature engineering, we want to create one overall sparse vector containing all the necessary columns for our machine learner. This is done using VectorAssembler: import org.apache.spark.ml.feature.VectorAssembler import org.apache.spark.ml.linalg.Vectors var vectorAssembler = new VectorAssembler() .setInputCols(Array("L0_S22_F545Vec", "L0_S0_F0", "L0_S0_F2","L0_S0_F4")) setOutputCol("features") var assembled = vectorAssembler.transform(encoded) We basically just define a list of column names and a target column, and the rest is done for us: As the view of the features column got a bit squashed, let's inspect one instance of the feature field in more detail: We can clearly see that we are dealing with a sparse vector of length 16 where positions 0, 13, 14, and 15 are non-zero and contain the following values: 1.0, 0.03, -0.034, and -0.197. Done! Let's create a Pipeline out of these components. Testing the feature engineering pipeline Let's create a Pipeline out of our transformers and estimators: import org.apache.spark.ml.Pipeline import org.apache.spark.ml.PipelineModel //Create an array out of individual pipeline stages var transformers = Array(indexer,encoder,assembled) var pipeline = new Pipeline().setStages(transformers).fit(df_notnull) var transformed = pipeline.transform(df_notnull) Note that the setStages method of Pipeline just expects an array of transformers and estimators, which we had created earlier. As parts of the Pipeline contain estimators, we have to run fit on our DataFrame first. The obtained Pipeline object takes a DataFrame in the transform method and returns the results of the transformations: As expected, we obtain the very same DataFrame as we had while running the stages individually in a sequence. Training the machine learning model Now it's time to add another component to the Pipeline: the actual machine learning algorithm-RandomForest: import org.apache.spark.ml.classification.RandomForestClassifier var rf = new RandomForestClassifier() .setLabelCol("label") .setFeaturesCol("features") var model = new Pipeline().setStages(transformers :+ rf).fit(df_notnull) var result = model.transform(df_notnull) This code is very straightforward. First, we have to instantiate our algorithm and obtain it as a reference in rf. We could have set additional parameters to the model but we'll do this later in an automated fashion in the CrossValidation step. Then, we just add the stage to our Pipeline, fit it, and finally transform. The fit method, apart from running all upstream stages, also calls fit on the RandomForestClassifier in order to train it. The trained model is now contained within the Pipeline and the transform method actually creates our predictions column: As we can see, we've now obtained an additional column called prediction, which contains the output of the RandomForestClassifier model. Of course, we've only used a very limited subset of available features/columns and have also not yet tuned the model, so we don't expect to do very well; however, let's take a look at how we can evaluate our model easily with Apache SparkML. Model evaluation Without evaluation, a model is worth nothing as we don't know how accurately it performs. Therefore, we will now use the built-in BinaryClassificationEvaluator in order to assess prediction performance and a widely used measure called areaUnderROC (going into detail here is beyond the scope of this book): import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator val evaluator = new BinaryClassificationEvaluator() import org.apache.spark.ml.param.ParamMap var evaluatorParamMap = ParamMap(evaluator.metricName -> "areaUnderROC") var aucTraining = evaluator.evaluate(result, evaluatorParamMap) As we can see, there is a built-in class called org.apache.spark.ml.evaluation.BinaryClassificationEvaluator and there are some other classes for other prediction use cases such as RegressionEvaluator or MulticlassClassificationEvaluator. The evaluator takes a parameter map--in this case, we are telling it to use the areaUnderROC metric--and finally, the evaluate method evaluates the result: As we can see, areaUnderROC is 0.5424418446501833. An ideal classifier would return a score of one. So we are only doing a bit better than random guesses but, as already stated, the number of features that we are looking at is fairly limited. Note : In the previous example we are using the areaUnderROC metric which is used for evaluation of binary classifiers. There exist an abundance of other metrics used for different disciplines of machine learning such as accuracy, precision, recall and F1 score. The following provides a good overview http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf This areaUnderROC is in fact a very bad value. Let's see if choosing better parameters for our RandomForest model increases this a bit in the next section. This areaUnderROC is in fact a very bad value. Let's see if choosing better parameters for our RandomForest model increases this a bit in the next section. CrossValidation and hyperparameter tuning As explained before, a common step in machine learning is cross-validating your model using testing data against training data and also tweaking the knobs of your machine learning algorithms. Let's use Apache SparkML in order to do this for us, fully automated! First, we have to configure the parameter map and CrossValidator: import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder} var paramGrid = new ParamGridBuilder() .addGrid(rf.numTrees, 3 :: 5 :: 10 :: 30 :: 50 :: 70 :: 100 :: 150 :: Nil) .addGrid(rf.featureSubsetStrategy, "auto" :: "all" :: "sqrt" :: "log2" :: "onethird" :: Nil) .addGrid(rf.impurity, "gini" :: "entropy" :: Nil) .addGrid(rf.maxBins, 2 :: 5 :: 10 :: 15 :: 20 :: 25 :: 30 :: Nil) .addGrid(rf.maxDepth, 3 :: 5 :: 10 :: 15 :: 20 :: 25 :: 30 :: Nil) .build() var crossValidator = new CrossValidator() .setEstimator(new Pipeline().setStages(transformers :+ rf)) .setEstimatorParamMaps(paramGrid) .setNumFolds(5) .setEvaluator(evaluator) var crossValidatorModel = crossValidator.fit(df_notnull) var newPredictions = crossValidatorModel.transform(df_notnull) The org.apache.spark.ml.tuning.ParamGridBuilder is used in order to define the hyperparameter space where the CrossValidator has to search and finally, the org.apache.spark.ml.tuning.CrossValidator takes our Pipeline, the hyperparameter space of our RandomForest classifier, and the number of folds for the CrossValidation as parameters. Now, as usual, we just need to call fit and transform on the CrossValidator and it will basically run our Pipeline multiple times and return a model that performs the best. Do you know how many different models are trained? Well, we have five folds on CrossValidation and five-dimensional hyperparameter space cardinalities between two and eight, so let's do the math: 5 * 8 * 5 * 2 * 7 * 7 = 19600 times! Using the evaluator to assess the quality of the cross-validated and tuned model Now that we've optimized our Pipeline in a fully automatic fashion, let's see how our best model can be obtained: var bestPipelineModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel] var stages = bestPipelineModel.stages import org.apache.spark.ml.classification.RandomForestClassificationModel val rfStage = stages(stages.length-1).asInstanceOf[RandomForestClassificationModel] rfStage.getNumTrees rfStage.getFeatureSubsetStrategy rfStage.getImpurity rfStage.getMaxBins rfStage.getMaxDepth The crossValidatorModel.bestModel code basically returns the best Pipeline. Now we use bestPipelineModel.stages to obtain the individual stages and obtain the tuned RandomForestClassificationModel using stages(stages.length 1).asInstanceOf[RandomForestClassificationModel]. Note that stages.length-1 addresses the last stage in the Pipeline, which is our RandomForestClassifier. So now, we can basically run evaluator using the best model and see how it performs: You might have noticed that 0.5362224872557545 is less than 0.5424418446501833, as we've obtained before. So why is this the case? Actually, this time we used cross-validation, which means that the model is less likely to over fit and therefore the score is a bit lower. So let's take a look at the parameters of the best model: Note that we've limited the hyperparameter space, so numTrees, maxBins, and maxDepth have been limited to five, and bigger trees will most likely perform better. So feel free to play around with this code and add features, and also use a bigger hyperparameter space, say, bigger trees. Finally, we've applied the concepts that we discussed on a real dataset from a Kaggle competition, which is a good starting point for your own machine learning project with Apache SparkML. If you found our post useful, do check out this book Mastering Apache Spark 2.x - Second Edition to know more about advanced analytics on your Big Data with latest Apache Spark 2.x.

0
0
3064

article-image-how-to-query-sharded-data-in-mongodb

Amey Varangaonkar

26 Feb 2018

6 min read

How to query sharded data in MongoDB

Amey Varangaonkar

26 Feb 2018

6 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering MongoDB 3.x written by Alex Giamas. This book covers the essential as well as advanced administration concepts in MongoDB.[/box] Querying data using a MongoDB shard is different than a single server deployment or a replica set. Instead of connecting to the single server or the primary of the replica set, we connect to the mongos router which decides which shard to ask for our data. In this article, we will explore how the MongoDB query router operates and showcase how the process is similar to working with a replica set, using Ruby. The query router The query router, also known as mongos process, acts as the interface and entry point to our MongoDB cluster. Applications connect to it instead of connecting to the underlying shards and replica sets; mongos executes queries, gathers results, and passes them to our application. mongos doesn't hold any persistent state and is typically low on system resources, and is typically hosted in the same instance as the application server. It is acting as a proxy for requests. When a query comes in, mongos will examine and decide which shards need to execute the query and establish a cursor in each one of them. The Find operation If our query includes the shard key or a prefix of the shard key, mongos will perform a targeted operation, only querying the shards that hold the keys that we are looking for. For example, with a composite shard key of {_id, email, address} on our collection User, we can have a targeted operation with any of the following queries: > db.User.find({_id: 1}) > db.User.find({_id: 1, email: '[email protected]'}) > db.User.find({_id: 1, email: '[email protected]', address: 'Linwood Dunn'}) All three of them are either a prefix (the first two) or the complete shard key. On the other hand, a query on {email, address} or {address} will not be able to target the right shards, resulting in a broadcast operation. A broadcast operation is any operation that doesn't include the shard key or a prefix of the shard key and results in mongos querying every shard and gathering results from them. It's also known as a scatter-and-gather operation or a fanout query. Sort/limit/skip operations If we want to sort our results, there are two options: If we are using the shard key in our sort criteria, then mongos can determine the order in which it has to query the shard or shards. This results in an efficient and, again, targeted operation. If we are not using the shard key in our sort criteria, then as with a query without sort, it's going to be a fanout query. To sort the results when we are not using the shard key, the primary shard executes a distributed merge sort locally before passing on the sorted result set to mongos. Limit on queries is enforced on each individual shard and then again at the mongos level as there may be results from multiple shards. Skip, on the other hand, cannot be passed on to individual shards and will be applied by mongos after retrieving all the results locally. Update/remove operations In document modifier operations like update and remove, we have a similar situation to find. If we have the shard key in the find section of the modifier, then mongos can direct the query to the relevant shard. If we don't have the shard key in the find section, then it will again be a fanout operation. In essence, we have the following cases for operations with sharding: Type of operation Query topology insert Must have the shard key update Can have the shard key Query with shard key Targeted operation Query without shard key Scatter gather/fanout query Indexed/sorted query with shard key Targeted operation Indexed/sorted query without shard key Distributed sort merge Querying using Ruby Connecting to a sharded cluster using Ruby is no different than connecting to a replica set. Using the Ruby official driver we have to configure the client object to define the set of mongos servers: client = Mongo::Client.new('mongodb://key:password@mongos-server1- Host:mongos-server1-port,mongos-server2-host:mongos-server2- port/admin?ssl=true&authSource=admin') The mongo-ruby-driver will then return a client object that is no different than connecting to a replica set from the Mongo Ruby client. We can then use the client object like we did in previous chapters, with all the caveats around how sharding behaves differently than a standalone server or a replica set with regards to querying and performance. Performance comparison with replica sets Developers and architects are always looking out for ways to compare performance between replica sets and sharded configurations. The way MongoDB implements sharding, it is based on top of replica sets. Every shard in production should be a replica set. The main difference in performance comes from fan out queries. When we are querying without the shard key, MongoDB's execution time is limited by the worst-performing replica set. In addition, when using sorting without the shard key, the primary server has to implement the distributed merge sort on the entire dataset. This means that it has to collect all data from different shards, merge-sort them, and pass them as sorted to mongos. In both cases, network latency and limitations in bandwidth can slow down operations as opposed to a replica set. On the flip side, by having three shards, we can distribute our working set requirements across different nodes, thus serving results from RAM instead of reaching out to the underlying storage, HDD or SSD. On the other hand, writes can be sped up significantly since we are no longer bound by a single node's I/O capacity but we can have writes in as many nodes as there are shards. To sum up, in most cases and especially for the cases that we are using the shard key, both queries and modification operations will be significantly sped up by sharding. If you found this post useful, check out our book Mastering MongoDB 3.x for more tips and techniques on sharding, replication and other database administration tasks related to MongoDB.

0
0
10335

article-image-performing-descriptive-analysis-with-sas

Gebin George

26 Feb 2018

5 min read

Performing descriptive analysis with SAS

Gebin George

26 Feb 2018

5 min read

This article is an excerpt from a book written by David Pope titled Big Data Analysis with SAS. This book will help you combine SAS with platforms such as Hadoop, SAP HANA, and Cloud Foundry-based platforms for efficient Big Data analytics. In today’s tutorial, we will perform descriptive analysis using SAS with practical use-cases. The following are few examples of descriptive analysis. Let us take a look at each one in detail. PROC FREQ How many males versus females are in a particular table, say SASHELP.CLASS? PROC FREQ can be used to easily find the answer to this type of question. Type the following code in a SAS Studio program section and submit it: proc freq data=sashelp.class; tables sex; Quit; If you remove the tables statement, then, by default, PROC FREQ produces a one-frequency table for all the variables within the dataset. PROC CORR Are the height and weight of a fish related to each other, and do their lengths have any impact on this relationship if it exists? PROC CORR can be used to determine this. In these examples, the plots option will be used to provide more insights by producing an additional graphic plot output along with the statistical results. Type the following code in a SAS Studio program section and submit it: proc corr data=sashelp.fish plots=matrix(histogram); var height weight length1 length2 length3; Quit; The simple statistics table provides the descriptive univariate statistics for all five variables listed in the var statement. An insight regarding a very minor data quality issue can be seen in this table—one of the 159 observations in this dataset is missing a value for weight. The higher the Pearson correlation coefficient for a pair of variables (which means closer to 1.0), the stronger the relationship between the variables. While height and weight do have a strong relationship, it is interesting to note that the relationships of weight to all three length variables are stronger than the relationships of height to all three length Variables: In this next example, the code is still searching for a relationship between height and weight. However, now the relationship is being adjusted for the effect of the partial variables for which the three length variables have been assigned. Instead of requesting a matrix plot, the code requests a scatter plot with three different prediction ellipses. Type the following code in a SAS Studio program section and submit it: proc corr data=sashelp.fish plots=scatter(alpha=.15 .25 .35); var height weight; partial length1 length2 length3; quit; The results indicate that the partial relationship between height and weight is weaker than the unpartialled one; 0.46071 is less than 0.72869. However, both relationships are statistically relevant since both have p-values of <.0001. The smaller the p-value becomes, the more statistically relevant the variable is to what is being analyzed: Prediction ellipses are regions used to predict an observation based on values of the associated population. This particular code requests three prediction ellipses, each of which contains a specified percentage of the population, in this case, 85%, 75%, and 65%. Change the plots option to the following, and submit the code: proc corr data=sashelp.fish plots=scatter(ellipse=confidence alpha=.10 .05); var height weight; partial length1 length2 length3; quit; A confidence ellipse provides an estimate range for the population's mean associated with a level of confidence in that range. In this example, there are two ellipse ranges, one at a 90% confidence level and one at a 95% confidence level. If the relationships between variables are not linear, or there are a lot of outliers in the data being analyzed, the correlation coefficient might incorrectly estimate the strength of the relationship. Therefore, visualizing the data through these types of plots enables an analyst to verify the linear relationship and spot potential outliers. PROC UNIVARIATE Some of the output associated with PROC UNIVARIATE was seen in the simple statistics table in the output associated with the PROC CORR examples in the previous section. Type the following code in a SAS Studio program section and submit it: proc univariate data=sashelp.fish; quit; By running PROC UNIVARIATE on an entire table, the applicable variable within that data will have the descriptive statistics seen in Figure 4.7. An analyst can control which tables show up in the results by using certain Output Delivery System (ODS) statements along with procedures. ODS is another part of BASE SAS that helps produce output and graphics in a variety of different formats. For example, if an analyst is only interested in the extreme observations of all the variables within a table, they can limit the PROC UNIVARIATE output to only the extreme observations table. Type this code in a SAS Studio program section and submit it: title "Extreme Observations in SASHELP.FISH"; ods select ExtremeObs; proc univariate data=sashelp.fish; Quit; We learned how to perform descriptive analysis on SAS platform with the help of a practical use-case. If you found this post useful, do check out the book Big Data Analysis with SAS to leverage the capabilities of SAS for processing and analyzing Big Data.

0
0
3645

Ankit Dixit

26 Feb 2018

6 min read

What is ensemble learning?

Ankit Dixit

26 Feb 2018

6 min read

Ensemble learning explained There are many instances when a single machine learning model simply isn't good enough - you need to use multiple models. Here is what an ensemble framework looks like: In everyday life we use a form of ensemble learning for our daily decision making. For example, suppose you are applying to a university to enrol on an ensemble machine learning course. How do you decide whether it is the right choice or not? There are always multiple things that will inform your decision: Other students' reviews: Students may provide information such as whether this course is useful for improving skill sets, information about the course curriculum, the practical sessions, and so on. But as the students are not fully aware of the course's details (obviously, that's why they are taking that course) and also because they cannot suggest any other competitive course, you cannot rely on their suggestion solely. However, you know that their suggestions helped you in the past to choose previous courses. Let's say they are correct 60% of the time. Study counselors: You can get information regarding other competitive courses. They also know which universities have experts for the domain, so they can help you out to choose one course over other ML courses. Let's assume that these counselors are correct 60% of the time. Career counselors: Why do you want to take this course? Of course, for a better job, so a career counselor can tell you about the current requirements related to this skill set. They can tell you whether this course will help you enhance your career. Keep in mind that career counselors are right most of the time, say 40%. Social media: Yes it helps! Here, you can join many discussion forums, find many suggestions, and pros and cons of the course. You can get suggestions from a large audience and they can perform a very critical role in your decision. This can show you in which region of the country there is more demand for a certain course, or where it is not considered an important skill. It may be correct, say, 40% of the time. Placement officer: A placement officer is a person who takes care of job placements; so they better know which specific companies need employees for this domain. Also they will give you assurance of getting a decent job after completion of this course. And trust me, 70% of the time you will go with their review, because they will help you in getting a job! None of these things alone are going to help you come to a final answer. But by combining each of these different components, you’ll probably come to a decision. Let's take a look at how you can make a decision on whether you should go with this course or not. Let's quickly analyze the scenario. As all the experts are from an independent system, we can get a very high accuracy rate as follows: 1-60%*60%*40%*40%*70% 1-0.040 96% Can you see what we have got by the combined decision? I think it is more than you think. Can we further improve it? Yes we can; for that, we have to take suggestions from more sources, such as course faculties, a company's workers, and so on. The preceding example is based on an assumption that the suggestions from all the sources are independent. Well, in a practical scenario, this is not possible. If we are talking about the same domain, more or less there will be a correlation between the suggestions. Suppose we choose six sources but all are students of that course. Then we cannot reach the correct decision with high confidence; this is where the power of ensembles comes into the picture, where you have multiple predictions and you combine all of them to get a high-confidence prediction. Let's enter the world of ensembles. When to use ensemble learning There are many reasons to go for ensembles, as each model of the group is usually based on algorithms, some of which are very simple and less computation intensive, but some may be quite complex and more computation intensive. For any production environment, accuracy and computation time are equally important. A system with higher accuracy but one that is unimplementable for real-time applications is of no use. However, a simple algorithm may lack in accuracy and may not fit onto the data properly; in those cases, we have to make a compromise between accuracy and computation time. This compromise can be minimized if we use many weak learners to get a combined confidence index out of them, which may help us to implement such a system for real-time applications with very high accuracy. These are the main instanced when you should use ensemble learning: The dataset is too large or too small: When a dataset is too large and so it cannot be trained by a single model, we can create a small subset of data to train different models. At the end, we can choose the average of all as the final prediction. Similarly, when a dataset is too small to train a single model, we can use bootstrap methods to create random subsamples of data to train the models. Complex (nonlinear) data: Most of the time, a real-world dataset is a nonlinear dataset, where a single model cannot define the class boundary clearly. This is known as underfitting of the model. In such cases, we can use more than one model to train different subsets of the data and average out the result at the end to predict distinct boundaries. High confidence: When we train multiple classifiers on the training dataset and get mostly correlated output, it ensures a high prediction rate. Consider a case of classification where most of our classifiers predict the same class for an instance; in such cases, interprets ensemble system having high confidence on its decision. As with just about everything in the machine learning world, the key thing is to select the right model for the data you have at your disposal and the questions you're trying to answer. If you want to learn more about ensemble learning, explore it in depth in Ensemble Machine Learning from which this post has been taken. Find more machine learning eBooks and videos. Explore deep learning eBooks and videos.

0
0
3831

How-To Tutorials

article-image-how-to-execute-jobs-in-an-iterative-way-with-pentaho-data-integration-pdi

Vijin Boricha

26 Feb 2018

8 min read

How to execute jobs in an iterative way with Pentaho Data Integration (PDI)

Vijin Boricha

26 Feb 2018

8 min read

0
0
15839

article-image-analyzing-textual-data-using-the-nltk-library

Sugandha Lahoti

24 Feb 2018

16 min read

Analyzing Textual Data using the NLTK Library

Sugandha Lahoti

24 Feb 2018

16 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Armando Fandango titled Python Data Analysis - Second Edition. This book will help you learn to apply powerful data analysis techniques with popular open source Python modules. Code bundle for this article is hosted on GitHub.[/box] In this book excerpt, we will talk about various ways of performing text analytics using the NLTK Library. Natural Language Toolkit (NLTK) is one of the main libraries used for text analysis in Python. It comes with a collection of sample texts called corpora. Let's install the libraries required in this article with the following command: $ pip3 install nltk scikit-learn NLTK is a Python API for the analysis of texts written in natural languages, such as English. NLTK was created in 2001 and was originally intended as a teaching tool. Although we installed NLTK in the previous section, we are not done yet; we still need to download the NLTK corpora. The download is relatively large (about 1.8 GB); however, we only have to download it once. Unless you know exactly which corpora you require, it's best to download all the available corpora. Download the corpora from the Python shell as follows: $ python3 >>> import nltk >>> nltk.download() A GUI application should appear, where you can specify a destination and what file to download. If you are new to NLTK, it's most convenient to choose the default option and download everything. In this article, we will need the stopwords, movie reviews, names, and Gutenberg corpora. Readers are encouraged to follow the sections in the ch-09.ipynb file. Filtering out stopwords, names, and numbers Stopwords are common words that have very low information value in a text. It is a common practice in text analysis to get rid of stopwords. NLTK has a stopwords corpora for a number of languages. Load the English stopwords corpus and print some of the words: sw = set(nltk.corpus.stopwords.words('english')) print("Stop words:", list(sw)[:7]) The following common words are printed: Stop words: ['between', 'who', 'such', 'ourselves', 'an', 'ain', 'ours'] Note that all the words in this corpus are in lowercase. NLTK also has a Gutenberg corpus. The Gutenberg project is a digital library of books, mostly with expired copyright, which are available for free on the Internet (see http://www.gutenberg.org/). Load the Gutenberg corpus and print some of its filenames: gb = nltk.corpus.gutenberg print("Gutenberg files:n", gb.fileids()[-5:]) Some of the titles printed may be familiar to you: Gutenberg files: ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt'] Extract the first couple of sentences from the milton-paradise.txt file, which we will filter later: text_sent = gb.sents("milton-paradise.txt")[:2] print("Unfiltered:", text_sent) The following sentences are printed: Unfiltered [['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']'], ['Book', 'I']] Now, filter out the stopwords as follows: for sent in text_sent: filtered = [w for w in sent if w.lower() not in sw] print("Filtered:n", filtered) For the first sentence, we get the following output: Filtered ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']'] If we compare this with the previous snippet, we notice that the word by has been filtered out as it was found in the stopwords corpus. Sometimes, we want to remove numbers and names too. We can remove words based on part of speech (POS) tags. In this tagging scheme, numbers correspond to the cardinal number (CD) tag. Names correspond to the proper noun singular (NNP) tag. Tagging is an inexact process based on heuristics. It's a big topic that deserves an entire book. Tag the filtered text with the pos_tag() function: tagged = nltk.pos_tag(filtered) print("Tagged:n", tagged) For our text, we get the following tags: Tagged [('[', 'NN'), ('Paradise', 'NNP'), ('Lost', 'NNP'), ('John', 'NNP'), ('Milton', 'NNP'), ('1667', 'CD'), (']', 'CD')] The pos_tag() function returns a list of tuples, where the second element in each tuple is the tag. As you can see, some of the words are tagged as NNP, although they probably shouldn't be. The heuristic here is to tag words as NNP if the first character of a word is uppercase. If we set all the words to be lowercase, we will get a different result. This is left as an exercise for the reader. It's easy to remove the words in the list with the NNP and CD tags, as described in the following code: words= [] for word in tagged: if word[1] != 'NNP' and word[1] != 'CD': words.append(word[0]) print(words) Have a look at the ch-09.ipynb file in the book’s code bundle: import nltk sw = set(nltk.corpus.stopwords.words('english')) print(“Stop words:", list(sw)[:7]) gb = nltk.corpus.gutenberg print(“Gutenberg files:n", gb.fileids()[-5:]) text_sent = gb.sents("milton-paradise.txt")[:2] print(“Unfiltered:", text_sent) for sent in text_sent: filtered = [w for w in sent if w.lower() not in sw] print("Filtered:n", filtered) tagged = nltk.pos_tag(filtered) print("Tagged:n", tagged) words= [] for word in tagged: if word[1] != 'NNP' and word[1] != 'CD': words.append(word[0]) print(“Words:n",words) The bag-of-words model In the bag-of-words model, we create from a document a bag containing words found in the document. In this model, we don't care about the word order. For each word in the document, we count the number of occurrences. With these word counts, we can do statistical analysis, for instance, to identify spam in e-mail messages. If we have a group of documents, we can view each unique word in the corpus as a feature; here, feature means parameter or variable. Using all the word counts, we can build a feature vector for each document; vector is used here in the mathematical sense. If a word is present in the corpus but not in the document, the value of this feature will be 0. Surprisingly, NLTK doesn't currently have a handy utility to create a feature vector. However, the machine learning Python library, scikit-learn, does have a CountVectorizer class that we can use. Load two text documents from the NLTK Gutenberg corpus: hamlet = gb.raw("shakespeare-hamlet.txt") macbeth = gb.raw("shakespeare-macbeth.txt") Create the feature vector by omitting English stopwords: cv = sk.feature_extraction.text.CountVectorizer(stop_words='english') print("Feature vector:n", cv.fit_transform([hamlet, macbeth]).toarray()) These are the feature vectors for the two documents: Feature vector: [[ 1 0 1 ..., 14 0 1] [ 0 1 0 ..., 1 1 0]] Print a small selection of the features (unique words) that we found: print("Features:n", cv.get_feature_names()[:5]) The features are given in alphabetical order: Features: ['1599', '1603', 'abhominably', 'abhorred', 'abide'] Have a look at the ch-09.ipynb file in this book’s code bundle: import nltk import sklearn as sk hamlet = gb.raw("shakespeare-hamlet.txt") macbeth = gb.raw("shakespeare-macbeth.txt") cv = sk.feature_extraction.text.CountVectorizer(stop_words='english') print(“Feature vector:n”, cv.fit_transform([hamlet, macbeth]).toarray()) print("Features:n", cv.get_feature_names()[:5]) Analyzing word frequencies The NLTK FreqDist class encapsulates a dictionary of words and counts for a given list of words. Load the Gutenberg text of Julius Caesar by William Shakespeare. Let's filter out the stopwords and punctuation: punctuation = set(string.punctuation) filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation] Create a FreqDist object and print the associated keys and values with the highest frequency: fd = nltk.FreqDist(filtered) print("Words", fd.keys()[:5]) print("Counts", fd.values()[:5]) The keys and values are printed as follows: Words ['d', 'caesar', 'brutus', 'bru', 'haue'] Counts [215, 190, 161, 153, 148] The first word in this list is, of course, not an English word, so we may need to add the heuristic that words have a minimum of two characters. The NLTK FreqDist class allows dictionary-like access, but it also has convenience methods. Get the word with the highest frequency and related count: print("Max", fd.max()) print("Count", fd['d']) The following result shouldn't be a surprise: Max d Count 215 Up until this point, the analysis has focused on single words, but we can extend the analysis to word pairs and triplets. These are also called bigrams and trigrams. We can find them with the bigrams() and trigrams() functions. Repeat the analysis, but this time for bigrams: fd = nltk.FreqDist(nltk.bigrams(filtered)) print("Bigrams", fd.keys()[:5]) print("Counts", fd.values()[:5]) print("Bigram Max", fd.max()) print("Bigram count", fd[('let', 'vs')]) The following output should be printed: Bigrams [('let', 'vs'), ('wee', 'l'), ('mark', 'antony'), ('marke', 'antony'), ('st', 'thou')] Counts [16, 15, 13, 12, 12] Bigram Max ('let', 'vs') Bigram count 16 Have a peek at the ch-09.ipynb file in this book's code bundle: import nltk import string gb = nltk.corpus.gutenberg words = gb.words("shakespeare-caesar.txt") sw = set(nltk.corpus.stopwords.words('english')) punctuation = set(string.punctuation) filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation] fd = nltk.FreqDist(filtered) print("Words", fd.keys()[:5]) print("Counts", fd.values()[:5]) print("Max", fd.max()) print("Count", fd['d']) fd = nltk.FreqDist(nltk.bigrams(filtered)) print("Bigrams", fd.keys()[:5]) print("Counts", fd.values()[:5]) print("Bigram Max", fd.max()) print("Bigram count", fd[('let', 'vs')]) Naive Bayes classification Classification algorithms are a type of machine learning algorithm that determine the class (category or type) of a given item. For instance, we could try to determine the genre of a movie based on some features. In this case, the genre is the class to be predicted. In this section, we will discuss a popular algorithm called Naive Bayes classification, which is frequently used to analyze text documents. Naive Bayes classification is a probabilistic algorithm based on the Bayes theorem from probability theory and statistics. The Bayes theorem formulates how to discount the probability of an event based on new evidence. For example, imagine that we have a bag with pieces of chocolate and other items we can't see. We will call the probability of drawing a piece of dark chocolate P(D). We will denote the probability of drawing a piece of chocolate as P(C). Of course, the total probability is always 1, so P(D) and P(C) can be at most 1. The Bayes theorem states that the posterior probability is proportional to the prior probability times likelihood: P(D|C) in the preceding notation means the probability of event D given C. When we haven't drawn any items, P(D) = 0.5 because we don't have any information yet. To actually apply the formula, we need to know P(C|D) and P(C), or we have to determine those indirectly. Naive Bayes classification is called naive because it makes the simplifying assumption of independence between features. In practice, the results are usually pretty good, so this assumption is often warranted to a certain level. Recently, it was found that there are theoretical reasons why the assumption makes sense. However, since machine learning is a rapidly evolving field, algorithms have been invented with (slightly) better performance. Let's try to classify words as stopwords or punctuation. As a feature, we will use the word length, since stopwords and punctuation tend to be short. This setup leads us to define the following functions: def word_features(word): return {'len': len(word)} def isStopword(word): return word in sw or word in punctuation Label the words in the Gutenberg shakespeare-caesar.txt based on whether or not they are stopwords: labeled_words = ([(word.lower(), isStopword(word.lower())) for word in words]) random.seed(42) random.shuffle(labeled_words) print(labeled_words[:5]) The 5 labeled words will appear as follows: [('was', True), ('greeke', False), ('cause', False), ('but', True), ('house', False)] For each word, determine its length: featuresets = [(word_features(n), word) for (n, word) in labeled_words] We will train a naive Bayes classifier on 90 percent of the words and test the remaining 10 percent. Create the train and the test set, and train the data: cutoff = int(.9 * len(featuresets)) train_set, test_set = featuresets[:cutoff], featuresets[cutoff:] classifier = nltk.NaiveBayesClassifier.train(train_set) We can now check how the classifier labels the words in the sets: classifier = nltk.NaiveBayesClassifier.train(train_set) print("'behold' class", classifier.classify(word_features('behold'))) print("'the' class", classifier.classify(word_features('the'))) Fortunately, the words are properly classified: 'behold' class False 'the' class True Determine the classifier accuracy on the test set as follows: print("Accuracy", nltk.classify.accuracy(classifier, test_set)) We get a high accuracy for this classifier of around 85 percent. Print an overview of the most informative features: print(classifier.show_most_informative_features(5)) The overview shows the word lengths that are most useful for the classification process: The code is in the ch-09.ipynb file in this book's code bundle: import nltk import string import random sw = set(nltk.corpus.stopwords.words('english')) punctuation = set(string.punctuation) def word_features(word): return {'len': len(word)} def isStopword(word): return word in sw or word in punctuation gb = nltk.corpus.gutenberg words = gb.words("shakespeare-caesar.txt") labeled_words = ([(word.lower(), isStopword(word.lower())) for word in words]) random.seed(42) random.shuffle(labeled_words) print(labeled_words[:5]) featuresets = [(word_features(n), word) for (n, word) in labeled_words] cutoff = int(.9 * len(featuresets)) train_set, test_set = featuresets[:cutoff], featuresets[cutoff:] classifier = nltk.NaiveBayesClassifier.train(train_set) print("'behold' class", classifier.classify(word_features('behold'))) print("'the' class", classifier.classify(word_features('the'))) print("Accuracy", nltk.classify.accuracy(classifier, test_set)) print(classifier.show_most_informative_features(5)) Sentiment analysis Opinion mining or sentiment analysis is a hot new research field dedicated to the automatic evaluation of opinions as expressed on social media, product review websites, or other forums. Often, we want to know whether an opinion is positive, neutral, or negative. This is, of course, a form of classification, as seen in the previous section. As such, we can apply any number of classification algorithms. Another approach is to semi-automatically (with some manual editing) compose a list of words with an associated numerical sentiment score (the word “good” can have a score of 5 and the word “bad” a score of -5). If we have such a list, we can look up all the words in a text document and, for example, sum up all the found sentiment scores. The number of classes can be more than three, as in a five-star rating scheme. We will apply naive Bayes classification to the NLTK movie reviews corpus with the goal of classifying movie reviews as either positive or negative. First, we will load the corpus and filter out stopwords and punctuation. These steps will be omitted, since we have performed them before. You may consider more elaborate filtering schemes, but keep in mind that excessive filtering may hurt accuracy. Label the movie reviews documents using the categories() method: labeled_docs = [(list(movie_reviews.words(fid)), cat) for cat in movie_reviews.categories() for fid in movie_reviews.fileids(cat)] The complete corpus has tens of thousands of unique words that we can use as features. However, using all these words might be inefficient. Select the top 5 percent of the most frequent words: words = FreqDist(filtered) N = int(.05 * len(words.keys())) word_features = words.keys()[:N] For each document, we can extract features using a number of methods, including the following: Check whether the given document has a word or not Determine the number of occurrences of a word for a given document Normalize word counts so that the maximum normalized word count will be less than or equal to 1 Take the logarithm of counts plus 1 (to avoid taking the logarithm of zero) Combine all the previous points into one metric As the saying goes, all roads lead to Rome. Of course, some roads are safer and will bring you to Rome faster. Define the following function, which uses raw word counts as a metric: def doc_features(doc): doc_words = FreqDist(w for w in doc if not isStopWord(w)) features = {} for word in word_features: features['count (%s)' % word] = (doc_words.get(word, 0)) return features We can now train our classifier just as we did in the previous example. An accuracy of 78 percent is reached, which is decent and comes close to what is possible with sentiment analysis. Research has found that even humans don't always agree on the sentiment of a given document (see http://mashable.com/2010/04/19/sentiment-analysis/), and therefore, we can't have a 100 percent perfect accuracy with sentiment analysis software. The most informative features are printed as follows: If we go through this list, we find obvious positive words such as “wonderful” and “outstanding”. The words “bad”, “stupid”, and “boring” are the obvious negative words. It would be interesting to analyze the remaining features. This is left as an exercise for the reader. Refer to the sentiment.py file in this book's code bundle: import random from nltk.corpus import movie_reviews from nltk.corpus import stopwords from nltk import FreqDist from nltk import NaiveBayesClassifier from nltk.classify import accuracy import string labeled_docs = [(list(movie_reviews.words(fid)), cat) for cat in movie_reviews.categories() for fid in movie_reviews.fileids(cat)] random.seed(42) random.shuffle(labeled_docs) review_words = movie_reviews.words() print("# Review Words", len(review_words)) sw = set(stopwords.words('english')) punctuation = set(string.punctuation) def isStopWord(word): return word in sw or word in punctuation filtered = [w.lower() for w in review_words if not isStopWord(w.lower())] print("# After filter", len(filtered)) words = FreqDist(filtered) N = int(.05 * len(words.keys())) word_features = words.keys()[:N] def doc_features(doc): doc_words = FreqDist(w for w in doc if not isStopWord(w)) features = {} for word in word_features: features['count (%s)' % word] = (doc_words.get(word, 0)) return features featuresets = [(doc_features(d), c) for (d,c) in labeled_docs] train_set, test_set = featuresets[200:], featuresets[:200] classifier = NaiveBayesClassifier.train(train_set) print("Accuracy", accuracy(classifier, test_set)) print(classifier.show_most_informative_features()) We covered textual analysis and learned that it's a best practice to get rid of stopwords. In the bag-of-words model, we used a document to create a bag containing words found in that same document. We learned how to build a feature vector for each document using all the word counts. Classification algorithms are a type of machine learning algorithm, which involve determining the class of a given item. Naive Bayes classification is a probabilistic algorithm based on the Bayes theorem from probability theory and statistics. The Bayes theorem states that the posterior probability is proportional to the prior probability multiplied by the likelihood. If you liked this post, check out the book Python Data Analysis - Second Edition to know more about analyzing other forms of textual data and social media analysis.

0
0
4709

article-image-fat-2018-conference-session-5-summary-fat-recommenders-etc

Savia Lobo

24 Feb 2018

6 min read

FAT* 2018 Conference Session 5 Summary on FAT Recommenders, Etc.

Savia Lobo

24 Feb 2018

6 min read

This session of FAT 2018 is about Recommenders, etc. Recommender systems are algorithmic tools for identifying items of interest to users. They are usually deployed to help mitigate information overload. Internet-scale item spaces offer many more choices than humans can process, diminishing the quality of their decision-making abilities. Recommender systems alleviate this problem by allowing users to more quickly focus on items likely to match their particular tastes. They are deployed across the modern Internet, suggesting products in e-commerce sites, movies and music in streaming media platforms, new connections on social networks, and many more types of items. This session explains what Fairness, Accountability, and Transparency means in the context of recommendation. The session also includes a paper that talks about predictive policing, which is defined as ‘Given historical crime incident data for a collection of regions, decide how to allocate patrol officers to areas to detect crime.’ The Conference on Fairness, Accountability, and Transparency (FAT), which would be held on the 23rd and 24th of February, 2018 is a multi-disciplinary conference that brings together researchers and practitioners interested in fairness, accountability, and transparency in socio-technical systems. The FAT 2018 conference will witness 17 research papers, 6 tutorials, and 2 keynote presentations from leading experts in the field. This article covers research papers pertaining to the 5th session that is dedicated to FAT Recommenders, etc. Paper 1: Runaway Feedback Loops in Predictive Policing Predictive policing systems are increasingly being used to determine how to allocate police across a city in order to best prevent crime. To update the model, discovered crime data (e.g., arrest counts) are used. Such systems have been empirically shown to be susceptible to runaway feedback loops, where police are repeatedly sent back to the same neighborhoods regardless of the true crime rate. This paper is in response to this system, where the authors have developed a mathematical model of predictive policing that proves why this feedback loop occurs.The paper also empirically shows how this model exhibits such problems, and demonstrates ways to change the inputs to a predictive policing system (in a black-box manner) so the runaway feedback loop does not occur, allowing the true crime rate to be learned. Key takeaways: The results stated in the paper establish a link between the degree to which runaway feedback causes problems and the disparity in crime rates between areas. The paper also demonstrates ways in which reported incidents of crime (reported by residents) and discovered incidents of crime (directly observed by police officers dispatched as a result of the predictive policing algorithm) interact. In this paper, the authors have used the theory of urns (a common framework in reinforcement learning) to analyze existing methods for predictive policing. There are formal as well as empirical results which shows why these methods will not work. Subsequently, the authors have also provided remedies that can be used directly with these methods in a black-box fashion that improve their behavior, and provide theoretical justification for these remedies. Paper 2: All The Cool Kids, How Do They Fit In? Popularity and Demographic Biases in Recommender Evaluation and Effectiveness There have been many advances in the information retrieval evaluation, which demonstrate the importance of considering the distribution of effectiveness across diverse groups of varying sizes. This paper addresses this question, ‘do users of different ages or genders obtain similar utility from the system, particularly if their group is a relatively small subset of the user base?’ The authors have applied this consideration to recommender systems, using offline evaluation and a utility-based metric of recommendation effectiveness to explore whether different user demographic groups experience similar recommendation accuracy. The paper shows that there are demographic differences in measured recommender effectiveness across two data sets containing different types of feedback in different domains; these differences sometimes, but not always, correlate with the size of the user group in question. Demographic effects also have a complex— and likely detrimental—interaction with popularity bias, a known deficiency of recommender evaluation. Key takeaways: The paper presents an empirical analysis of the effectiveness of collaborative filtering recommendation strategies, stratified by the gender and age of the users in the data set. The authors applied widely-used recommendation techniques across two domains, musical artists and movies, using publicly-available data. The paper explains whether recommender systems produced equal utility for users of different demographic groups. The authors made use of publicly available datasets, they compared the utility, as measured with nDCG, for users grouped by age and gender. Regardless of the recommender strategy considered, they found significant differences for the nDCG among demographic groups. Paper 3: Recommendation Independence In this paper the authors have showcased new methods that can deal with variance of recommendation outcomes without increasing the computational complexity. These methods can more strictly remove the sensitive information, and experimental results demonstrate that the new algorithms can more effectively eliminate the factors that undermine fairness. Additionally, the paper also explores potential applications for independence enhanced recommendation, and discuss its relation to other concepts, such as recommendation diversity. Key takeaways from the paper: The authors have developed new independence-enhanced recommendation models that can deal with the second moment of distributions without sacrificing computational efficiency. The paper also explores applications in which recommendation independence would be useful, and reveal the relation of independence to the other concepts in recommendation research. It also presents the concept of recommendation independence, and discuss how the concept would be useful for solving real-world problems. Paper 4: Balanced Neighborhoods for Multi-sided Fairness in Recommendation In this paper, the authors examine two different cases of fairness-aware recommender systems: consumer-centered and provider-centered. The paper explores the concept of a balanced neighborhood as a mechanism to preserve personalization in recommendation while enhancing the fairness of recommendation outcomes. It shows that a modified version of the Sparse Linear Method (SLIM) can be used to improve the balance of user and item neighborhoods, with the result of achieving greater outcome fairness in real-world datasets with minimal loss in ranking performance. Key takeaways: In this paper, the authors examine applications in which fairness with respect to consumers and to item providers is important. They have shown that variants of the well-known sparse linear method (SLIM) can be used to negotiate the tradeoff between fairness and accuracy. This paper also introduces the concept of multisided fairness, relevant in multisided platforms that serve a matchmaking function. It demonstrates that the concept of balanced neighborhoods in conjunction with the well-known sparse linear method can be used to balance personalization with fairness considerations. If you’ve missed our summaries on the previous sessions, visit the article links to be on track. Session 1: Online Discrimination and Privacy Session 2: Interpretability and Explainability Session 3: Fairness in Computer Vision and NLP Session 4: Fair Classification

0
0
1678

article-image-getting-started-with-pentaho-data-integration-and-pentaho-bi-suite

Vijin Boricha

24 Feb 2018

9 min read

Getting Started with Pentaho Data Integration and Pentaho BI Suite

Vijin Boricha

24 Feb 2018

9 min read

0
0
3936

article-image-how-to-implement-dynamic-sql-in-postgresql-10

Amey Varangaonkar

23 Feb 2018

7 min read

How to implement Dynamic SQL in PostgreSQL 10

Amey Varangaonkar

23 Feb 2018

7 min read

In this PostgreSQL tutorial, we'll take a close look at the concept of dynamic SQL, and how it can make the life of database programmers easy by allowing efficient querying of data. This tutorial has been taken from the second edition of Learning PostgreSQL 10. You can read more here. Dynamic SQL is used to reduce repetitive tasks when it comes to querying. For example, one could use dynamic SQL to create table partitioning for a certain table on a daily basis, to add missing indexes on all foreign keys, or add data auditing capabilities to a certain table without major coding effects. Another important use of dynamic SQL is to overcome the side effects of PL/pgSQL caching, as queries executed using the EXECUTE statement are not cached. Dynamic SQL is achieved via the EXECUTE statement. The EXECUTE statement accepts a string and simply evaluates it. The synopsis to execute a statement is given as follows: EXECUTE command-string [ INTO [STRICT] target ] [ USING expression [, ...] ]; Executing DDL statements in dynamic SQL In some cases, one needs to perform operations at the database object level, such as tables, indexes, columns, roles, and so on. For example, a database developer would like to vacuum and analyze a specific schema object, which is a common task after the deployment in order to update the statistics. For example, to analyze the car_portal_app schema tables, one could write the following script: DO $$ DECLARE table_name text; BEGIN FOR table_name IN SELECT tablename FROM pg_tables WHERE schemaname ='car_portal_app' LOOP RAISE NOTICE 'Analyzing %', table_name; EXECUTE 'ANALYZE car_portal_app.' || table_name; END LOOP; END; $$; Executing DML statements in dynamic SQL Some applications might interact with data in an interactive manner. For example, one might have billing data generated on a monthly basis. Also, some applications filter data on different criteria defined by the user. In such cases, dynamic SQL is very convenient. For example, in the car portal application, the search functionality is needed to get accounts using the dynamic predicate, as follows: CREATE OR REPLACE FUNCTION car_portal_app.get_account (predicate TEXT) RETURNS SETOF car_portal_app.account AS $$ BEGIN RETURN QUERY EXECUTE 'SELECT * FROM car_portal_app.account WHERE ' || predicate; END; $$ LANGUAGE plpgsql; To test the previous function: car_portal=> SELECT * FROM car_portal_app.get_account ('true') limit 1; account_id | first_name | last_name | email | password ------------+------------+-----------+-----------------+------------------- --------------- 1 | James | Butt | [email protected] | 1b9ef408e82e38346e6ebebf2dcc5ece (1 row) car_portal=> SELECT * FROM car_portal_app.get_account (E'first_name='James''); account_id | first_name | last_name | email | password ------------+------------+-----------+-----------------+------------------- --------------- 1 | James | Butt | [email protected] | 1b9ef408e82e38346e6ebebf2dcc5ece (1 row) Dynamic SQL and the caching effect As mentioned earlier, PL/pgSQL caches execution plans. This is quite good if the generated plan is expected to be static. For example, the following statement is expected to use an index scan because of selectivity. In this case, caching the plan saves some time and thus increases performance: SELECT * FROM account WHERE account_id =<INT> In other scenarios, however, this is not true. For example, let's assume we have an index on the advertisement_date column and we would like to get the number of advertisements since a certain date, as follows: SELECT count (*) FROM car_portal_app.advertisement WHERE advertisement_date >= <certain_date>; In the preceding query, the entries from the advertisement table can be fetched from the hard disk either by using the index scan or using the sequential scan based on selectivity, which depends on the provided certain_date value. Caching the execution plan of such a query will cause serious problems; thus, writing the function as follows is not a good idea: CREATE OR REPLACE FUNCTION car_portal_app.get_advertisement_count (some_date timestamptz ) RETURNS BIGINT AS $$ BEGIN RETURN (SELECT count (*) FROM car_portal_app.advertisement WHERE advertisement_date >=some_date)::bigint; END; $$ LANGUAGE plpgsql; To solve the caching issue, one could rewrite the previous function either using the SQL language function or by using the PL/pgSQL execute command, as follows: CREATE OR REPLACE FUNCTION car_portal_app.get_advertisement_count (some_date timestamptz ) RETURNS BIGINT AS $$ DECLARE count BIGINT; BEGIN EXECUTE 'SELECT count (*) FROM car_portal_app.advertisement WHERE advertisement_date >= $1' USING some_date INTO count; RETURN count; END; $$ LANGUAGE plpgsql; Recommended practices for dynamic SQL usage Dynamic SQL can cause security issues if not handled carefully; dynamic SQL is vulnerable to the SQL injection technique. SQL injection is used to execute SQL statements that reveal secure information, or even to destroy data in a database. A very simple example of a PL/pgSQL function vulnerable to SQL injection is as follows: CREATE OR REPLACE FUNCTION car_portal_app.can_login (email text, pass text) RETURNS BOOLEAN AS $$ DECLARE stmt TEXT; result bool; BEGIN stmt = E'SELECT COALESCE (count(*)=1, false) FROM car_portal_app.account WHERE email = ''|| $1 || E'' and password = ''||$2||E'''; RAISE NOTICE '%' , stmt; EXECUTE stmt INTO result; RETURN result; END; $$ LANGUAGE plpgsql; The preceding function returns true if the email and the password match. To test this function, let's insert a row and try to inject some code, as follows: car_portal=> SELECT car_portal_app.can_login('[email protected]', md5('[email protected]')); NOTICE: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = '[email protected]' and password = '1b9ef408e82e38346e6ebebf2dcc5ece' Can_login ----------- t (1 row) car_portal=> SELECT car_portal_app.can_login('[email protected]', md5('[email protected]')); NOTICE: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = '[email protected]' and password = '37eb43e4d439589d274b6f921b1e4a0d' can_login ----------- f (1 row) car_portal=> SELECT car_portal_app.can_login(E'[email protected]'--', 'Do not know password'); NOTICE: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = '[email protected]'--' and password = 'Do not know password' can_login ----------- t (1 row) Notice that the function returns true even when the password does not match the password stored in the table. This is simply because the predicate was commented, as shown by the raise notice: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = '[email protected]'--' and password = 'Do not know password' To protect code against this technique, one could follow these practices: For parameterized dynamic SQL statements, use the USING clause. Use the format function with appropriate interpolation to construct your queries. Note that %I escapes the argument as an identifier and %L as a literal. Use quote_ident(), quote_literal(), and quote_nullable() to properly format your identifiers and literal. One way to write the preceding function is as follows: CREATE OR REPLACE FUNCTION car_portal_app.can_login (email text, pass text) RETURNS BOOLEAN AS $$ DECLARE stmt TEXT; result bool; BEGIN stmt = format('SELECT COALESCE (count(*)=1, false) FROM car_portal_app.account WHERE email = %Land password = %L', $1,$2); RAISE NOTICE '%' , stmt; EXECUTE stmt INTO result; RETURN result; END; $$ LANGUAGE plpgsql; We saw how dynamically SQL is used to build and execute queries on the fly. Unlike the static SQL statement, a dynamic SQL statements’ full text is unknown and can change between successive executions. These queries can be DDL, DCL, and/or DML statements. If you found this article useful, make sure to check out the book Learning PostgreSQL 10, to learn the fundamentals of PostgreSQL 10.

0
1
56401

article-image-session-4-fair-classification

Sugandha Lahoti

23 Feb 2018

7 min read

FAT Conference 2018 Session 4: Fair Classification

Sugandha Lahoti

23 Feb 2018

7 min read

As algorithms are increasingly used to make decisions of social consequence, the social values encoded in these decision-making procedures are the subject of increasing study, with fairness being a chief concern. The Conference on Fairness, Accountability, and Transparency (FAT) scheduled on Feb 23 and 24 this year in New York is an annual conference dedicated to bringing theory and practice of fair and interpretable Machine Learning, Information Retrieval, NLP, Computer Vision, Recommender systems, and other technical disciplines. This year's program includes 17 peer-reviewed papers and 6 tutorials from leading experts in the field. The conference will have three sessions. Session 4 of the two-day conference on Saturday, February 24, is in the field of fair classification. In this article, we give our readers a peek into the four papers that have been selected for presentation in Session 4. You can also check out Session 1, Session 2, and Session 3 summaries in case you’ve missed them. The cost of fairness in binary classification What is the paper about? This paper provides a simple approach to the Fairness-aware problem which involves suitably thresholding class-probability estimates. It has been awarded Best paper in Technical contribution category. The authors have studied the inherent tradeoffs in learning classifiers with a fairness constraint in the form of two questions: What is the best accuracy we can expect for a given level of fairness? What is the nature of these optimal fairness aware classifiers? The authors showed that for cost-sensitive approximate fairness measures, the optimal classifier is an instance-dependent thresholding of the class probability function. They have quantified the degradation in performance by a measure of alignment of the target and sensitive variable. This analysis is then used to derive a simple plugin approach for the fairness problem. Key takeaways For Fairness-aware learning, the authors have designed an algorithm targeting a particular measure of fairness. They have reduced two popular fairness measures (disparate impact and mean difference) to cost-sensitive risks. They show that for cost-sensitive fairness measures, the optimal Fairness-aware classifier is an instance-dependent thresholding of the class-probability function. They quantify the intrinsic, method independent impact of the fairness requirement on accuracy via a notion of alignment between the target and sensitive feature. The ability to theoretically compute the tradeoffs between fairness and utility is perhaps the most interesting aspect of their technical results. They have stressed that the tradeoff is intrinsic to the underlying data. That is, any fairness or unfairness, is a property of the data, not of any particular technique. They have theoretically computed what price one has to pay (in utility) in order to achieve a desired degree of fairness: in other words, they have computed the cost of fairness. Decoupled Classifiers for Group-Fair and Efficient Machine Learning What is the paper about? This paper considers how to use a sensitive attribute such as gender or race to maximize fairness and accuracy, assuming that it is legal and ethical. Simple linear classifiers may use the raw data, upweight/oversample data from minority groups, or employ advanced approaches to fitting linear classifiers that aim to be accurate and fair. However, an inherent tradeoff between accuracy on one group and accuracy on another still prevails. This paper defines and explores decoupled classification systems, in which a separate classifier is trained on each group. The authors present experiments on 47 datasets. The experiments are “semi-synthetic” in the sense that the first binary feature was used as a substitute sensitive feature. The authors found that on many data sets the decoupling algorithm improves performance while less often decreasing performance. Key takeaways The paper describes a simple technical approach for a practitioner using ML to incorporate sensitive attributes. This approach avoids unnecessary accuracy tradeoffs between groups and can accommodate an application-specific objective, generalizing the standard ML notion of loss. For a certain family of “weakly monotonic” fairness objectives, the authors provide a black-box reduction that can use any off-the-shelf classifier to efficiently optimize the objective. This work requires the application designer to pin down a specific loss function that trades off accuracy for fairness. Experiments demonstrate that decoupling can reduce the loss on some datasets for some potentially sensitive features A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions What is the paper about? The work is based on the use of predictive analytics in the area of child welfare. It won the best paper award in the Technical and Interdisciplinary Contribution. The authors have worked on developing, validating, fairness auditing, and deploying a risk prediction model in Allegheny County, PA, USA. The authors have described competing models that are being developed in the Allegheny County as part of an ongoing redesign process in comparison to the previous models. Next, they investigate the predictive bias properties of the current tool and a Random forest model that has emerged as one of the best performing competing models. Their predictive bias assessment is motivated both by considerations of human bias and recent work on fairness criteria. They then discuss some of the challenges in incorporating algorithms into human decision-making processes and reflect on the predictive bias analysis in the context of how the model is actually being used. They also propose an “oracle test” as a tool for clarifying whether particular concerns pertain to the statistical properties of a model or if these concerns are targeted at other potential deficiencies. Key takeaways The goal in Allegheny County is to improve both the accuracy and equity of screening decisions by taking a Fairness-aware approach to incorporating prediction models into the decision-making pipeline. The paper reports on the lessons learned so far by the authors, their approaches to predictive bias assessment, and several outstanding challenges in the child maltreatment hotline context. This report contributes to the ongoing conversation concerning the use of algorithms in supporting critical decisions in government—and the importance of considering fairness and discrimination in data-driven decision making. The paper discussion and general analytic approach are also broadly applicable to other domains where predictive risk modeling may be used. Fairness in Machine Learning: Lessons from Political Philosophy What is the paper about? Plenty of moral and political philosophers have expended significant efforts in formalizing and defending the central concepts of discrimination, egalitarianism, and justice. Thus it is unsurprising to know that the attempts to formalize ‘fairness’ in machine learning contain echoes of these old philosophical debates. This paper draws on existing work in moral and political philosophy in order to elucidate emerging debates about fair machine learning. It answers the following questions: What does it mean for a machine learning model to be ‘fair’, in terms which can be operationalized? Should fairness consist of ensuring everyone has an equal probability of obtaining some benefit, or should we aim instead to minimize the harms to the least advantaged? Can the relevant ideal be determined by reference to some alternative state of affairs in which a particular social pattern of discrimination does not exist? Key takeaways This paper aims to provide an overview of some of the relevant philosophical literature on discrimination, fairness, and egalitarianism in order to clarify and situate the emerging debate within fair machine learning literature. The author addresses the conceptual distinctions drawn between terms frequently used in the fair ML literature–including ‘discrimination’ and ‘fairness’–and the use of related terms in the philosophical literature. He suggests that ‘fairness’ as used in the fair machine learning community is best understood as a placeholder term for a variety of normative egalitarian considerations. He also provides an overview of implications for the incorporation of ‘fairness’ into algorithmic decision-making systems. We hope you like the coverage of Session 4. Don’t miss our coverage on Session 5 on Fat recommenders and more.

0
0
2047

article-image-working-with-pandas-dataframes

Sugandha Lahoti

23 Feb 2018

15 min read

Working with pandas DataFrames

Sugandha Lahoti

23 Feb 2018

15 min read

[box type="note" align="" class="" width=""]This article is an excerpt from the book Python Data Analysis - Second Edition written by Armando Fandango. From this book, you will learn how to process and manipulate data with Python for complex data analysis and modeling. Code bundle for this article is hosted on GitHub.[/box] The popular open source Python library, pandas is named after panel data (an econometric term) and Python data analysis. We shall learn about basic panda functionalities, data structures, and operations in this article. The official pandas documentation insists on naming the project pandas in all lowercase letters. The other convention the pandas project insists on, is the import pandas as pd import statement. We will follow these conventions in this text. In this tutorial, we will install and explore pandas. We will also acquaint ourselves with the a central pandas data structure–DataFrame. Installing and exploring pandas The minimal dependency set requirements for pandas is given as follows: NumPy: This is the fundamental numerical array package that we installed and covered extensively in the preceding chapters python-dateutil: This is a date handling library pytz: This handles time zone definitions This list is the bare minimum; a longer list of optional dependencies can be located at http://pandas.pydata.org/pandas-docs/stable/install.html. We can install pandas via PyPI with pip or easy_install, using a binary installer, with the aid of our operating system package manager, or from the source by checking out the code. The binary installers can be downloaded from http://pandas.pydata.org/getpandas.html. The command to install pandas with pip is as follows: $ pip3 install pandas rpy2 rpy2 is an interface to R and is required because rpy is being deprecated. You may have to prepend the preceding command with sudo if your user account doesn't have sufficient rights. The pandas DataFrames A pandas DataFrame is a labeled two-dimensional data structure and is similar in spirit to a worksheet in Google Sheets or Microsoft Excel, or a relational database table. The columns in pandas DataFrame can be of different types. A similar concept, by the way, was invented originally in the R programming language. (For more information, refer to http://www.r-tutor.com/r-introduction/data-frame). A DataFrame can be created in the following ways: Using another DataFrame. Using a NumPy array or a composite of arrays that has a two-dimensional shape. Likewise, we can create a DataFrame out of another pandas data structure called Series. We will learn about Series in the following section. A DataFrame can also be produced from a file, such as a CSV file. From a dictionary of one-dimensional structures, such as one-dimensional NumPy arrays, lists, dicts, or pandas Series. As an example, we will use data that can be retrieved from http://www.exploredata.net/Downloads/WHO-Data-Set. The original data file is quite large and has many columns, so we will use an edited file instead, which only contains the first nine columns and is called WHO_first9cols.csv; the file is in the code bundle of this book. These are the first two lines, including the header: Country,CountryID,Continent,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) totalAfghanistan,1,1,151,28,,,,26088 In the next steps, we will take a look at pandas DataFrames and its attributes: To kick off, load the data file into a DataFrame and print it on the screen: from pandas.io.parsers import read_csv df = read_csv("WHO_first9cols.csv") print("Dataframe", df) The printout is a summary of the DataFrame. It is too long to be displayed entirely, so we will just grab the last few lines: 199 21732.0 200 11696.0 201 13228.0 [202 rows x 9 columns] The DataFrame has an attribute that holds its shape as a tuple, similar to ndarray. Query the number of rows of a DataFrame as follows: print("Shape", df.shape) print("Length", len(df)) The values we obtain comply with the printout of the preceding step: Shape (202, 9) Length 202 Check the column header and data types with the other attributes: print("Column Headers", df.columns) print("Data types", df.dtypes) We receive the column headers in a special data structure: Column Headers Index([u'Country', u'CountryID', u'Continent', u'Adolescent fertility rate (%)', u'Adult literacy rate (%)', u'Gross national income per capita (PPP international $)', u'Net primary school enrolment ratio female (%)', u'Net primary school enrolment ratio male (%)', u'Population (in thousands) total'], dtype='object') The data types are printed as follows: 4. The pandas DataFrame has an index, which is like the primary key of relational database tables. We can either specify the index or have pandas create it automatically. The index can be accessed with a corresponding property, as follows: Print("Index", df.index) An index helps us search for items quickly, just like the index in this book. In our case, the index is a wrapper around an array starting at 0, with an increment of one for each row: Sometimes, we wish to iterate over the underlying data of a DataFrame. Iterating over column values can be inefficient if we utilize the pandas iterators. It's much better to extract the underlying NumPy arrays and work with those. The pandas DataFrame has an attribute that can aid with this as well: print("Values", df.values) Please note that some values are designated nan in the output, for 'not a number'. These values come from empty fields in the input datafile: The preceding code is available in Python Notebook ch-03.ipynb, available in the code bundle of this book. Querying data in pandas Since a pandas DataFrame is structured in a similar way to a relational database, we can view operations that read data from a DataFrame as a query. In this example, we will retrieve the annual sunspot data from Quandl. We can either use the Quandl API or download the data manually as a CSV file from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. If you want to install the API, you can do so by downloading installers from https://pypi.python.org/pypi/Quandl or by running the following command: $ pip3 install Quandl Using the API is free, but is limited to 50 API calls per day. If you require more API calls, you will have to request an authentication key. The code in this tutorial is not using a key. It should be simple to change the code to either use a key or read a downloaded CSV file. If you have difficulties, search through the Python docs at https://docs.python.org/2/. Without further preamble, let's take a look at how to query data in a pandas DataFrame: As a first step, we obviously have to download the data. After importing the Quandl API, get the data as follows: import quandl # Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual # PyPi url https://pypi.python.org/pypi/Quandl sunspots = quandl.get("SIDC/SUNSPOTS_A") The head() and tail() methods have a purpose similar to that of the Unix commands with the same name. Select the first n and last n records of a DataFrame, where n is an integer parameter: print("Head 2", sunspots.head(2) ) print("Tail 2", sunspots.tail(2)) This gives us the first two and last two rows of the sunspot data (for the sake of brevity we have not shown all the columns here; your output will have all the columns from the dataset): Head 2 Number Year 1700-12-31 5 1701-12-31 11 [2 rows x 1 columns] Tail 2 Number Year 2012-12-31 57.7 2013-12-31 64.9 [2 rows x 1 columns] Please note that we only have one column holding the number of sunspots per year. The dates are a part of the DataFrame index. The following is the query for the last value using the last date: last_date = sunspots.index[-1] print("Last value", sunspots.loc[last_date]) You can check the following output with the result from the previous step: Last value Number 64.9 Name: 2013-12-31 00:00:00, dtype: float64 Query the date with date strings in the YYYYMMDD format as follows: print("Values slice by date:n", sunspots["20020101": "20131231"]) This gives the records from 2002 through to 2013: Values slice by date Number Year 2002-12-31 104.0 [TRUNCATED] 2013-12-31 64.9 [12 rows x 1 columns] A list of indices can be used to query as well: print("Slice from a list of indices:n", sunspots.iloc[[2, 4, -4, -2]]) The preceding code selects the following rows: Slice from a list of indices Number Year 1702-12-31 16.0 1704-12-31 36.0 2010-12-31 16.0 2012-12-31 57.7 [4 rows x 1 columns] To select scalar values, we have two options. The second option given here should be faster. Two integers are required, the first for the row and the second for the column: print("Scalar with Iloc:", sunspots.iloc[0, 0]) print("Scalar with iat", sunspots.iat[1, 0]) This gives us the first and second values of the dataset as scalars: Scalar with Iloc 5.0 Scalar with iat 11.0 Querying with Booleans works much like the Where clause of SQL. The following code queries for values larger than the arithmetic mean. Note that there is a difference between when we perform the query on the whole DataFrame and when we perform it on a single column: print("Boolean selection", sunspots[sunspots > sunspots.mean()]) print("Boolean selection with column label:n", sunspots[sunspots['Number of Observations'] > sunspots['Number of Observations'].mean()]) The notable difference is that the first query yields all the rows, with some rows not conforming to the condition that has a value of NaN. The second query returns only the rows where the value is larger than the mean: Boolean selection Number Year 1700-12-31 NaN [TRUNCATED] 1759-12-31 54.0 ... [314 rows x 1 columns] Boolean selection with column label Number Year 1705-12-31 58.0 [TRUNCATED] 1870-12-31 139.1 ... [127 rows x 1 columns] The preceding example code is in the ch_03.ipynb file of this book's code bundle. Data aggregation with pandas DataFrames Data aggregation is a term used in the field of relational databases. In a database query, we can group data by the value in a column or columns. We can then perform various operations on each of these groups. The pandas DataFrame has similar capabilities. We will generate data held in a Python dict and then use this data to create a pandas DataFrame. We will then practice the pandas aggregation features: Seed the NumPy random generator to make sure that the generated data will not differ between repeated program runs. The data will have four columns: Weather (a string) Food (also a string) Price (a random float) Number (a random integer between one and nine) The use case is that we have the results of some sort of consumer-purchase research, combined with weather and market pricing, where we calculate the average of prices and keep a track of the sample size and parameters: import pandas as pd from numpy.random import seed from numpy.random import rand from numpy.random import rand_int import numpy as np seed(42) df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold','hot', 'cold', 'hot', 'cold'], 'Food' : ['soup', 'soup', 'icecream', 'chocolate', 'icecream', 'icecream', 'soup'], 'Price' : 10 * rand(7), 'Number' : rand_int(1, 9,)}) print(df) You should get an output similar to the following: Please note that the column labels come from the lexically ordered keys of the Python dict. Lexical or lexicographical order is based on the alphabetic order of characters in a string. Group the data by the Weather column and then iterate through the groups as follows: weather_group = df.groupby('Weather') i = 0 for name, group in weather_group: i = i + 1 print("Group", i, name) print(group) We have two types of weather, hot and cold, so we get two groups: The weather_group variable is a special pandas object that we get as a result of the groupby() method. This object has aggregation methods, which are demonstrated as follows: print("Weather group firstn", weather_group.first()) print("Weather group lastn", weather_group.last()) print("Weather group meann", weather_group.mean()) The preceding code snippet prints the first row, last row, and mean of each group: Just as in a database query, we are allowed to group on multiple columns. The groups attribute will then tell us the groups that are formed, as well as the rows in each group: wf_group = df.groupby(['Weather', 'Food']) print("WF Groups", wf_group.groups) For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows: WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'): [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1], ('cold', 'soup'): [0, 6] 5. Apply a list of NumPy functions on groups with the agg() method: print("WF Aggregatedn", wf_group.agg([np.mean, np.median])) Obviously, we could apply even more functions, but it would look messier than the following output: Concatenating and appending DataFrames The pandas DataFrame allows operations that are similar to the inner and outer joins of database tables. We can append and concatenate rows as well. To practice appending and concatenating of rows, we will reuse the DataFrame from the previous section. Let's select the first three rows: print("df :3n", df[:3]) Check that these are indeed the first three rows: df :3 Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold The concat() function concatenates DataFrames. For example, we can concatenate a DataFrame that consists of three rows to the rest of the rows, in order to recreate the original DataFrame: print("Concat Back togethern", pd.concat([df[:3], df[3:]])) The concatenation output appears as follows: Concat Back together Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold 3 chocolate 8 5.986585 hot 4 icecream 8 1.560186 cold 5 icecream 3 1.559945 hot 6 soup 6 0.580836 cold [7 rows x 4 columns] To append rows, use the append() function: print("Appending rowsn", df[:3].append(df[5:])) The result is a DataFrame with the first three rows of the original DataFrame and the last two rows appended to it: Appending rows Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold 5 icecream 3 1.559945 hot 6 soup 6 0.580836 cold [5 rows x 4 columns] Joining DataFrames To demonstrate joining, we will use two CSV files-dest.csv and tips.csv. The use case behind it is that we are running a taxi company. Every time a passenger is dropped off at his or her destination, we add a row to the dest.csv file with the employee number of the driver and the destination: EmpNr,Dest5,The Hague3,Amsterdam9,Rotterdam Sometimes drivers get a tip, so we want that registered in the tips.csv file (if this doesn't seem realistic, please feel free to come up with your own story): EmpNr,Amount5,109,57,2.5 Database-like joins in pandas can be done with either the merge() function or the join() DataFrame method. The join() method joins onto indices by default, which might not be what you want. In SQL a relational database query language we have the inner join, left outer join, right outer join, and full outer join. An inner join selects rows from two tables, if and only if values match, for columns specified in the join condition. Outer joins do not require a match, and can potentially return more rows. More information on joins can be found at http://en.wikipedia.org/wiki/Join_%28SQL%29. All these join types are supported by pandas, but we will only take a look at inner joins and full outer joins: A join on the employee number with the merge() function is performed as follows: print("Merge() on keyn", pd.merge(dests, tips, on='EmpNr')) This gives an inner join as the outcome: Merge() on key EmpNr Dest Amount 0 5 The Hague 10 1 9 Rotterdam 5 [2 rows x 3 columns] Joining with the join() method requires providing suffixes for the left and right operands: print("Dests join() tipsn", dests.join(tips, lsuffix='Dest', rsuffix='Tips')) This method call joins index values so that the result is different from an SQL inner join: Dests join() tips EmpNrDest Dest EmpNrTips Amount 0 5 The Hague 5 10.0 1 3 Amsterdam 9 5.0 2 9 Rotterdam 7 2.5 [3 rows x 4 columns] An even more explicit way to execute an inner join with merge() is as follows: print("Inner join with merge()n", pd.merge(dests, tips, how='inner')) The output is as follows: Inner join with merge() EmpNr Dest Amount 0 5 The Hague 10 1 9 Rotterdam 5 [2 rows x 3 columns] To make this a full outer join requires only a small change: print("Outer joinn", pd.merge(dests, tips, how='outer')) The outer join adds rows with NaN values: Outer join EmpNr Dest Amount 0 5 The Hague 10.0 1 3 Amsterdam NaN 2 9 Rotterdam 5.0 3 7 NaN 2.5 [4 rows x 3 columns] In a relational database query, these values would have been set to NULL. The demo code is in the ch-03.ipynb file of this book's code bundle. We learnt how to perform various data manipulation techniques such as aggregating, concatenating, appending, cleaning, and handling missing values, with pandas. If you found this post useful, check out the book Python Data Analysis - Second Edition to learn advanced topics such as signal processing, textual data analysis, machine learning, and more.

0
0
6574

article-image-session-3-fairness-in-computer-vision-and-nlp

Sugandha Lahoti

23 Feb 2018

6 min read

FAT Conference 2018 Session 3: Fairness in Computer Vision and NLP

Sugandha Lahoti

23 Feb 2018

6 min read

Machine learning has emerged with a vast new ecosystem of techniques and infrastructure and we are just beginning to learn their full capabilities. But with the exciting innovations happening, there are also some really concerning problems arising. Forms of bias, stereotyping and unfair determination are being found in computer vision systems, object recognition models, and in natural language processing and word embeddings. The Conference on Fairness, Accountability, and Transparency (FAT) scheduled on Feb 23 and 24 this year in New York is an annual conference dedicating to bringing theory and practice of fair and interpretable Machine Learning, Information Retrieval, NLP, Computer Vision, Recommender systems, and other technical disciplines. This year's program includes 17 peer-reviewed papers and 6 tutorials from leading experts in the field. The conference will have three sessions. Session 3 of the two-day conference on Saturday, February 24, is in the field of fairness in computer vision and NLP. In this article, we give our readers a peek into the three papers that have been selected for presentation in Session 3. You can also check out Session 1 and Session 2, in case you’ve missed them. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification What is the paper about The paper talks about substantial disparities in the accuracy of classifying darker and lighter females and males in gender classification systems. The authors have evaluated bias present in automated facial analysis algorithms and datasets with respect to phenotypic subgroups. Using the dermatologist approved Fitzpatrick Skin Type classification system, they have characterized the gender and skin type distribution of two facial analysis benchmarks, IJB-A and Adience. They have also evaluated 3 commercial gender classification systems using this dataset. Key takeaways The paper measures accuracy of 3 commercial gender classification algorithms by Microsoft, IBM, and Face++ on the new Pilot Parliaments Benchmark which is balanced by gender and skin type. On annotating the dataset with the Fitzpatrick skin classification system and testing gender classification performance on 4 subgroups, they found : All classifiers perform better on male faces than on female faces (8.1% − 20.6% difference in error rate) All classifiers perform better on lighter faces than darker faces (11.8% − 19.2% difference in error rate) All classifiers perform worst on darker female faces (20.8% − 34.7% error rate) Microsoft and IBM classifiers perform best on lighter male faces (error rates of 0.0% and 0.3% respectively) Face++ classifiers perform best on darker male faces (0.7% error rate) The maximum difference in error rate between the best and worst classified groups is 34.4% They encourage further work to see if the substantial error rate gaps on the basis of gender, skin type and intersectional subgroup revealed in this study of gender classification persist in other human-based computer vision tasks as well. Analyze, Detect and Remove Gender Stereotyping from Bollywood Movies What is the paper about The paper studies gender stereotypes and cases of bias in the Hindi movie industry (Bollywood) and propose an algorithm to remove these stereotypes from text. The authors have analyzed movie plots and posters for all movies released since 1970. The gender bias is detected by semantic modeling of plots at sentence and intra-sentence level. Different features like occupation, introductions, associated actions and descriptions are captured to show the pervasiveness of gender bias and stereotype in movies. Next, they have developed an algorithm to generate debiased stories. The proposed debiasing algorithm extracts gender biased graphs from unstructured piece of text in stories from movies and de-bias these graphs to generate plausible unbiased stories. Key takeaways The analysis is performed at sentence at multi-sentence level and uses word embeddings by adding context vector and studying the bias in data. Data observation showed that while analyzing occupations for males and females, higher level roles are designated to males while lower level roles are designated to females. A similar trend has been observed for centrality where females were less central in the plot vs their male counterparts. Also, while predicting gender using context word vectors, with very small training data, a very high accuracy was observed in gender prediction for test data reflecting a substantial amount of bias present in the data. The authors have also presented an algorithm to remove such bias present in text. They show that by interchanging the gender of high centrality male character with a high centrality female character in the plot text, leaves no change in the story but de-biases it completely. Mixed Messages? The Limits of Automated Social Media Content Analysis What is the paper about This paper broadcasts that a knowledge gap exists between data scientists studying NLP and policymakers advocating for the wide adoption of automated social media analysis and moderation. It urges policymakers to understand the capabilities and limits of NLP before endorsing or adopting automated content analysis tools, particularly for making decisions that affect fundamental rights or access to government benefits. It draws on existing research to explain the capabilities and limitations of text classifiers for social media posts and other online content. This paper is aimed at helping researchers and technical experts address the gaps in policymakers knowledge about what is possible with automated text analysis. Key takeaways The authors have provided an overview of how NLP classifiers work and identified five key limitations of these tools that must be communicated to policymakers: NLP classifiers require domain-specific training and cannot be applied with the same reliability across different domains. NLP tools can amplify social bias reflected in language and are likely to have lower accuracy for minority groups. Accurate text classification requires clear, consistent definitions of the type of speech to be identified. Policy debates around content moderation and social media mining tend to lack such precise definitions. The accuracy achieved in NLP studies does not warrant widespread application of these tools to social media content analysis and moderation. Text filters remain easy to evade and fall far short of humans ability to parse meaning from text. The paper concludes with recommendations for NLP researchers to bridge the knowledge gap between technical experts and policymakers, including Clearly describe the domain limitations of NLP tools. Increase development of non-English training resources. Provide more detail and context for accuracy measures. Publish more information about definitions and instructions provided to annotators. Don’t miss our coverage on Session 4 and Session 5 on Fair Classification, Fat recommenders, etc.

0
0
2611

article-image-getting-started-with-apache-kafka-clusters

Amarabha Banerjee

23 Feb 2018

10 min read

Getting Started with Apache Kafka Clusters

Amarabha Banerjee

23 Feb 2018

10 min read

[box type="note" align="" class="" width=""]Below given article is a book excerpt from Apache Kafka 1.0 Cookbook written by Raúl Estrada. This book contains easy to follow recipes to help you set-up, configure and use Apache Kafka in the best possible manner.[/box] Here in this article, we are going to talk about how you can get started with Apache Kafka clusters and implement them seamlessly. In Apache Kafka there are three types of clusters: Single-node single-broker Single-node multiple-broker Multiple-node multiple-broker cluster The following four recipes show how to run Apache Kafka in these clusters. Configuring a single-node single-broker cluster – SNSB The first cluster configuration is single-node single-broker (SNSB). This cluster is very useful when a single point of entry is needed. Yes, its architecture resembles the singleton design pattern. A SNSB cluster usually satisfies three requirements: Controls concurrent access to a unique shared broker Access to the broker is requested from multiple, disparate producers There can be only one broker If the proposed design has only one or two of these requirements, a redesign is almost always the correct option. Sometimes, the single broker could become a bottleneck or a single point of failure. But it is useful when a single point of communication is needed. Getting ready Go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka How to do it... The diagram shows an example of an SNSB cluster: Starting ZooKeeper Kafka provides a simple ZooKeeper configuration file to launch a single ZooKeeper instance. To install the ZooKeeper instance, use this command: > bin/zookeeper-server-start.sh config/zookeeper.properties The main properties specified in the zookeeper.properties file are: clientPort: This is the listening port for client requests. By default, ZooKeeper listens on TCP port 2181: clientPort=2181 dataDir: This is the directory where ZooKeeper is stored: dataDir=/tmp/zookeeper means unbounded): maxClientCnxns=0 For more information about Apache ZooKeeper visit the project home page at: http://zookeeper.apache.org/. Starting the broker After ZooKeeper is started, start the Kafka broker with this command: > bin/kafka-server-start.sh config/server.properties The main properties specified in the server.properties file are: broker.id: The unique positive integer identifier for each broker: broker.id=0 log.dir: Directory to store log files: log.dir=/tmp/kafka10-logs num.partitions: The number of log partitions per topic: num.partitions=2 port: The port that the socket server listens on: port=9092 zookeeper.connect: The ZooKeeper URL connection: zookeeper.connect=localhost:2181 How it works Kafka uses ZooKeeper for storing metadata information about the brokers, topics, and partitions. Writes to ZooKeeper are performed only on changes of consumer group membership or on changes to the Kafka cluster itself. This amount of traffic is minimal, and there is no need for a dedicated ZooKeeper ensemble for a single Kafka cluster. Actually, many deployments use a single ZooKeeper ensemble to control multiple Kafka clusters (using a chroot ZooKeeper path for each cluster). SNSB – creating a topic, producer, and consumer The SNSB Kafka cluster is running; now let's create topics, producer, and consumer. Getting ready We need the previous recipe executed: Kafka already installed ZooKeeper up and running A Kafka server up and running Now, go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka How to do it The following steps will show you how to create an SNSB topic, producer, and consumer. Creating a topic As we know, Kafka has a command to create topics. Here we create a topic called SNSBTopic with one partition and one replica: > bin/kafka-topics.sh --create --zookeeper localhost:2181 -- replication-factor 1 --partitions 1 --topic SNSBTopic We obtain the following output: Created topic "SNSBTopic". The command parameters are: --replication-factor 1: This indicates just one replica --partition 1: This indicates just one partition --zookeeper localhost:2181: This indicates the ZooKeeper URL As we know, to get the list of topics on a Kafka server we use the following command: > bin/kafka-topics.sh --list --zookeeper localhost:2181 We obtain the following output: SNSBTopic Starting the producer Kafka has a command to start producers that accepts inputs from the command line and publishes each input line as a message. By default, each new line is considered a message: > bin/kafka-console-producer.sh --broker-list localhost:9092 -- topic SNSBTopic This command requires two parameters: broker-list: The broker URL to connect to topic: The topic name (to send a message to the topic subscribers) Now, type the following in the command line: The best thing about a boolean is [Enter] even if you are wrong [Enter] you are only off by a bit. [Enter] This output is obtained (as expected): The best thing about a boolean is even if you are wrong you are only off by a bit. The producer.properties file has the producer configuration. Some important properties defined in the producer.properties file are: metadata.broker.list: The list of brokers used for bootstrapping information on the rest of the cluster in the format host1:port1, host2:port2: metadata.broker.list=localhost:9092 compression.codec: The compression codec used. For example, none, gzip, and snappy: compression.codec=none Starting the consumer Kafka has a command to start a message consumer client. It shows the output in the command line as soon as it has subscribed to the topic: > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic SNSBTopic --from-beginning Note that the parameter from-beginning is to show the entire log: The best thing about a boolean is even if you are wrong you are only off by a bit. One important property defined in the consumer.properties file is: group.id: This string identifies the consumers in the same group: group.id=test-consumer-group There's more It is time to play with this technology. Open a new command-line window for ZooKeeper, a broker, two producers, and two consumers. Type some messages in the producers and watch them get displayed in the consumers. If you don't know or don't remember how to run the commands, run it with no arguments to display the possible values for the Parameters. Configuring a single-node multiple-broker cluster – SNMB The second cluster configuration is single-node multiple-broker (SNMB). This cluster is used when there is just one node but inner redundancy is needed. When a topic is created in Kafka, the system determines how each replica of a partition is mapped to each broker. In general, Kafka tries to spread the replicas across all available brokers. The messages are first sent to the first replica of a partition (to the current broker leader of that partition) before they are replicated to the remaining brokers. The producers may choose from different strategies for sending messages (synchronous or asynchronous mode). Producers discover the available brokers in a cluster and the partitions on each (all this by registering watchers in ZooKeeper). In practice, some of the high volume topics are configured with more than one partition per broker. Remember that having more partitions increases the I/O parallelism for writes and this increases the degree of parallelism for consumers (the partition is the unit for distributing data to consumers). On the other hand, increasing the number of partitions increases the overhead because: There are more files, so more open file handlers There are more offsets to be checked by consumers, so the ZooKeeper load is increased The art of this is to balance these tradeoffs. Getting ready Go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka The following diagram shows an example of an SNMB cluster: How to do it Begin starting the ZooKeeper server as follows: > bin/zookeeper-server-start.sh config/zookeeper.properties A different server.properties file is needed for each broker. Let's call them: server-1.properties, server-2.properties, server-3.properties, and so on (original, isn't it?). Each file is a copy of the original server.properties file. In the server-1.properties file set the following properties: broker.id=1 port=9093 log.dir=/tmp/kafka-logs-1 Similarly, in the server-2.properties file set the following properties: broker.id=2 port=9094 log.dir=/tmp/kafka-logs-2 Finally, in the server-3.properties file set the following properties: broker.id=3 port=9095 log.dir=/tmp/kafka-logs-3 With ZooKeeper running, start the Kafka brokers with these commands: > bin/kafka-server-start.sh config/server-1.properties > bin/kafka-server-start.sh config/server-2.properties > bin/kafka-server-start.sh config/server-3.properties How it works Now the SNMB cluster is running. The brokers are running on the same Kafka node, on ports 9093, 9094, and 9095. SNMB – creating a topic, producer, and consumer The SNMB Kafka cluster is running; now let's create topics, producer, and consumer. Getting ready We need the previous recipe executed: Kafka already installed ZooKeeper up and running A Kafka server up and running Now, go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka How to do it The following steps will show you how to create an SNMB topic, producer, and consumer Creating a topic Using the command to create topics, let's create a topic called SNMBTopic with two partitions and two replicas: > bin/kafka-topics.sh --create --zookeeper localhost:2181 -- replication-factor 2 --partitions 3 --topic SNMBTopic The following output is displayed: Created topic "SNMBTopic" This command has the following effects: Kafka will create three logical partitions for the topic. Kafka will create two replicas (copies) per partition. This means, for each partition it will pick two brokers that will host those replicas. For each partition, Kafka will randomly choose a broker Leader. Now ask Kafka for the list of available topics. The list now includes the new SNMBTopic: > bin/kafka-topics.sh --zookeeper localhost:2181 --list SNMBTopic Starting a producer Now, start the producers; indicating more brokers in the broker-list is easy: > bin/kafka-console-producer.sh --broker-list localhost:9093, localhost:9094, localhost:9095 --topic SNMBTopic If it's necessary to run multiple producers connecting to different brokers, specify a different broker list for each producer. Starting a consumer To start a consumer, use the following command: > bin/kafka-console-consumer.sh -- zookeeper localhost:2181 --frombeginning --topic SNMBTopic How it works The first important fact is the two parameters: replication-factor and partitions. The replication-factor is the number of replicas each partition will have in the topic created. The partitions parameter is the number of partitions for the topic created. There's more If you don't know the cluster configuration or don't remember it, there is a useful option for the kafka-topics command, the describe parameter: > bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic SNMBTopic The output is something similar to: Topic:SNMBTopic PartitionCount:3 ReplicationFactor:2 Configs: Topic: SNMBTopic Partition: 0 Leader: 2 Replicas: 2,3 Isr: 3,2 Topic: SNMBTopic Partition: 1 Leader: 3 Replicas: 3,1 Isr: 1,3 Topic: SNMBTopic Partition: 2 Leader: 1 Replicas: 1,2 Isr: 1,2 An explanation of the output: the first line gives a summary of all the partitions; each line gives information about one partition. Since we have three partitions for this topic, there are three lines: Leader: This node is responsible for all reads and writes for a particular partition. For a randomly selected section of the partitions each node is the leader. Replicas: This is the list of nodes that duplicate the log for a particular partition irrespective of whether it is currently alive. Isr: This is the set of in-sync replicas. It is a subset of the replicas currently alive and following the leader. In order to see the options for: create, delete, describe, or change a topic, type this command without parameters: > bin/kafka-topics.sh We discussed how to implement Apache Kafka clusters effectively. If you liked this post, be sure to check out Apache Kafka 1.0 Cookbook which consists of useful recipes to work with Apache Kafka installation.

0
0
3475

article-image-how-to-configure-metricbeat-for-application-and-server-infrastructure

Pravin Dhandre

23 Feb 2018

8 min read

How to Configure Metricbeat for Application and Server infrastructure

Pravin Dhandre

23 Feb 2018

8 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book provides detailed understanding in how you can employ Elastic Stack in performing distributed analytics along with resolving various data processing challenges.[/box] In today’s tutorial, we will show the step-by-step configuration of Metricbeat, a Beats platform for monitoring server and application infrastructure. Configuring Metricbeat The configurations related to Metricbeat are stored in a configuration file named metricbeat.yml, and it uses YAML syntax. The metricbeat.yml file contains the following: Module configuration General settings Output configuration Processor configuration Path configuration Dashboard configuration Logging configuration Let's explore some of these sections. Module configuration Metricbeat comes bundled with various modules to collect metrics from the system and applications such as Apache, MongoDB, Redis, MySQL, and so on. Metricbeat provides two ways of enabling modules and metricsets: Enabling module configs in the modules.d directory Enabling module configs in the metricbeat.yml file Enabling module configs in the modules.d directory The modules.d directory contains default configurations for all the modules available in Metricbeat. The configuration specific to a module is stored in a .yml file with the name of the file being the name of the module. For example, the configuration related to the MySQL module would be stored in the mysql.yml file. By default, excepting the system module, all other modules are disabled. To list the modules that are available in Metricbeat, execute the following command: Windows: D:packtmetricbeat-6.0.0-windows-x86_64>metricbeat.exe modules list Linux: [locationOfMetricBeat]$./metricbeat modules list The modules list command displays all the available modules and also lists which modules are currently enabled/disabled. As each module comes with the default configurations, make the appropriate changes in the module configuration file. The basic configuration for mongodb module will look as follows: - module: mongodb metricsets: ["dbstats", "status"] period: 10s hosts: ["localhost:27017"] username: user password: pass To enable it, execute the modules enable command, passing one or more module name. For example: Windows: D:packtmetricbeat-6.0.0-windows-x86_64>metricbeat.exe modules enable redis mongodb Linux: [locationOfMetricBeat]$./metricbeat modules enable redis mongodb Similar to disable modules, execute the modules disable command, passing one or more module names to it. For example: Windows: D:packtmetricbeat-6.0.0-windows-x86_64>metricbeat.exe modules disable redis mongodb Linux: [locationOfMetricBeat]$./metricbeat modules disable redis mongodb To enable dynamic config reloading, set reload.enabled to true and to specify the frequency to look for config file changes. Set the reload.period parameter under the metricbeat.config.modules property. For example: #metricbeat.yml metricbeat.config.modules: path: ${path.config}/modules.d/*.yml reload.enabled: true reload.period: 20s Enabling module config in the metricbeat.yml file If one is used to earlier versions of Metricbeat, one can enable the modules and metricsets in the metricbeat.yml file directly by adding entries to the metricbeat.modules list. Each entry in the list begins with a dash (-) and is followed by the settings for that module. For Example: metricbeat.modules: #------------------ Memcached Module ----------------------------- - module: memcached metricsets: ["stats"] period: 10s hosts: ["localhost:11211"] #------------------- MongoDB Module ------------------------------ - module: mongodb metricsets: ["dbstats", "status"] period: 5s It is possible to specify the module multiple times and specify a different period to use for one or more metricset. For example: #------- Couchbase Module ----------------------------- - module: couchbase metricsets: ["bucket"] period: 15s hosts: ["localhost:8091"] - module: couchbase metricsets: ["cluster", "node"] period: 30s hosts: ["localhost:8091"] General settings This section contains configuration options and some general settings to control the behavior of Metricbeat. Some of the configuration options/settings are: name: The name of the shipper that publishes the network data. By default, hostname is used for this field: name: "dc1-host1" tags: The list of tags that will be included in the tags field of every event Metricbeat ships. Tags make it easy to group servers by different logical properties and help when filtering events in Kibana and Logstash: tags: ["staging", "web-tier","dc1"] max_procs: The maximum number of CPUs that can be executing simultaneously. The default is the number of logical CPUs available in the System: max_procs: 2 Output configuration This section is used to configure outputs where the events need to be shipped. Events can be sent to single or multiple outputs simultaneously. The allowed outputs are Elasticsearch, Logstash, Kafka, Redis, file, and console. Some of the outputs that can be configured are as follows: elasticsearch: It is used to send the events directly to Elasticsearch. A sample Elasticsearch output configuration is shown in the following code snippet: output.elasticsearch: enabled: true hosts: ["localhost:9200"] Using the enabled setting, one can enable or disable the output. hosts accepts one or more Elasticsearch node/server. Multiple hosts can be defined for failover purposes. When multiple hosts are configured, the events are distributed to these nodes in round robin order. If Elasticsearch is secured, then the credentials can be passed using the username and password settings: output.elasticsearch: enabled: true hosts: ["localhost:9200"] username: "elasticuser" password: "password" To ship the events to the Elasticsearch ingest node pipeline so that they can be pre-processed before being stored in Elasticsearch, the pipeline information can be provided using the pipleline setting: output.elasticsearch: enabled: true hosts: ["localhost:9200"] pipeline: "ngnix_log_pipeline" The default index the data gets written to is of the format metricbeat-%{[beat.version]}-%{+yyyy.MM.dd}. This will create a new index every day. For example if today is December 2, 2017 then all the events are placed in the metricbeat-6.0.0-2017-12-02 index. One can override the index name or the pattern using the index setting. In the following configuration snippet, a new index is created for every month: output.elasticsearch: hosts: ["http://localhost:9200"] index: "metricbeat-%{[beat.version]}-%{+yyyy.MM}" Using the indices setting, one can conditionally place the events in the appropriate index that matches the specified condition. In the following code snippet, if the message contains the DEBUG string, it will be placed in the debug-%{+yyyy.MM.dd} index. If the message contains the ERR string, it will be placed in the error-%{+yyyy.MM.dd} index. If the message contains neither of these texts, then those events will be pushed to the logs-%{+yyyy.MM.dd} index as specified in the index parameter: output.elasticsearch: hosts: ["http://localhost:9200"] index: "logs-%{+yyyy.MM.dd}" indices: - index: "debug-%{+yyyy.MM.dd}" when.contains: message: "DEBUG" - index: "error-%{+yyyy.MM.dd}" when.contains: message: "ERR" When the index parameter is overridden, disable templates and dashboards by adding the following setting in: setup.dashboards.enabled: false setup.template.enabled: false Alternatively, provide the value for setup.template.name and setup.template.pattern in the metricbeat.yml configuration file, or else Metricbeat will fail to run. logstash: It is used to send the events to Logstash. To use Logstash as the output, Logstash needs to be configured with the Beats input plugin to receive incoming Beats events. A sample Logstash output configuration is as follows: output.logstash: enabled: true hosts: ["localhost:5044"] Using the enabled setting, one can enable or disable the output. hosts accepts one or more Logstash servers. Multiple hosts can be defined for failover purposes. If the configured host is unresponsive, then the event will be sent to one of the other configured hosts. When multiple hosts are configured, the events are distributed in random order. To enable load balancing of events across the Logstash hosts, use the loadbalance flag, set to true: output.logstash: hosts: ["localhost:5045", "localhost:5046"] loadbalance: true console: It is used to send the events to stdout. The events are written in JSON format. It is useful during debugging or testing. A sample console configuration is as follows: output.console: enabled: true pretty: true Logging This section contains the options for configuring the Filebeat logging output. The logging system can write logs to syslog or rotate log files. If logging is not explicitly configured, file output is used on Windows systems, and syslog output is used on Linux and OS X. A sample configuration is as follows: logging.level: debug logging.to_files: true logging.files: path: C:logsmetricbeat name: metricbeat.log keepfiles: 10 Some of the configuration options are: level: To specify the logging level. to_files: To write all logging output to files. The files are subject to file rotation. This is the default value. to_syslog: To write the logging output to syslogs if this setting is set to true. files.path, files.name, and files.keepfiles: These are used to specify the location of the file, the name We successfully configured Beat Library, MetricBeat and developed good transmission of operational metrics to Elasticsearch, making it easy to monitor systems and services on servers with much ease. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, metrics and more.

0
0
14609

How to perform regression analysis using SAS

How to win Kaggle competition with Apache SparkML

How to query sharded data in MongoDB

Performing descriptive analysis with SAS

What is ensemble learning?

How to execute jobs in an iterative way with Pentaho Data Integration (PDI)

Analyzing Textual Data using the NLTK Library

FAT* 2018 Conference Session 5 Summary on FAT Recommenders, Etc.

Getting Started with Pentaho Data Integration and Pentaho BI Suite

How to implement Dynamic SQL in PostgreSQL 10

Trending Topics

FAT Conference 2018 Session 4: Fair Classification

Working with pandas DataFrames

FAT Conference 2018 Session 3: Fairness in Computer Vision and NLP

Getting Started with Apache Kafka Clusters

How to Configure Metricbeat for Application and Server infrastructure