
How-To Tutorials - Data


Entity Framework Code-First: Accessing Database Views and Stored Procedures

Packt
18 Mar 2015
15 min read
In this article by Sergey Barskiy, author of the book Code-First Development using Entity Framework, you will learn how to integrate Entity Framework with additional database objects, specifically views and stored procedures. We will see how to take advantage of existing stored procedures and functions to retrieve and change the data. You will learn how to persist changed entities from our context using stored procedures. We will gain an understanding of the advantages of asynchronous processing and see how Entity Framework supports this concept via its built-in API. Finally, you will learn why concurrency is important for a multi-user application and what options are available in Entity Framework to implement optimistic concurrency. In this article, we will cover how to: Get data from a view Get data from a stored procedure or table-valued function Map create, update, and delete operations on a table to a set of stored procedures Use the asynchronous API to get and save the data Implement multi-user concurrency handling Working with views Views in an RDBMS fulfill an important role. They allow developers to combine data from multiple tables into a structure that looks like a table, but do not provide persistence. Thus, we have an abstraction on top of raw table data. One can use this approach to provide different security rights, for example. We can also simplify queries we have to write, especially if we access the data defined by views quite frequently in multiple places in our code. Entity Framework Code-First does not fully support views as of now. As a result, we have to use a workaround. One approach would be to write code as if a view was really a table, that is, let Entity Framework define this table, then drop the table, and create a replacement view. We will still end up with strongly typed data with full query support. Let's start with the same database structure we used before, including person and person type. Our view will combine a few columns from the Person table and Person type name, as shown in the following code snippet: public class PersonViewInfo { public int PersonId { get; set; } public string TypeName { get; set; } public string FirstName { get; set; } public string LastName { get; set; } } Here is the same class in VB.NET: Public Class PersonViewInfo Public Property PersonId() As Integer Public Property TypeName() As String Public Property FirstName() As String Public Property LastName() As String End Class Now, we need to create a configuration class for two reasons. We need to specify a primary key column because we do not follow the naming convention that Entity Framework assumes for primary keys. Then, we need to specify the table name, which will be our view name, as shown in the following code: public class PersonViewInfoMap : EntityTypeConfiguration<PersonViewInfo> { public PersonViewInfoMap() {    HasKey(p => p.PersonId);    ToTable("PersonView"); } } Here is the same class in VB.NET: Public Class PersonViewInfoMap Inherits EntityTypeConfiguration(Of PersonViewInfo) Public Sub New()    HasKey(Function(p) p.PersonId)    ToTable("PersonView") End Sub End Class Finally, we need to add a property to our context that exposes this data, as shown here: public DbSet<PersonViewInfo> PersonView { get; set; } The same property in VB.NET looks quite familiar to us, as shown in the following code: Property PersonView() As DbSet(Of PersonViewInfo) Now, we need to work with our initializer to drop the table and create a view in its place. 
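One wiring step that the text does not show explicitly is registering the PersonViewInfoMap configuration with the context; without it, Entity Framework would fall back to its default conventions. A minimal sketch of the usual EF6 approach is shown below (the Context class name follows the article, and its other members are omitted):

```csharp
using System.Data.Entity;

public class Context : DbContext
{
    public DbSet<PersonViewInfo> PersonView { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Apply the key and the "PersonView" table/view name defined in the map class
        modelBuilder.Configurations.Add(new PersonViewInfoMap());
        base.OnModelCreating(modelBuilder);
    }
}
```

With the mapping registered, the initializer can replace the generated table with the view, as described next.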
We are using one of the initializers we created before. When we cover migrations, we will see that the same approach works there as well, with virtually identical code. Here is the code we added to the Seed method of our initializer, as shown in the following code: public class Initializer : DropCreateDatabaseIfModelChanges<Context> { protected override void Seed(Context context) {    context.Database.ExecuteSqlCommand("DROP TABLE PersonView");    context.Database.ExecuteSqlCommand(      @"CREATE VIEW [dbo].[PersonView]      AS      SELECT        dbo.People.PersonId,        dbo.People.FirstName,        dbo.People.LastName,        dbo.PersonTypes.TypeName      FROM        dbo.People      INNER JOIN dbo.PersonTypes        ON dbo.People.PersonTypeId = dbo.PersonTypes.PersonTypeId      "); } } In the preceding code, we first drop the table using the ExecuteSqlCommand method of the Database object. This method is useful because it allows the developer to execute arbitrary SQL code against the backend. We call this method twice, the first time to drop the tables and the second time to create our view. The same initializer code in VB.NET looks as follows: Public Class Initializer Inherits DropCreateDatabaseIfModelChanges(Of Context) Protected Overrides Sub Seed(ByVal context As Context)    context.Database.ExecuteSqlCommand("DROP TABLE PersonView")    context.Database.ExecuteSqlCommand( <![CDATA[      CREATE VIEW [dbo].[PersonView]      AS      SELECT        dbo.People.PersonId,        dbo.People.FirstName,        dbo.People.LastName,        dbo.PersonTypes.TypeName      FROM        dbo.People      INNER JOIN dbo.PersonTypes        ON dbo.People.PersonTypeId = dbo.PersonTypes.PersonTypeId]]>.Value()) End Sub End Class Since VB.NET does not support multiline strings such as C#, we are using XML literals instead, getting a value of a single node. This just makes SQL code more readable. We are now ready to query our data. This is shown in the following code snippet: using (var context = new Context()) { var people = context.PersonView    .Where(p => p.PersonId > 0)    .OrderBy(p => p.LastName)    .ToList(); foreach (var personViewInfo in people) {    Console.WriteLine(personViewInfo.LastName); } As we can see, there is literally no difference in accessing our view or any other table. Here is the same code in VB.NET: Using context = New Context() Dim people = context.PersonView _      .Where(Function(p) p.PersonId > 0) _      .OrderBy(Function(p) p.LastName) _      .ToList() For Each personViewInfo In people    Console.WriteLine(personViewInfo.LastName) Next End Using Although the view looks like a table, if we try to change and update an entity defined by this view, we will get an exception. If we do not want to play around with tables in such a way, we can still use the initializer to define our view, but query the data using a different method of the Database object, SqlQuery. This method has the same parameters as ExecuteSqlCommand, but is expected to return a result set, in our case, a collection of PersonViewInfo objects, as shown in the following code: using (var context = new Context()) { var sql = @"SELECT * FROM PERSONVIEW WHERE PERSONID > {0} "; var peopleViaCommand = context.Database.SqlQuery<PersonViewInfo>(    sql,    0); foreach (var personViewInfo in peopleViaCommand) {    Console.WriteLine(personViewInfo.LastName); } } The SqlQuery method takes generic type parameters, which define what data will be materialized when a raw SQL command is executed. 
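As a small aside (this snippet is illustrative and not from the book), the generic argument of SqlQuery is not limited to entity classes; inside the same using block, a native .NET type such as int can be materialized directly, a point the text returns to shortly (System.Linq is needed for Single()):

```csharp
// Materialize a single scalar value instead of a collection of entities
var count = context.Database
    .SqlQuery<int>("SELECT COUNT(*) FROM PersonView WHERE PersonId > {0}", 0)
    .Single();
```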
The text of the command itself is simply parameterized SQL. We need to use parameters to ensure that our dynamic code is not subject to SQL injection. SQL injection is a process in which a malicious user can execute arbitrary SQL code by providing specific input values. Entity Framework is not subject to such attacks on its own. Here is the same code in VB.NET: Using context = New Context() Dim sql = "SELECT * FROM PERSONVIEW WHERE PERSONID > {0} " Dim peopleViaCommand = context.Database.SqlQuery(Of PersonViewInfo)(sql, 0)    For Each personViewInfo In peopleViaCommand    Console.WriteLine(personViewInfo.LastName) Next End Using We not only saw how to use views in Entity Framework, but saw two extremely useful methods of the Database object, which allows us to execute arbitrary SQL statements and optionally materialize the results of such queries. The generic type parameter does not have to be a class. You can use the native .NET type, such as a string or an integer. It is not always necessary to use views. Entity Framework allows us to easily combine multiple tables in a single query. Working with stored procedures The process of working with stored procedures in Entity Framework is similar to the process of working with views. We will use the same two methods we just saw on the Database object—SqlQuery and ExecuteSqlCommand. In order to read a number of rows from a stored procedure, we simply need a class that we will use to materialize all the rows of retrieved data into a collection of instances of this class. For example, to read the data from the stored procedure, consider this query: CREATE PROCEDURE [dbo].[SelectCompanies] @dateAdded as DateTime AS BEGIN SELECT CompanyId, CompanyName FROM Companies WHERE DateAdded > @dateAdded END We just need a class that matches the results of our stored procedure, as shown in the following code: public class CompanyInfo { public int CompanyId { get; set; } public string CompanyName { get; set; } } The same class looks as follows in VB.NET: Public Class CompanyInfo Property CompanyId() As Integer Property CompanyName() As String End Class We are now able to read the data using the SqlQuery method, as shown in the following code: sql = @"SelectCompanies {0}"; var companies = context.Database.SqlQuery<CompanyInfo>( sql, DateTime.Today.AddYears(-10)); foreach (var companyInfo in companies) { We specified which class we used to read the results of the query call. We also provided a formatted placeholder when we created our SQL statement for a parameter that the stored procedure takes. We provided a value for that parameter when we called SqlQuery. If one has to provide multiple parameters, one just needs to provide an array of values to SqlQuery and provide formatted placeholders, separated by commas as part of our SQL statement. We could have used a table values function instead of a stored procedure as well. Here is how the code looks in VB.NET: sql = "SelectCompanies {0}" Dim companies = context.Database.SqlQuery(Of CompanyInfo)( sql, DateTime.Today.AddYears(-10)) For Each companyInfo As CompanyInfo In companies Another use case is when our stored procedure does not return any values, but instead simply issues a command against one or more tables in the database. It does not matter as much what a procedure does, just that it does not need to return a value. 
For example, here is a stored procedure that updates multiple rows in a table in our database: CREATE PROCEDURE dbo.UpdateCompanies @dateAdded as DateTime, @activeFlag as Bit AS BEGIN UPDATE Companies Set DateAdded = @dateAdded, IsActive = @activeFlag END In order to call this procedure, we will use ExecuteSqlCommand. This method returns a single value—the number of rows affected by the stored procedure or any other SQL statement. You do not need to capture this value if you are not interested in it, as shown in this code snippet: var sql = @"UpdateCompanies {0}, {1}"; var rowsAffected = context.Database.ExecuteSqlCommand(    sql, DateTime.Now, true); We see that we needed to provide two parameters. We needed to provide them in the exact same order the stored procedure expected them. They are passed into ExecuteSqlCommand as the parameter array, except we did not need to create an array explicitly. Here is how the code looks in VB.NET: Dim sql = "UpdateCompanies {0}, {1}" Dim rowsAffected = context.Database.ExecuteSqlCommand( _    sql, DateTime.Now, True) Entity Framework eliminates the need for stored procedures to a large extent. However, there may still be reasons to use them. Some of the reasons include security standards, legacy database, or efficiency. For example, you may need to update thousands of rows in a single operation and retrieve them through Entity Framework; updating each row at a time and then saving those instances is not efficient. You could also update data inside any stored procedure, even if you call it with the SqlQuery method. Developers can also execute any arbitrary SQL statements, following the exact same technique as stored procedures. Just provide your SQL statement, instead of the stored procedure name to the SqlQuery or ExecuteSqlCommand method. Create, update, and delete entities with stored procedures So far, we have always used the built-in functionality that comes with Entity Framework that generates SQL statements to insert, update, or delete the entities. There are use cases when we would want to use stored procedures to achieve the same result. Developers may have requirements to use stored procedures for security reasons. You may be dealing with an existing database that has these procedures already built in. Entity Framework Code-First now has full support for such updates. We can configure the support for stored procedures using the familiar EntityTypeConfiguration class. We can do so simply by calling the MapToStoredProcedures method. Entity Framework will create stored procedures for us automatically if we let it manage database structures. We can override a stored procedure name or parameter names, if we want to, using appropriate overloads of the MapToStoredProcedures method. Let's use the Company table in our example: public class CompanyMap : EntityTypeConfiguration<Company> { public CompanyMap() {    MapToStoredProcedures(); } } If we just run the code to create or update the database, we will see new procedures created for us, named Company_Insert for an insert operation and similar names for other operations. 
Here is how the same class looks in VB.NET: Public Class CompanyMap Inherits EntityTypeConfiguration(Of Company) Public Sub New()    MapToStoredProcedures() End Sub End Class Here is how we can customize our procedure names if necessary: public class CompanyMap : EntityTypeConfiguration<Company> { public CompanyMap() {    MapToStoredProcedures(config =>      {        config.Delete(          procConfig =>          {            procConfig.HasName("CompanyDelete");            procConfig.Parameter(company => company.CompanyId, "companyId");          });        config.Insert(procConfig => procConfig.HasName("CompanyInsert"));        config.Update(procConfig => procConfig.HasName("CompanyUpdate"));      }); } } In this code, we performed the following: Changed the stored procedure name that deletes a company to CompanyDelete Changed the parameter name that this procedure accepts to companyId and specified that the value comes from the CompanyId property Changed the stored procedure name that performs insert operations on CompanyInsert Changed the stored procedure name that performs updates to CompanyUpdate Here is how the code looks in VB.NET: Public Class CompanyMap Inherits EntityTypeConfiguration(Of Company) Public Sub New()    MapToStoredProcedures( _      Sub(config)        config.Delete(          Sub(procConfig)            procConfig.HasName("CompanyDelete")            procConfig.Parameter(Function(company) company.CompanyId, "companyId")          End Sub        )        config.Insert(Function(procConfig) procConfig.HasName("CompanyInsert"))        config.Update(Function(procConfig) procConfig.HasName("CompanyUpdate"))      End Sub    ) End Sub End Class Of course, if you do not need to customize the names, your code will be much simpler. Summary Entity Framework provides a lot of value to the developers, allowing them to use C# or VB.NET code to manipulate database data. However, sometimes we have to drop a level lower, accessing data a bit more directly through views, dynamic SQL statements and/or stored procedures. We can use the ExecuteSqlCommand method to execute any arbitrary SQL code, including raw SQL or stored procedure. We can use the SqlQuery method to retrieve data from a view, stored procedure, or any other SQL statement, and Entity Framework takes care of materializing the data for us, based on the result type we provide. It is important to follow best practices when providing parameters to those two methods to avoid SQL injection vulnerability. Entity Framework also supports environments where there are requirements to perform all updates to entities via stored procedures. The framework will even write them for us, and we would only need to write one line of code per entity for this type of support, assuming we are happy with naming conventions and coding standards for such procedures. Resources for Article: Further resources on this subject: Developing with Entity Metadata Wrappers [article] Entity Framework DB First – Inheritance Relationships between Entities [article] The .NET Framework Primer [article]


Text Mining with R: Part 1

Robi Sen
16 Mar 2015
7 min read
R is rapidly becoming the platform of choice for programmers, scientists, and others who need to perform statistical analysis and data mining. In part this is because R is incredibly easy to learn and with just a few commands you can perform data mining and analysis functions that would be very hard in more general purpose languages like Ruby, .Net, Java, or C++. To demonstrate R’s ease, flexibility, and power we will look at how to use R to look at a collection of tweets from the 2014 super bowl, clear up data via R, turn that data it a document matrix so we can analyze the data, then create a “word cloud” so we can visualize our analysis to look for interesting words. Getting Started To get started you need to download both R and R studio. R can be found here and RStudio can be found here. R and RStudio are available for most major operating systems and you should follow the up to date installation guides on their respective websites. For this example we are going to be using a data set from Techtunk which is rather large. For this article I have taken a small excerpt of techtrunks SuperBowl 2014, over 8 million tweets, and cleaned it up for the article. You can download it from the original data source here. Finally you will need to install the R packages text mining package (tm ) and word cloud package (wordcloud). You can use standard library method to install the packages or just use RStudio to install the packets. Preparing our Data As already stated you can find the total SuperBowl 2014 dataset. That being said, it's very large and broken up into many sets of Pipe Delimited files, and they have the .csv file extension but are not .csv, which can be somewhat awkward to work with. This though is a common problem when working with large data sets. Luckily the data set is broken up into fragments in that usually when you are working with large datasets you do not want to try to start developing against the whole data set rather a small and workable dataset the will let you quickly develop your scripts without being so large and unwieldy that it delays development. Indeed you will find that working the large files provided by Techtunk can take 10’s of minutes to process as is. In cases like this is good to look at the data, figure out what data you want, take a sample set of data, massage it as needed, and then work in it from there until you have your coding working exactly how you want. In our cases I took a subset of 4600 tweets from one of the pipe delimited files, converted the file format to Commas Separated Value, .csv, and saved it as a sample file to work from. You can do the same thing, you should consider using files smaller than 5000 records, however you would like or use the file created for this post here. Visualizing our Data For this post all we want to do is get a general sense of what the more common words are that are being tweeted during the superbowl. A common way to visualize this is with a word cloud which will show us the frequency of a term by representing it in greater size to other words in comparison to how many times it is mentioned in the body of tweets being analyzed. To do this we need to a few things first with our data. First we need to create read in our file and turn our collection of tweets into a Corpus. In general a Corpus is a large body of text documents. In R’s textming package it’s an object that will be used to hold our tweets in memory. 
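Before building the corpus described above, the two packages mentioned earlier need to be installed and attached; the package names on CRAN are tm and wordcloud:

```r
# One-off installation, then load the libraries in each session
install.packages(c("tm", "wordcloud"))
library(tm)
library(wordcloud)
```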
So to load our tweets as a corpus into R you can do as shown here: # change this file location to suit your machine file_loc <- "yourfilelocation/largerset11.csv" # change TRUE to FALSE if you have no column headings in the CSV myfile <- read.csv(file_loc, header = TRUE, stringsAsFactors=FALSE) require(tm) mycorpus <- Corpus(DataframeSource(myfile[c("username","tweet")])) You can now simply print your Corpus to get a sense of it. > print(mycorpus) <<VCorpus (documents: 4688, metadata (corpus/indexed): 0/0)>> In this case, VCorpus is an automatic assignment meaning that the Corpus is a Volatile object stored in memory. If you want you can make the Corpus permanent using PCorpus. You might do this if you were doing analysis on actual documents such as PDF’s or even databases and in this case R makes pointers to the documents instead of full document structures in memory. Another method you can use to look at your corpus is inspect() which provides a variety of ways to look at the documents in your corpus. For example using: inspect(mycorpus[1,2]) This might give you a result like: > inspect(mycorpus[1:2]) <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>> [[1]] <<PlainTextDocument (metadata: 7)>> sstanley84 wow rt thestanchion news fleury just made fraudulent investment karlsson httptco5oi6iwashg [[2]] <<PlainTextDocument (metadata: 7)>> judemgreen 2 hour wait train superbowl time traffic problemsnice job chris christie As such inspect can be very useful in quickly getting a sense of data in your corpus without having to try to print the whole corpus. Now that we have our corpus in memory let's clean it up a little before we do our analysis. Usually you want to do this on documents that you are analyzing to remove words that are not relevant to your analysis such as “stopwords” or words such as and, like, but, if, the, and the like which you don’t care about. To do this with the textmining package you want to use transforms. Transforms essentially various functions to all documents in a corpus and that the form of tm_map(your corpus, some function). For example we can use tm_map like this: mycorpus <- tm_map(mycorpus, removePunctuation) Which will now remove all the punctuation marks from our tweets. We can do some other transforms to clean up our data by converting all the text to lower case, removing stop words, extra whitespace, and the like. mycorpus <- tm_map(mycorpus, removePunctuation) mycorpus <- tm_map(mycorpus, content_transformer(tolower)) mycorpus <- tm_map(mycorpus, stripWhitespace) mycorpus <- tm_map(mycorpus, removeWords, c(stopwords("english"), "news")) Note the last line. In that case we are using the stopwords() method but also adding our own word to it; news. You could append your own list of stopwords in this manner. Summary In this post we have looked at the basics of doing text mining in R by selecting data, preparing it, cleaning, then performing various operations on it to visualize that data. In the next post, Part 2, we look at a simple use case showing how we can derive real meaning and value from a visualization by seeing how a simple word cloud and help you understand the impact of an advertisement. About the author Robi Sen, CSO at Department 13, is an experienced inventor, serial entrepreneur, and futurist whose dynamic twenty-plus year career in technology, engineering, and research has led him to work on cutting edge projects for DARPA, TSWG, SOCOM, RRTO, NASA, DOE, and the DOD. 
Robi also has extensive experience in the commercial space, including the co-creation of several successful start-up companies. He has worked with companies such as Under Armour, Sony, Cisco, IBM, and many others to help build out new products and services. Robi specializes in bringing his unique vision and thought process to difficult and complex problems, allowing companies and organizations to find innovative solutions that they can rapidly operationalize or take to market.
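For completeness, the document-matrix and word-cloud steps promised in the introduction, which this excerpt does not reach, can be sketched as follows. The function names come from the tm and wordcloud packages, but the parameter choices are illustrative rather than taken from the original post:

```r
# Term-document matrix from the cleaned corpus, then term frequencies
tdm   <- TermDocumentMatrix(mycorpus)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Word cloud of the most frequent terms in the tweets
wordcloud(names(freqs), freqs, max.words = 100, random.order = FALSE)
```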


Pricing the Double-no-touch option

Packt
10 Mar 2015
19 min read
In this article by Balázs Márkus, coauthor of the book Mastering R for Quantitative Finance, you will learn about pricing and life of Double-no-touch (DNT) option. (For more resources related to this topic, see here.) A Double-no-touch (DNT) option is a binary option that pays a fixed amount of cash at expiry. Unfortunately, the fExoticOptions package does not contain a formula for this option at present. We will show two different ways to price DNTs that incorporate two different pricing approaches. In this section, we will call the function dnt1, and for the second approach, we will use dnt2 as the name for the function. Hui (1996) showed how a one-touch double barrier binary option can be priced. In his terminology, "one-touch" means that a single trade is enough to trigger the knock-out event, and "double barrier" binary means that there are two barriers and this is a binary option. We call this DNT as it is commonly used on the FX markets. This is a good example for the fact that many popular exotic options are running under more than one name. In Haug (2007a), the Hui-formula is already translated into the generalized framework. S, r, b, s, and T have the same meaning. K means the payout (dollar amount) while L and U are the lower and upper barriers. Where Implementing the Hui (1996) function to R starts with a big question mark: what should we do with an infinite sum? How high a number should we substitute as infinity? Interestingly, for practical purposes, small number like 5 or 10 could often play the role of infinity rather well. Hui (1996) states that convergence is fast most of the time. We are a bit skeptical about this since a will be used as an exponent. If b is negative and sigma is small enough, the (S/L)a part in the formula could turn out to be a problem. First, we will try with normal parameters and see how quick the convergence is: dnt1 <- function(S, K, U, L, sigma, T, r, b, N = 20, ploterror = FALSE){    if ( L > S | S > U) return(0)    Z <- log(U/L)    alpha <- -1/2*(2*b/sigma^2 - 1)    beta <- -1/4*(2*b/sigma^2 - 1)^2 - 2*r/sigma^2    v <- rep(0, N)    for (i in 1:N)        v[i] <- 2*pi*i*K/(Z^2) * (((S/L)^alpha - (-1)^i*(S/U)^alpha ) /            (alpha^2+(i*pi/Z)^2)) * sin(i*pi/Z*log(S/L)) *              exp(-1/2 * ((i*pi/Z)^2-beta) * sigma^2*T)    if (ploterror) barplot(v, main = "Formula Error");    sum(v) } print(dnt1(100, 10, 120, 80, 0.1, 0.25, 0.05, 0.03, 20, TRUE)) The following screenshot shows the result of the preceding code: The Formula Error chart shows that after the seventh step, additional steps were not influencing the result. This means that for practical purposes, the infinite sum can be quickly estimated by calculating only the first seven steps. This looks like a very quick convergence indeed. However, this could be pure luck or coincidence. What about decreasing the volatility down to 3 percent? We have to set N as 50 to see the convergence: print(dnt1(100, 10, 120, 80, 0.03, 0.25, 0.05, 0.03, 50, TRUE)) The preceding command gives the following output: Not so impressive? 50 steps are still not that bad. What about decreasing the volatility even lower? At 1 percent, the formula with these parameters simply blows up. First, this looks catastrophic; however, the price of a DNT was already 98.75 percent of the payout when we used 3 percent volatility. 
Logic says that the DNT price should be a monotone-decreasing function of volatility, so we already know that the price of the DNT should be worth at least 98.75 percent if volatility is below 3 percent. Another issue is that if we choose an extreme high U or extreme low L, calculation errors emerge. However, similar to the problem with volatility, common sense helps here too; the price of a DNT should increase if we make U higher or L lower. There is still another trick. Since all the problem comes from the a parameter, we can try setting b as 0, which will make a equal to 0.5. If we also set r to 0, the price of a DNT converges into 100 percent as the volatility drops. Anyway, whenever we substitute an infinite sum by a finite sum, it is always good to know when it will work and when it will not. We made a new code that takes into consideration that convergence is not always quick. The trick is that the function calculates the next step as long as the last step made any significant change. This is still not good for all the parameters as there is no cure for very low volatility, except that we accept the fact that if implied volatilities are below 1 percent, than this is an extreme market situation in which case DNT options should not be priced by this formula: dnt1 <- function(S, K, U, L, sigma, Time, r, b) { if ( L > S | S > U) return(0) Z <- log(U/L) alpha <- -1/2*(2*b/sigma^2 - 1) beta <- -1/4*(2*b/sigma^2 - 1)^2 - 2*r/sigma^2 p <- 0 i <- a <- 1 while (abs(a) > 0.0001){    a <- 2*pi*i*K/(Z^2) * (((S/L)^alpha - (-1)^i*(S/U)^alpha ) /      (alpha^2 + (i *pi / Z)^2) ) * sin(i * pi / Z * log(S/L)) *        exp(-1/2*((i*pi/Z)^2-beta) * sigma^2 * Time)    p <- p + a    i <- i + 1 } p } Now that we have a nice formula, it is possible to draw some DNT-related charts to get more familiar with this option. Later, we will use a particular AUDUSD DNT option with the following parameters: L equal to 0.9200, U equal to 0.9600, K (payout) equal to USD 1 million, T equal to 0.25 years, volatility equal to 6 percent, r_AUD equal to 2.75 percent, r_USD equal to 0.25 percent, and b equal to -2.5 percent. We will calculate and plot all the possible values of this DNT from 0.9200 to 0.9600; each step will be one pip (0.0001), so we will use 2,000 steps. The following code plots a graph of price of underlying: x <- seq(0.92, 0.96, length = 2000) y <- z <- rep(0, 2000) for (i in 1:2000){    y[i] <- dnt1(x[i], 1e6, 0.96, 0.92, 0.06, 0.25, 0.0025, -0.0250)    z[i] <- dnt1(x[i], 1e6, 0.96, 0.92, 0.065, 0.25, 0.0025, -0.0250) } matplot(x, cbind(y,z), type = "l", lwd = 2, lty = 1,    main = "Price of a DNT with volatility 6% and 6.5% ", cex.main = 0.8, xlab = "Price of underlying" ) The following output is the result of the preceding code: It can be clearly seen that even a small change in volatility can have a huge impact on the price of a DNT. Looking at this chart is an intuitive way to find that vega must be negative. Interestingly enough even just taking a quick look at this chart can convince us that the absolute value of vega is decreasing if we are getting closer to the barriers. Most end users think that the biggest risk is when the spot is getting close to the trigger. This is because end users really think about binary options in a binary way. As long as the DNT is alive, they focus on the positive outcome. However, for a dynamic hedger, the risk of a DNT is not that interesting when the value of the DNT is already small. 
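The monotone relationship claimed above is easy to verify directly with dnt1; the following quick check is illustrative and not part of the book's code:

```r
# DNT value at S = 0.94 for increasing volatilities -- the price falls as volatility rises
vols <- seq(0.03, 0.12, by = 0.03)
sapply(vols, function(v) dnt1(0.94, 1e6, 0.96, 0.92, v, 0.25, 0.0025, -0.025))
```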
It is also very interesting that since the T-Bill price is independent of the volatility and since the DNT + DOT = T-Bill equation holds, an increasing volatility will decrease the price of the DNT by the exact same amount just like it will increase the price of the DOT. It is not surprising that the vega of the DOT should be the exact mirror of the DNT. We can use the GetGreeks function to estimate vega, gamma, delta, and theta. For gamma we can use the GetGreeks function in the following way: GetGreeks <- function(FUN, arg, epsilon,...) {    all_args1 <- all_args2 <- list(...)    all_args1[[arg]] <- as.numeric(all_args1[[arg]] + epsilon)    all_args2[[arg]] <- as.numeric(all_args2[[arg]] - epsilon)    (do.call(FUN, all_args1) -        do.call(FUN, all_args2)) / (2 * epsilon) } Gamma <- function(FUN, epsilon, S, ...) {    arg1 <- list(S, ...)    arg2 <- list(S + 2 * epsilon, ...)    arg3 <- list(S - 2 * epsilon, ...)    y1 <- (do.call(FUN, arg2) - do.call(FUN, arg1)) / (2 * epsilon)    y2 <- (do.call(FUN, arg1) - do.call(FUN, arg3)) / (2 * epsilon)  (y1 - y2) / (2 * epsilon) } x = seq(0.9202, 0.9598, length = 200) delta <- vega <- theta <- gamma <- rep(0, 200)   for(i in 1:200){ delta[i] <- GetGreeks(FUN = dnt1, arg = 1, epsilon = 0.0001,    x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.02, -0.02) vega[i] <-   GetGreeks(FUN = dnt1, arg = 5, epsilon = 0.0005,    x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.0025, -0.025) theta[i] <- - GetGreeks(FUN = dnt1, arg = 6, epsilon = 1/365,    x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.0025, -0.025) gamma[i] <- Gamma(FUN = dnt1, epsilon = 0.0001, S = x[i], K =    1e6, U = 0.96, L = 0.92, sigma = 0.06, Time = 0.5, r = 0.02, b = -0.02) }   windows() plot(x, vega, type = "l", xlab = "S",ylab = "", main = "Vega") The following chart is the result of the preceding code: After having a look at the value chart, the delta of a DNT is also very close to intuitions; if we are coming close to the higher barrier, our delta gets negative, and if we are coming closer to the lower barrier, the delta gets positive as follows: windows() plot(x, delta, type = "l", xlab = "S",ylab = "", main = "Delta") This is really a non-convex situation; if we would like to do a dynamic delta hedge, we will lose money for sure. If the spot price goes up, the delta of the DNT decreases, so we should buy some AUDUSD as a hedge. However, if the spot price goes down, we should sell some AUDUSD. Imagine a scenario where AUDUSD goes up 20 pips in the morning and then goes down 20 pips in the afternoon. For a dynamic hedger, this means buying some AUDUSD after the price moved up and selling this very same amount after the price comes down. The changing of the delta can be described by the gamma as follows: windows() plot(x, gamma, type = "l", xlab = "S",ylab = "", main = "Gamma") Negative gamma means that if the spot goes up, our delta is decreasing, but if the spot goes down, our delta is increasing. This doesn't sound great. For this inconvenient non-convex situation, there is some compensation, that is, the value of theta is positive. If nothing happens, but one day passes, the DNT will automatically worth more. Here, we use theta as minus 1 times the partial derivative, since if (T-t) is the time left, we check how the value changes as t increases by one day: windows() plot(x, theta, type = "l", xlab = "S",ylab = "", main = "Theta") The more negative the gamma, the more positive our theta. This is how time compensates for the potential losses generated by the negative gamma. 
Risk-neutral pricing also implicates that negative gamma should be compensated by a positive theta. This is the main message of the Black-Scholes framework for vanilla options, but this is also true for exotics. See Taleb (1997) and Wilmott (2006). We already introduced the Black-Scholes surface before; now, we can go into more detail. This surface is also a nice interpretation of how theta and delta work. It shows the price of an option for different spot prices and times to maturity, so the slope of this surface is the theta for one direction and delta for the other. The code for this is as follows: BS_surf <- function(S, Time, FUN, ...) { n <- length(S) k <- length(Time) m <- matrix(0, n, k) for (i in 1:n) {    for (j in 1:k) {      l <- list(S = S[i], Time = Time[j], ...)      m[i,j] <- do.call(FUN, l)      } } persp3D(z = m, xlab = "underlying", ylab = "Time",    zlab = "option price", phi = 30, theta = 30, bty = "b2") } BS_surf(seq(0.92,0.96,length = 200), seq(1/365, 1/48, length = 200), dnt1, K = 1000000, U = 0.96, L = 0.92, r = 0.0025, b = -0.0250,    sigma = 0.2) The preceding code gives the following output: We can see what was already suspected; DNT likes when time is passing and the spot is moving to the middle of the (L,U) interval. Another way to price the Double-no-touch option Static replication is always the most elegant way of pricing. The no-arbitrage argument will let us say that if, at some time in the future, two portfolios have the same value for sure, then their price should be equal any time before this. We will show how double-knock-out (DKO) options could be used to build a DNT. We will need to use a trick; the strike price could be the same as one of the barriers. For a DKO call, the strike price should be lower than the upper barrier because if the strike price is not lower than the upper barrier, the DKO call would be knocked out before it could become in-the-money, so in this case, the option would be worthless as nobody can ever exercise it in-the-money. However, we can choose the strike price to be equal to the lower barrier. For a put, the strike price should be higher than the lower barrier, so why not make it equal to the upper barrier. This way, the DKO call and DKO put option will have a very convenient feature; if they are still alive, they will both expiry in-the-money. Now, we are almost done. We just have to add the DKO prices, and we will get a DNT that has a payout of (U-L) dollars. Since DNT prices are linear in the payout, we only have to multiply the result by K*(U-L): dnt2 <- function(S, K, U, L, sigma, T, r, b) {      a <- DoubleBarrierOption("co", S, L, L, U, T, r, b, sigma, 0,        0,title = NULL, description = NULL)    z <- a@price    b <- DoubleBarrierOption("po", S, U, L, U, T, r, b, sigma, 0,        0,title = NULL, description = NULL)    y <- b@price    (z + y) / (U - L) * K } Now, we have two functions for which we can compare the results: dnt1(0.9266, 1000000, 0.9600, 0.9200, 0.06, 0.25, 0.0025, -0.025) [1] 48564.59   dnt2(0.9266, 1000000, 0.9600, 0.9200, 0.06, 0.25, 0.0025, -0.025) [1] 48564.45 For a DNT with a USD 1 million contingent payout and an initial market value of over 48,000 dollars, it is very nice to see that the difference in the prices is only 14 cents. Technically, however, having a second pricing function is not a big help since low volatility is also an issue for dnt2. We will use dnt1 for the rest of the article. 
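Before moving on, the agreement between the two pricers can also be checked over a range of spot prices rather than at a single point. This is an optional sanity check (it assumes the fExoticOptions package is loaded, since dnt2 relies on DoubleBarrierOption):

```r
# Dollar difference between the two pricing approaches across the barrier range
spots <- seq(0.925, 0.955, by = 0.005)
diffs <- sapply(spots, function(s)
  dnt1(s, 1e6, 0.96, 0.92, 0.06, 0.25, 0.0025, -0.025) -
  dnt2(s, 1e6, 0.96, 0.92, 0.06, 0.25, 0.0025, -0.025))
round(diffs, 2)
```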
The life of a Double-no-touch option – a simulation How has the DNT price been evolving during the second quarter of 2014? We have the open-high-low-close type time series with five minute frequency for AUDUSD, so we know all the extreme prices: d <- read.table("audusd.csv", colClasses = c("character", rep("numeric",5)), sep = ";", header = TRUE) underlying <- as.vector(t(d[, 2:5])) t <- rep( d[,6], each = 4) n <- length(t) option_price <- rep(0, n)   for (i in 1:n) { option_price[i] <- dnt1(S = underlying[i], K = 1000000,    U = 0.9600, L = 0.9200, sigma = 0.06, T = t[i]/(60*24*365),      r = 0.0025, b = -0.0250) } a <- min(option_price) b <- max(option_price) option_price_transformed = (option_price - a) * 0.03 / (b - a) + 0.92   par(mar = c(6, 3, 3, 5)) matplot(cbind(underlying,option_price_transformed), type = "l",    lty = 1, col = c("grey", "red"),    main = "Price of underlying and DNT",    xaxt = "n", yaxt = "n", ylim = c(0.91,0.97),    ylab = "", xlab = "Remaining time") abline(h = c(0.92, 0.96), col = "green") axis(side = 2, at = pretty(option_price_transformed),    col.axis = "grey", col = "grey") axis(side = 4, at = pretty(option_price_transformed),    labels = round(seq(a/1000,1000,length = 7)), las = 2,    col = "red", col.axis = "red") axis(side = 1, at = seq(1,n, length=6),    labels = round(t[round(seq(1,n, length=6))]/60/24)) The following is the output for the preceding code: The price of a DNT is shown in red on the right axis (divided by 1000), and the actual AUDUSD price is shown in grey on the left axis. The green lines are the barriers of 0.9200 and 0.9600. The chart shows that in 2014 Q2, the AUDUSD currency pair was traded inside the (0.9200; 0.9600) interval; thus, the payout of the DNT would have been USD 1 million. This DNT looks like a very good investment; however, reality is just one trajectory out of an a priori almost infinite set. It could have happened differently. For example, on May 02, 2014, there were still 59 days left until expiry, and AUDUSD was traded at 0.9203, just three pips away from the lower barrier. At this point, the price of this DNT was only USD 5,302 dollars which is shown in the following code: dnt1(0.9203, 1000000, 0.9600, 0.9200, 0.06, 59/365, 0.0025, -0.025) [1] 5302.213 Compare this USD 5,302 to the initial USD 48,564 option price! In the following simulation, we will show some different trajectories. All of them start from the same 0.9266 AUDUSD spot price as it was on the dawn of April 01, and we will see how many of them stayed inside the (0.9200; 0.9600) interval. 
To make it simple, we will simulate geometric Brown motions by using the same 6 percent volatility as we used to price the DNT: library(matrixStats) DNT_sim <- function(S0 = 0.9266, mu = 0, sigma = 0.06, U = 0.96, L = 0.92, N = 5) {    dt <- 5 / (365 * 24 * 60)    t <- seq(0, 0.25, by = dt)    Time <- length(t)      W <- matrix(rnorm((Time - 1) * N), Time - 1, N)    W <- apply(W, 2, cumsum)    W <- sqrt(dt) * rbind(rep(0, N), W)    S <- S0 * exp((mu - sigma^2 / 2) * t + sigma * W )    option_price <- matrix(0, Time, N)      for (i in 1:N)        for (j in 1:Time)          option_price[j,i] <- dnt1(S[j,i], K = 1000000, U, L, sigma,              0.25-t[j], r = 0.0025,                b = -0.0250)*(min(S[1:j,i]) > L & max(S[1:j,i]) < U)      survivals <- sum(option_price[Time,] > 0)    dev.new(width = 19, height = 10)      par(mfrow = c(1,2))    matplot(t,S, type = "l", main = "Underlying price",        xlab = paste("Survived", survivals, "from", N), ylab = "")    abline( h = c(U,L), col = "blue")    matplot(t, option_price, type = "l", main = "DNT price",        xlab = "", ylab = "")} set.seed(214) system.time(DNT_sim()) The following is the output for the preceding code: Here, the only surviving trajectory is the red one; in all other cases, the DNT hits either the higher or the lower barrier. The line set.seed(214) grants that this simulation will look the same anytime we run this. One out of five is still not that bad; it would suggest that for an end user or gambler who does no dynamic hedging, this option has an approximate value of 20 percent of the payout (especially since the interest rates are low, the time value of money is not important). However, five trajectories are still too few to jump to such conclusions. We should check the DNT survivorship ratio for a much higher number of trajectories. The ratio of the surviving trajectories could be a good estimator of the a priori real-world survivorship probability of this DNT; thus, the end user value of it. Before increasing N rapidly, we should keep in mind how much time this simulation took. For my computer, it took 50.75 seconds for N = 5, and 153.11 seconds for N = 15. The following is the output for N = 15: Now, 3 out of 15 survived, so the estimated survivorship ratio is still 3/15, which is equal to 20 percent. Looks like this is a very nice product; the price is around 5 percent of the payout, while 20 percent is the estimated survivorship ratio. Just out of curiosity, run the simulation for N equal to 200. This should take about 30 minutes. The following is the output for N = 200: The results are shocking; now, only 12 out of 200 survive, and the ratio is only 6 percent! So to get a better picture, we should run the simulation for a larger N. The movie Whatever Works by Woody Allen (starring Larry David) is 92 minutes long; in simulation time, that is N = 541. For this N = 541, there are only 38 surviving trajectories, resulting in a survivorship ratio of 7 percent. What is the real expected survivorship ratio? Is it 20 percent, 6 percent, or 7 percent? We simply don't know at this point. Mathematicians warn us that the law of large numbers requires large numbers, where large is much more than 541, so it would be advisable to run this simulation for as large an N as time allows. Of course, getting a better computer also helps to do more N during the same time. Anyway, from this point of view, Hui's (1996) relatively fast converging DNT pricing formula gets some respect. 
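If only the survivorship ratio is of interest, the expensive repricing of the option at every time step in DNT_sim can be dropped, which makes a much larger N practical. The sketch below reuses the same geometric Brownian motion parameterization; DNT_survival is a helper name introduced here, not a function from the book:

```r
DNT_survival <- function(S0 = 0.9266, mu = 0, sigma = 0.06,
                         U = 0.96, L = 0.92, N = 1000) {
  dt <- 5 / (365 * 24 * 60)
  t  <- seq(0, 0.25, by = dt)
  survived <- logical(N)
  for (i in 1:N) {
    W <- c(0, cumsum(rnorm(length(t) - 1))) * sqrt(dt)
    S <- S0 * exp((mu - sigma^2 / 2) * t + sigma * W)
    # The DNT pays out only if the path never touches either barrier
    survived[i] <- min(S) > L && max(S) < U
  }
  mean(survived)
}

set.seed(214)
DNT_survival()   # estimated payout probability; increase N as time allows
```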
Summary We started this article by introducing exotic options. In a brief theoretical summary, we explained how exotics are linked together. There are many types of exotics. We showed one possible way of classification that is consistent with the fExoticOptions package. We showed how the Black-Scholes surface (a 3D chart that contains the price of a derivative dependent on time and the underlying price) can be constructed for any pricing function. Resources for Article: Further resources on this subject: What is Quantitative Finance? [article] Learning Option Pricing [article] Derivatives Pricing [article]


Learning Random Forest Using Mahout

Packt
05 Mar 2015
11 min read
In this article by Ashish Gupta, author of the book Learning Apache Mahout Classification, we will learn about Random forest, which is one of the most popular techniques in classification. It starts with a machine learning technique called decision tree. In this article, we will explore the following topics: Decision tree Random forest Using Mahout for Random forest (For more resources related to this topic, see here.) Decision tree A decision tree is used for classification and regression problems. In simple terms, it is a predictive model that uses binary rules to calculate the target variable. In a decision tree, we use an iterative process of splitting the data into partitions, then we split it further on branches. As in other classification model creation processes, we start with the training dataset in which target variables or class labels are defined. The algorithm tries to break all the records in training datasets into two parts based on one of the explanatory variables. The partitioning is then applied to each new partition, and this process is continued until no more partitioning can be done. The core of the algorithm is to find out the rule that determines the initial split. There are algorithms to create decision trees, such as Iterative Dichotomiser 3 (ID3), Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID), and so on. A good explanation for ID3 can be found at http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html. Forming the explanatory variables to choose the best splitter in a node, the algorithm considers each variable in turn. Every possible split is considered and tried, and the best split is the one that produces the largest decrease in diversity of the classification label within each partition. This is repeated for all variables, and the winner is chosen as the best splitter for that node. The process is continued in the next node until we reach a node where we can make the decision. We create a decision tree from a training dataset so it can suffer from the overfitting problem. This behavior creates a problem with real datasets. To improve this situation, a process called pruning is used. In this process, we remove the branches and leaves of the tree to improve the performance. Algorithms used to build the tree work best at the starting or root node since all the information is available there. Later on, with each split, data is less and towards the end of the tree, a particular node can show patterns that are related to the set of data which is used to split. These patterns create problems when we use them to predict the real dataset. Pruning methods let the tree grow and remove the smaller branches that fail to generalize. Now take an example to understand the decision tree. Consider we have a iris flower dataset. This dataset is hugely popular in the machine learning field. It was introduced by Sir Ronald Fisher. It contains 50 samples from each of three species of iris flower (Iris setosa, Iris virginica, and Iris versicolor). The four explanatory variables are the length and width of the sepals and petals in centimeters, and the target variable is the class to which the flower belongs. As you can see in the preceding diagram, all the groups were earlier considered as Sentosa species and then the explanatory variable and petal length were further used to divide the groups. At each step, the calculation for misclassified items was also done, which shows how many items were wrongly classified. 
Moreover, the petal width variable was taken into account. Usually, items at leaf nodes are correctly classified. Random forest The Random forest algorithm was developed by Leo Breiman and Adele Cutler. Random forests grow many classification trees. They are an ensemble learning method for classification and regression that constructs a number of decision trees at training time and also outputs the class that is the mode of the classes outputted by individual trees. Single decision trees show the bias–variance tradeoff. So they usually have high variance or high bias. The following are the parameters in the algorithm: Bias: This is an error caused by an erroneous assumption in the learning algorithm Variance: This is an error that ranges from sensitivity to small fluctuations in the training set Random forests attempt to mitigate this problem by averaging to find a natural balance between two extremes. A Random forest works on the idea of bagging, which is to average noisy and unbiased models to create a model with low variance. A Random forest algorithm works as a large collection of decorrelated decision trees. To understand the idea of a Random forest algorithm, let's work with an example. Consider we have a training dataset that has lots of features (explanatory variables) and target variables or classes: We create a sample set from the given dataset: A different set of random features were taken into account to create the random sub-dataset. Now, from these sub-datasets, different decision trees will be created. So actually we have created a forest of the different decision trees. Using these different trees, we will create a ranking system for all the classifiers. To predict the class of a new unknown item, we will use all the decision trees and separately find out which class these trees are predicting. See the following diagram for a better understanding of this concept: Different decision trees to predict the class of an unknown item In this particular case, we have four different decision trees. We predict the class of an unknown dataset with each of the trees. As per the preceding figure, the first decision tree provides class 2 as the predicted class, the second decision tree predicts class 5, the third decision tree predicts class 5, and the fourth decision tree predicts class 3. Now, a Random forest will vote for each class. So we have one vote each for class 2 and class 3 and two votes for class 5. Therefore, it has decided that for the new unknown dataset, the predicted class is class 5. So the class that gets a higher vote is decided for the new dataset. A Random forest has a lot of benefits in classification and a few of them are mentioned in the following list: Combination of learning models increases the accuracy of the classification Runs effectively on large datasets as well The generated forest can be saved and used for other datasets as well Can handle a large amount of explanatory variables Now that we have understood the Random forest theoretically, let's move on to Mahout and use the Random forest algorithm, which is available in Apache Mahout. Using Mahout for Random forest Mahout has implementation for the Random forest algorithm. It is very easy to understand and use. So let's get started. Dataset We will use the NSL-KDD dataset. Since 1999, KDD'99 has been the most widely used dataset for the evaluation of anomaly detection methods. This dataset is prepared by S. J. Stolfo and is built based on the data captured in the DARPA'98 IDS evaluation program (R. P. 
Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, and M. A. Zissman, "Evaluating intrusion detection systems: The 1998 darpa off-line intrusion detection evaluation," discex, vol. 02, p. 1012, 2000). DARPA'98 is about 4 GB of compressed raw (binary) tcp dump data of 7 weeks of network traffic, which can be processed into about 5 million connection records, each with about 100 bytes. The two weeks of test data have around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. NSL-KDD is a dataset suggested to solve some of the inherent problems of the KDD'99 dataset. You can download this dataset from http://nsl.cs.unb.ca/NSL-KDD/. We will download the KDDTrain+_20Percent.ARFF and KDDTest+.ARFF datasets. In KDDTrain+_20Percent.ARFF and KDDTest+.ARFF, remove the first 44 lines (that is, all lines starting with @attribute). If this is not done, we will not be able to generate a descriptor file. Steps to use the Random forest algorithm in Mahout The steps to implement the Random forest algorithm in Apache Mahout are as follows: Transfer the test and training datasets to hdfs using the following commands: hadoop fs -mkdir /user/hue/KDDTrainhadoop fs -mkdir /user/hue/KDDTesthadoop fs –put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrainhadoop fs –put /tmp/KDDTest+.arff /user/hue/KDDTest Generate the descriptor file. Before you build a Random forest model based on the training data in KDDTrain+.arff, a descriptor file is required. This is because all information in the training dataset needs to be labeled. From the labeled dataset, the algorithm can understand which one is numerical and categorical. Use the following command to generate descriptor file: hadoop jar $MAHOUT_HOME/core/target/mahout-core-xyz.job.jarorg.apache.mahout.classifier.df.tools.Describe-p /user/hue/KDDTrain/KDDTrain+_20Percent.arff-f /user/hue/KDDTrain/KDDTrain+.info-d N 3 C 2 N C 4 N C 8 N 2 C 19 N L Jar: Mahout core jar (xyz stands for version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class Describe is used here and it takes three parameters: The p path for the data to be described. The f location for the generated descriptor file. d is the information for the attribute on the data. N 3 C 2 N C 4 N C 8 N 2 C 19 N L defines that the dataset is starting with a numeric (N), followed by three categorical attributes, and so on. In the last, L defines the label. The output of the previous command is shown in the following screenshot: Build the Random forest using the following command: hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest-Dmapred.max.split.size=1874231 -d /user/hue/KDDTrain/KDDTrain+_20Percent.arff-ds /user/hue/KDDTrain/KDDTrain+.info-sl 5 -p -t 100 –o /user/hue/ nsl-forest Jar: Mahout example jar (xyz stands for version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class build forest is used to build the forest with other arguments, which are shown in the following list: Dmapred.max.split.size indicates to Hadoop the maximum size of each partition. d stands for the data path. ds stands for the location of the descriptor file. 
sl is a variable to select randomly at each tree node. Here, each tree is built using five randomly selected attributes per node. p uses partial data implementation. t stands for the number of trees to grow. Here, the commands build 100 trees using partial implementation. o stands for the output path that will contain the decision forest. In the end, the process will show the following result: Use this model to classify the new dataset: hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest-i /user/hue/KDDTest/KDDTest+.arff-ds /user/hue/KDDTrain/KDDTrain+.info -m /user/hue/nsl-forest -a –mr-o /user/hue/predictions Jar: Mahout example jar (xyz stands for version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The class to test the forest has the following parameters: I indicates the path for the test data ds stands for the location of the descriptor file m stands for the location of the generated forest from the previous command a informs to run the analyzer to compute the confusion matrix mr informs Hadoop to distribute the classification o stands for the location to store the predictions in The job provides the following confusion matrix: So, from the confusion matrix, it is clear that 9,396 instances were correctly classified and 315 normal instances were incorrectly classified as anomalies. And the accuracy percentage is 77.7635 (correctly classified instances by the model / classified instances). The output file in the prediction folder contains the list where 0 and 1. 0 defines the normal dataset and 1 defines the anomaly. Summary In this article, we discussed the Random forest algorithm. We started our discussion by understanding the decision tree and continued with an understanding of the Random forest. We took up the NSL-KDD dataset, which is used to build predictive systems for cyber security. We used Mahout to build the Random forest tree, and used it with the test dataset and generated the confusion matrix and other statistics for the output. Resources for Article: Further resources on this subject: Implementing the Naïve Bayes classifier in Mahout [article] About Cassandra [article] Tuning Solr JVM and Container [article]
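As a footnote to the Random forest discussion above, the voting step in the class-prediction example (four trees predicting classes 2, 5, 5, and 3) is simple enough to show in a few lines. This is a toy illustration in plain Java, not Mahout API code:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class MajorityVote {
    // Return the class label that receives the most votes from the individual trees
    static int vote(int[] treePredictions) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int p : treePredictions) {
            counts.merge(p, 1, Integer::sum);
        }
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        System.out.println(vote(new int[]{2, 5, 5, 3}));   // prints 5
    }
}
```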
Hadoop and MapReduce

Packt
05 Mar 2015
43 min read
In this article by the author, Thilina Gunarathne, of the book, Hadoop MapReduce v2 Cookbook - Second Edition, we will learn about Hadoop and MapReduce. We are living in the era of big data, where the exponential growth of phenomena such as the web, social networking, and smartphones is producing petabytes of data on a daily basis. Gaining insights from analyzing these very large amounts of data has become a must-have competitive advantage for many industries. However, the size and the possibly unstructured nature of these data sources make it impossible to use traditional solutions such as relational databases to store and analyze these datasets. (For more resources related to this topic, see here.) Storing, processing, and analyzing petabytes of data in a meaningful and timely manner require many compute nodes with thousands of disks and thousands of processors, together with the ability to efficiently communicate massive amounts of data among them. Such a scale makes failures such as disk failures, compute node failures, and network failures a common occurrence, making fault tolerance a very important aspect of such systems. Other common challenges that arise include the significant cost of resources, handling communication latencies, handling heterogeneous compute resources, synchronization across nodes, and load balancing. As you can infer, developing and maintaining distributed parallel applications to process massive amounts of data while handling all these issues is not an easy task. This is where Apache Hadoop comes to our rescue. Google is one of the first organizations to face the problem of processing massive amounts of data. Google built a framework for large-scale data processing borrowing the map and reduce paradigms from the functional programming world and named it MapReduce. At the foundation of Google MapReduce was the Google File System, which is a high-throughput parallel filesystem that enables the reliable storage of massive amounts of data using commodity computers. Seminal research publications that introduced the Google MapReduce and Google File System concepts can be found at http://research.google.com/archive/mapreduce.html and http://research.google.com/archive/gfs.html. Apache Hadoop MapReduce is the most widely known and widely used open source implementation of the Google MapReduce paradigm. Apache Hadoop Distributed File System (HDFS) provides an open source implementation of the Google File System concept. Apache Hadoop MapReduce, HDFS, and YARN provide a scalable, fault-tolerant, distributed platform for storage and processing of very large datasets across clusters of commodity computers. Unlike traditional High Performance Computing (HPC) clusters, Hadoop uses the same set of compute nodes for data storage as well as to perform the computations, allowing Hadoop to improve the performance of large-scale computations by collocating computations with the storage. Also, the hardware cost of a Hadoop cluster is orders of magnitude cheaper than HPC clusters and database appliances due to the usage of commodity hardware and commodity interconnects. Together, Hadoop-based frameworks have become the de facto standard for storing and processing big data. Hadoop Distributed File System – HDFS HDFS is a block-structured distributed filesystem that is designed to store petabytes of data reliably on compute clusters made out of commodity hardware.
HDFS overlays on top of the existing filesystem of the compute nodes and stores files by breaking them into coarser grained blocks (for example, 128 MB). HDFS performs better with large files. HDFS distributes the data blocks of large files across to all the nodes of the cluster to facilitate the very high parallel aggregate read bandwidth when processing the data. HDFS also stores redundant copies of these data blocks in multiple nodes to ensure reliability and fault tolerance. Data processing frameworks such as MapReduce exploit these distributed sets of data blocks and the redundancy to maximize the data local processing of large datasets, where most of the data blocks would get processed locally in the same physical node as they are stored. HDFS consists of NameNode and DataNode services providing the basis for the distributed filesystem. NameNode stores, manages, and serves the metadata of the filesystem. NameNode does not store any real data blocks. DataNode is a per node service that manages the actual data block storage in the DataNodes. When retrieving data, client applications first contact the NameNode to get the list of locations the requested data resides in and then contact the DataNodes directly to retrieve the actual data. The following diagram depicts a high-level overview of the structure of HDFS: Hadoop v2 brings in several performance, scalability, and reliability improvements to HDFS. One of the most important among those is the High Availability (HA) support for the HDFS NameNode, which provides manual and automatic failover capabilities for the HDFS NameNode service. This solves the widely known NameNode single point of failure weakness of HDFS. Automatic NameNode high availability of Hadoop v2 uses Apache ZooKeeper for failure detection and for active NameNode election. Another important new feature is the support for HDFS federation. HDFS federation enables the usage of multiple independent HDFS namespaces in a single HDFS cluster. These namespaces would be managed by independent NameNodes, but share the DataNodes of the cluster to store the data. The HDFS federation feature improves the horizontal scalability of HDFS by allowing us to distribute the workload of NameNodes. Other important improvements of HDFS in Hadoop v2 include the support for HDFS snapshots, heterogeneous storage hierarchy support (Hadoop 2.3 or higher), in-memory data caching support (Hadoop 2.3 or higher), and many performance improvements. Almost all the Hadoop ecosystem data processing technologies utilize HDFS as the primary data storage. HDFS can be considered as the most important component of the Hadoop ecosystem due to its central nature in the Hadoop architecture. Hadoop YARN YARN (Yet Another Resource Negotiator) is the major new improvement introduced in Hadoop v2. YARN is a resource management system that allows multiple distributed processing frameworks to effectively share the compute resources of a Hadoop cluster and to utilize the data stored in HDFS. YARN is a central component in the Hadoop v2 ecosystem and provides a common platform for many different types of distributed applications. The batch processing based MapReduce framework was the only natively supported data processing framework in Hadoop v1. 
While MapReduce works well for analyzing large amounts of data, MapReduce by itself is not sufficient enough to support the growing number of other distributed processing use cases such as real-time data computations, graph computations, iterative computations, and real-time data queries. The goal of YARN is to allow users to utilize multiple distributed application frameworks that provide such capabilities side by side sharing a single cluster and the HDFS filesystem. Some examples of the current YARN applications include the MapReduce framework, Tez high performance processing framework, Spark processing engine, and the Storm real-time stream processing framework. The following diagram depicts the high-level architecture of the YARN ecosystem: The YARN ResourceManager process is the central resource scheduler that manages and allocates resources to the different applications (also known as jobs) submitted to the cluster. YARN NodeManager is a per node process that manages the resources of a single compute node. Scheduler component of the ResourceManager allocates resources in response to the resource requests made by the applications, taking into consideration the cluster capacity and the other scheduling policies that can be specified through the YARN policy plugin framework. YARN has a concept called containers, which is the unit of resource allocation. Each allocated container has the rights to a certain amount of CPU and memory in a particular compute node. Applications can request resources from YARN by specifying the required number of containers and the CPU and memory required by each container. ApplicationMaster is a per-application process that coordinates the computations for a single application. The first step of executing a YARN application is to deploy the ApplicationMaster. After an application is submitted by a YARN client, the ResourceManager allocates a container and deploys the ApplicationMaster for that application. Once deployed, the ApplicationMaster is responsible for requesting and negotiating the necessary resource containers from the ResourceManager. Once the resources are allocated by the ResourceManager, ApplicationMaster coordinates with the NodeManagers to launch and monitor the application containers in the allocated resources. The shifting of application coordination responsibilities to the ApplicationMaster reduces the burden on the ResourceManager and allows it to focus solely on managing the cluster resources. Also having separate ApplicationMasters for each submitted application improves the scalability of the cluster as opposed to having a single process bottleneck to coordinate all the application instances. The following diagram depicts the interactions between various YARN components, when a MapReduce application is submitted to the cluster: While YARN supports many different distributed application execution frameworks, our focus in this article is mostly on traditional MapReduce and related technologies. Hadoop MapReduce Hadoop MapReduce is a data processing framework that can be utilized to process massive amounts of data stored in HDFS. As we mentioned earlier, distributed processing of a massive amount of data in a reliable and efficient manner is not an easy task. Hadoop MapReduce aims to make it easy for users by providing a clean abstraction for programmers by providing automatic parallelization of the programs and by providing framework managed fault tolerance support. MapReduce programming model consists of Map and Reduce functions. 
The Map function receives each record of the input data (lines of a file, rows of a database, and so on) as key-value pairs and outputs key-value pairs as the result. By design, each Map function invocation is independent of each other allowing the framework to use divide and conquer to execute the computation in parallel. This also allows duplicate executions or re-executions of the Map tasks in case of failures or load imbalances without affecting the results of the computation. Typically, Hadoop creates a single Map task instance for each HDFS data block of the input data. The number of Map function invocations inside a Map task instance is equal to the number of data records in the input data block of the particular Map task instance. Hadoop MapReduce groups the output key-value records of all the Map tasks of a computation by the key and distributes them to the Reduce tasks. This distribution and transmission of data to the Reduce tasks is called the Shuffle phase of the MapReduce computation. Input data to each Reduce task would also be sorted and grouped by the key. The Reduce function gets invoked for each key and the group of values of that key (reduce <key, list_of_values>) in the sorted order of the keys. In a typical MapReduce program, users only have to implement the Map and Reduce functions and Hadoop takes care of scheduling and executing them in parallel. Hadoop will rerun any failed tasks and also provide measures to mitigate any unbalanced computations. Have a look at the following diagram for a better understanding of the MapReduce data and computational flows: In Hadoop 1.x, the MapReduce (MR1) components consisted of the JobTracker process, which ran on a master node managing the cluster and coordinating the jobs, and TaskTrackers, which ran on each compute node launching and coordinating the tasks executing in that node. Neither of these processes exist in Hadoop 2.x MapReduce (MR2). In MR2, the job coordinating responsibility of JobTracker is handled by an ApplicationMaster that will get deployed on-demand through YARN. The cluster management and job scheduling responsibilities of JobTracker are handled in MR2 by the YARN ResourceManager. JobHistoryServer has taken over the responsibility of providing information about the completed MR2 jobs. YARN NodeManagers provide the functionality that is somewhat similar to MR1 TaskTrackers by managing resources and launching containers (which in the case of MapReduce 2 houses Map or Reduce tasks) in the compute nodes. Hadoop installation modes Hadoop v2 provides three installation choices: Local mode: The local mode allows us to run MapReduce computation using just the unzipped Hadoop distribution. This nondistributed mode executes all parts of Hadoop MapReduce within a single Java process and uses the local filesystem as the storage. The local mode is very useful for testing/debugging the MapReduce applications locally. Pseudo distributed mode: Using this mode, we can run Hadoop on a single machine emulating a distributed cluster. This mode runs the different services of Hadoop as different Java processes, but within a single machine. This mode is good to let you play and experiment with Hadoop. Distributed mode: This is the real distributed mode that supports clusters that span from a few nodes to thousands of nodes. 
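To make the Map and Reduce abstractions described earlier concrete, and to give you something that can be compiled and run in the local or pseudo-distributed modes listed above, here is a minimal word-count sketch. The class names, the whitespace tokenization, and the command-line argument layout are illustrative assumptions rather than code from the book, and error handling is omitted for brevity:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

  // Map: emit <word, 1> for every token of every input line
  public static class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework has already grouped the values by key (shuffle),
  // so we only need to sum the counts for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count sketch");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenizeMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, a sketch like this would typically be launched with a command of the form hadoop jar wordcount.jar WordCountSketch <input path> <output path>; the framework takes care of scheduling the Map and Reduce tasks, the shuffle, and any re-executions of failed tasks.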
For production clusters, we recommend using one of the many packaged Hadoop distributions as opposed to installing Hadoop from scratch using the Hadoop release binaries, unless you have a specific use case that requires a vanilla Hadoop installation. Refer to the Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe for more information on Hadoop distributions. Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution The Hadoop YARN ecosystem now contains many useful components providing a wide range of data processing, storing, and querying functionalities for the data stored in HDFS. However, manually installing and configuring all of these components to work together correctly using individual release artifacts is quite a challenging task. Other challenges of such an approach include the monitoring and maintenance of the cluster and the multiple Hadoop components. Luckily, there exist several commercial software vendors that provide well integrated packaged Hadoop distributions to make it much easier to provision and maintain a Hadoop YARN ecosystem in our clusters. These distributions often come with easy GUI-based installers that guide you through the whole installation process and allow you to select and install the components that you require in your Hadoop cluster. They also provide tools to easily monitor the cluster and to perform maintenance operations. For regular production clusters, we recommend using a packaged Hadoop distribution from one of the well-known vendors to make your Hadoop journey much easier. Some of these commercial Hadoop distributions (or editions of the distribution) have licenses that allow us to use them free of charge with optional paid support agreements. Hortonworks Data Platform (HDP) is one such well-known Hadoop YARN distribution that is available free of charge. All the components of HDP are available as free and open source software. You can download HDP from http://hortonworks.com/hdp/downloads/. Refer to the installation guides available in the download page for instructions on the installation. Cloudera CDH is another well-known Hadoop YARN distribution. The Express edition of CDH is available free of charge. Some components of the Cloudera distribution are proprietary and available only for paying clients. You can download Cloudera Express from http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-express.html. Refer to the installation guides available on the download page for instructions on the installation. Hortonworks HDP, Cloudera CDH, and some of the other vendors provide fully configured quick start virtual machine images that you can download and run on your local machine using a virtualization software product. These virtual machines are an excellent resource to learn and try the different Hadoop components as well as for evaluation purposes before deciding on a Hadoop distribution for your cluster. Apache Bigtop is an open source project that aims to provide packaging and integration/interoperability testing for the various Hadoop ecosystem components. Bigtop also provides a vendor neutral packaged Hadoop distribution. While it is not as sophisticated as the commercial distributions, Bigtop is easier to install and maintain than using release binaries of each of the Hadoop components. In this recipe, we provide steps to use Apache Bigtop to install Hadoop ecosystem in your local machine. 
Benchmarking Hadoop MapReduce using TeraSort
Hadoop TeraSort is a well-known benchmark that aims to sort 1 TB of data as fast as possible using Hadoop MapReduce. The TeraSort benchmark stresses almost every part of the Hadoop MapReduce framework as well as the HDFS filesystem, making it an ideal choice to fine-tune the configuration of a Hadoop cluster. The original TeraSort benchmark sorts 10 billion 100-byte records, making the total data size 1 TB. However, we can specify the number of records, making it possible to configure the total size of data.
Getting ready
You must set up and deploy HDFS and Hadoop v2 YARN MapReduce prior to running these benchmarks, and locate the hadoop-mapreduce-examples-*.jar file in your Hadoop installation.
How to do it...
The following steps will show you how to run the TeraSort benchmark on the Hadoop cluster:
The first step of the TeraSort benchmark is the data generation. You can use the teragen command to generate the input data for the TeraSort benchmark. The first parameter of teragen is the number of records and the second parameter is the HDFS directory to generate the data in. The following command generates 1 GB of data consisting of 10 million records in the tera-in directory in HDFS. Change the location of the hadoop-mapreduce-examples-*.jar file in the following commands according to your Hadoop installation:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000000 tera-in
It's a good idea to specify the number of Map tasks for the teragen computation to speed up the data generation. This can be done by specifying the -Dmapred.map.tasks parameter. Also, you can increase the HDFS block size for the generated data so that the Map tasks of the TeraSort computation would be coarser grained (the number of Map tasks for a Hadoop computation typically equals the number of input data blocks). This can be done by specifying the -Ddfs.block.size parameter.
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Ddfs.block.size=536870912 -Dmapred.map.tasks=256 10000000 tera-in
The second step of the TeraSort benchmark is the execution of the TeraSort MapReduce computation on the data generated in step 1, using the following command. The first parameter of the terasort command is the input HDFS data directory, and the second parameter is the output HDFS data directory.
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort tera-in tera-out
It's a good idea to specify the number of Reduce tasks for the TeraSort computation to speed up the Reducer part of the computation. This can be done by specifying the -Dmapred.reduce.tasks parameter as follows:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort -Dmapred.reduce.tasks=32 tera-in tera-out
The last step of the TeraSort benchmark is the validation of the results. This can be done using the teravalidate application as follows. The first parameter is the directory with the sorted data and the second parameter is the directory to store the report containing the results.
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teravalidate tera-out tera-validate
How it works...
TeraSort uses the sorting capability of the MapReduce framework together with a custom range Partitioner to divide the Map output among the Reduce tasks, ensuring the globally sorted order.
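To illustrate the range-partitioning idea mentioned above, the following is a minimal sketch of a custom Partitioner. This is not TeraSort's actual implementation, which samples the input to compute balanced split points; the sketch simply routes keys to reducers by their first character so that concatenating the reducer outputs in order yields a globally sorted result. The assumption of printable ASCII keys is purely illustrative.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to reducers by key range: reducer 0 receives the smallest
// keys and reducer N-1 the largest, so the reducer outputs, read in
// partition order, form one globally sorted sequence.
public class SimpleRangePartitioner extends Partitioner<Text, Text> {

  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    if (numReduceTasks <= 1 || key.getLength() == 0) {
      return 0;
    }
    // charAt returns the code point at the given position; we assume
    // printable ASCII (0-127) keys for this illustration.
    int firstChar = key.charAt(0);
    int bucket = (firstChar * numReduceTasks) / 128;
    return Math.min(numReduceTasks - 1, Math.max(0, bucket));
  }
}

A partitioner like this would be attached to a job with job.setPartitionerClass(SimpleRangePartitioner.class); the real TeraSort replaces the fixed character ranges with split points derived from a sample of the input so that each reducer receives a roughly equal share of the data.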
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments In this recipe, we explore some of the important configuration options of Hadoop YARN and Hadoop MapReduce. Commercial Hadoop distributions typically provide a GUI-based approach to specify Hadoop configurations. YARN allocates resource containers to the applications based on the resource requests made by the applications and the available resource capacity of the cluster. A resource request by an application would consist of the number of containers required and the resource requirement of each container. Currently, most container resource requirements are specified using the amount of memory. Hence, our focus in this recipe will be mainly on configuring the memory allocation of a YARN cluster. Getting ready Set up a Hadoop cluster by following the earlier recipes. How to do it... The following instructions will show you how to configure the memory allocation in a YARN cluster. The number of tasks per node is derived using this configuration: The following property specifies the amount of memory (RAM) that can be used by YARN containers in a worker node. It's advisable to set this slightly less than the amount of physical RAM present in the node, leaving some memory for the OS and other non-Hadoop processes. Add or modify the following lines in the yarn-site.xml file: <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>100240</value> </property> The following property specifies the minimum amount of memory (RAM) that can be allocated to a YARN container in a worker node. Add or modify the following lines in the yarn-site.xml file to configure this property. If we assume that all the YARN resource-requests request containers with only the minimum amount of memory, the maximum number of concurrent resource containers that can be executed in a node equals (YARN memory per node specified in step 1)/(YARN minimum allocation configured below). Based on this relationship, we can use the value of the following property to achieve the desired number of resource containers per node.The number of resource containers per node is recommended to be less than or equal to the minimum of (2*number CPU cores) or (2* number of disks). <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>3072</value> </property> Restart the YARN ResourceManager and NodeManager services by running sbin/stop-yarn.sh and sbin/start-yarn.sh from the HADOOP_HOME directory. The following instructions will show you how to configure the memory requirements of the MapReduce applications. The following properties define the maximum amount of memory (RAM) that will be available to each Map and Reduce task. These memory values will be used when MapReduce applications request resources from YARN for Map and Reduce task containers. Add the following lines to the mapred-site.xml file: <property> <name>mapreduce.map.memory.mb</name> <value>3072</value> </property> <property> <name>mapreduce.reduce.memory.mb</name> <value>6144</value> </property> The following properties define the JVM heap size of the Map and Reduce tasks respectively. Set these values to be slightly less than the corresponding values in step 4, so that they won't exceed the resource limits of the YARN containers. Add the following lines to the mapred-site.xml file: <property> <name>mapreduce.map.java.opts</name> <value>-Xmx2560m</value> </property> <property> <name>mapreduce.reduce.java.opts</name> <value>-Xmx5120m</value> </property> How it works... 
We can control Hadoop configurations through the following four configuration files. Hadoop reloads the configurations from these configuration files after a cluster restart:
core-site.xml: Contains the configurations common to the whole Hadoop distribution
hdfs-site.xml: Contains configurations for HDFS
mapred-site.xml: Contains configurations for MapReduce
yarn-site.xml: Contains configurations for the YARN ResourceManager and NodeManager processes
Each configuration file has name-value pairs expressed in XML format, defining the configurations of different aspects of Hadoop. The following is an example of a property in a configuration file. The <configuration> tag is the top-level parent XML container and <property> tags, which define individual properties, are specified as child tags inside the <configuration> tag:
<configuration>
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>20</value>
  </property>
  ...
</configuration>
Some configurations can be configured on a per-job basis using the job.getConfiguration().set(name, value) method from the Hadoop MapReduce job driver code.
There's more...
There are many similar important configuration properties defined in Hadoop. The following are some of them:
conf/core-site.xml:
fs.inmemory.size.mb (default value: 200): Amount of memory, in MB, allocated to the in-memory filesystem that is used to merge map outputs at the reducers
io.file.buffer.size (default value: 131072): Size of the read/write buffer used by sequence files
conf/mapred-site.xml:
mapreduce.reduce.shuffle.parallelcopies (default value: 20): Maximum number of parallel copies the reduce step will execute to fetch output from many parallel jobs
mapreduce.task.io.sort.factor (default value: 50): Maximum number of streams merged while sorting files
mapreduce.task.io.sort.mb (default value: 200): Memory limit, in MB, while sorting data
conf/hdfs-site.xml:
dfs.blocksize (default value: 134217728): HDFS block size
dfs.namenode.handler.count (default value: 200): Number of server threads to handle RPC calls in NameNodes
You can find a list of deprecated properties in the latest version of Hadoop and the new replacement properties for them at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html. The following documents provide the list of properties, their default values, and the descriptions of each of the configuration files mentioned earlier:
Common configuration: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
HDFS configuration: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
YARN configuration: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
MapReduce configuration: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
Unit testing Hadoop MapReduce applications using MRUnit
MRUnit is a JUnit-based Java library that allows us to unit test Hadoop MapReduce programs. This makes it easy to develop as well as to maintain Hadoop MapReduce code bases. MRUnit supports testing Mappers and Reducers separately as well as testing MapReduce computations as a whole. In this recipe, we'll be exploring all three testing scenarios.
Getting ready
We use Gradle as the build tool for our sample code base.
How to do it...
The following steps show you how to perform unit testing of a Mapper using MRUnit: In the setUp method of the test class, initialize an MRUnit MapDriver instance with the Mapper class you want to test. In this example, we are going to test the Mapper of the WordCount MapReduce application we discussed in earlier recipes: public class WordCountWithToolsTest {   MapDriver<Object, Text, Text, IntWritable> mapDriver;   @Before public void setUp() {    WordCountWithTools.TokenizerMapper mapper =       new WordCountWithTools.TokenizerMapper();    mapDriver = MapDriver.newMapDriver(mapper); } …… } Write a test function to test the Mapper logic. Provide the test input to the Mapper using the MapDriver.withInput method. Then, provide the expected result of the Mapper execution using the MapDriver.withOutput method. Now, invoke the test using the MapDriver.runTest method. The MapDriver.withAll and MapDriver.withAllOutput methods allow us to provide a list of test inputs and a list of expected outputs, rather than adding them individually. @Test public void testWordCountMapper() throws IOException {    IntWritable inKey = new IntWritable(0);    mapDriver.withInput(inKey, new Text("Test Quick"));    ….    mapDriver.withOutput(new Text("Test"),new     IntWritable(1));    mapDriver.withOutput(new Text("Quick"),new     IntWritable(1));    …    mapDriver.runTest(); } The following step shows you how to perform unit testing of a Reducer using MRUnit. Similar to step 1 and 2, initialize a ReduceDriver by providing the Reducer class under test and then configure the ReduceDriver with the test input and the expected output. The input to the reduce function should conform to a key with a list of values. Also, in this test, we use the ReduceDriver.withAllOutput method to provide a list of expected outputs. public class WordCountWithToolsTest { ReduceDriver<Text,IntWritable,Text,IntWritable>   reduceDriver;   @Before public void setUp() {    WordCountWithTools.IntSumReducer reducer =       new WordCountWithTools.IntSumReducer();    reduceDriver = ReduceDriver.newReduceDriver(reducer); }   @Test public void testWordCountReduce() throws IOException {    ArrayList<IntWritable> reduceInList =       new ArrayList<IntWritable>();    reduceInList.add(new IntWritable(1));    reduceInList.add(new IntWritable(2));      reduceDriver.withInput(new Text("Quick"),     reduceInList);    ...    ArrayList<Pair<Text, IntWritable>> reduceOutList =       new ArrayList<Pair<Text,IntWritable>>();    reduceOutList.add(new Pair<Text, IntWritable>     (new Text("Quick"),new IntWritable(3)));    ...    reduceDriver.withAllOutput(reduceOutList);    reduceDriver.runTest(); } } The following steps show you how to perform unit testing on a whole MapReduce computation using MRUnit. In this step, initialize a MapReduceDriver by providing the Mapper class and Reducer class of the MapReduce program that you want to test. Then, configure the MapReduceDriver with the test input data and the expected output data. When executed, this test will execute the MapReduce execution flow starting from the Map input stage to the Reduce output stage. It's possible to provide a combiner implementation to this test as well. public class WordCountWithToolsTest { …… MapReduceDriver<Object, Text, Text, IntWritable, Text,IntWritable> mapReduceDriver; @Before public void setUp() { .... mapReduceDriver = MapReduceDriver. 
newMapReduceDriver(mapper, reducer); } @Test public void testWordCountMapReduce() throws IOException { IntWritable inKey = new IntWritable(0); mapReduceDriver.withInput(inKey, new Text ("Test Quick")); …… ArrayList<Pair<Text, IntWritable>> reduceOutList = new ArrayList<Pair<Text,IntWritable>>(); reduceOutList.add(new Pair<Text, IntWritable> (new Text("Quick"),new IntWritable(2))); …… mapReduceDriver.withAllOutput(reduceOutList); mapReduceDriver.runTest(); } } The Gradle build script (or any other Java build mechanism) can be configured to execute these unit tests with every build. We can add the MRUnit dependency to the Gradle build file as follows: dependencies { testCompile group: 'org.apache.mrunit', name: 'mrunit',   version: '1.1.+',classifier: 'hadoop2' …… } Use the following Gradle command to execute only the WordCountWithToolsTest unit test. This command executes any test class that matches the pattern **/WordCountWith*.class: $ gradle –Dtest.single=WordCountWith test :chapter3:compileJava UP-TO-DATE :chapter3:processResources UP-TO-DATE :chapter3:classes UP-TO-DATE :chapter3:compileTestJava UP-TO-DATE :chapter3:processTestResources UP-TO-DATE :chapter3:testClasses UP-TO-DATE :chapter3:test BUILD SUCCESSFUL Total time: 27.193 secs You can also execute MRUnit-based unit tests in your IDE. You can use the gradle eclipse or gradle idea commands to generate the project files for the Eclipse and IDEA IDE respectively. Generating an inverted index using Hadoop MapReduce Simple text searching systems rely on inverted index to look up the set of documents that contain a given word or a term. In this recipe, we implement a simple inverted index building application that computes a list of terms in the documents, the set of documents that contains each term, and the term frequency in each of the documents. Retrieval of results from an inverted index can be as simple as returning the set of documents that contains the given terms or can involve much more complex operations such as returning the set of documents ordered based on a particular ranking. Getting ready You must have Apache Hadoop v2 configured and installed to follow this recipe. Gradle is needed for the compiling and building of the source code. How to do it... In the following steps, we use a MapReduce program to build an inverted index for a text dataset: Create a directory in HDFS and upload a text dataset. This dataset should consist of one or more text files. $ hdfs dfs -mkdir input_dir $ hdfs dfs -put *.txt input_dir You can download the text versions of the Project Gutenberg books by following the instructions given at http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages. Make sure to provide the filetypes query parameter of the download request as txt. Unzip the downloaded files. You can use the unzipped text files as the text dataset for this recipe. Compile the source by running the gradle build command from the chapter 8 folder of the source repository. Run the inverted indexing MapReduce job using the following command.Provide the HDFS directory where you uploaded the input data in step 2 as the first argument and provide an HDFS path to store the output as the second argument: $ hadoop jar hcb-c8-samples.jar chapter8.invertindex.TextOutInvertedIndexMapReduce input_dir output_dir Check the output directory for the results by running the following command. 
The output of this program will consist of the term followed by a comma-separated list of filename and frequency: $ hdfs dfs -cat output_dir/* ARE three.txt:1,one.txt:1,four.txt:1,two.txt:1, AS three.txt:2,one.txt:2,four.txt:2,two.txt:2, AUGUSTA three.txt:1, About three.txt:1,two.txt:1, Abroad three.txt:2, We used the text outputting inverted indexing MapReduce program in step 3 for the clarity of understanding the algorithm. Run the program by substituting the command in step 3 with the following command: $ hadoop jar hcb-c8-samples.jar chapter8.invertindex.InvertedIndexMapReduce input_dir seq_output_dir How it works... The Map Function receives a chunk of an input document as the input and outputs the term and <docid, 1> pair for each word. In the Map function, we first replace all the non-alphanumeric characters from the input text value before tokenizing it as follows: public void map(Object key, Text value, ……… { String valString = value.toString().replaceAll("[^a-zA-Z0-9]+"," "); StringTokenizer itr = new StringTokenizer(valString); StringTokenizer(value.toString()); FileSplit fileSplit = (FileSplit) context.getInputSplit(); String fileName = fileSplit.getPath().getName(); while (itr.hasMoreTokens()) { term.set(itr.nextToken()); docFrequency.set(fileName, 1); context.write(term, docFrequency); } } We use the getInputSplit() method of MapContext to obtain a reference to InputSplit assigned to the current Map task. The InputSplits class for this computation are instances of FileSplit due to the usage of a FileInputFormat based InputFormat. Then we use the getPath() method of FileSplit to obtain the path of the file containing the current split and extract the filename from it. We use this extracted filename as the document ID when constructing the inverted index. The Reduce function receives IDs and frequencies of all the documents that contain the term (Key) as the input. The Reduce function then outputs the term and a list of document IDs and the number of occurrences of the term in each document as the output: public void reduce(Text key, Iterable values,Context context) …………{ HashMap<Text, IntWritable> map = new HashMap<Text, IntWritable>(); for (TermFrequencyWritable val : values) { Text docID = new Text(val.getDocumentID()); int freq = val.getFreq().get(); if (map.get(docID) != null) { map.put(docID, new IntWritable(map.get(docID).get() + freq)); } else { map.put(docID, new IntWritable(freq)); } } MapWritable outputMap = new MapWritable(); outputMap.putAll(map); context.write(key, outputMap); } In the preceding model, we output a record for each word, generating a large amount of intermediate data between Map tasks and Reduce tasks. We use the following combiner to aggregate the terms emitted by the Map tasks, reducing the amount of Intermediate data that needs to be transferred between Map and Reduce tasks: public void reduce(Text key, Iterable values …… { int count = 0; String id = ""; for (TermFrequencyWritable val : values) { count++; if (count == 1) { id = val.getDocumentID().toString(); } } TermFrequencyWritable writable = new TermFrequencyWritable(); writable.set(id, count); context.write(key, writable); } In the driver program, we set the Mapper, Reducer, and the Combiner classes. Also, we specify both Output Value and the MapOutput Value properties as we use different value types for the Map tasks and the reduce tasks. 
… job.setMapperClass(IndexingMapper.class); job.setReducerClass(IndexingReducer.class); job.setCombinerClass(IndexingCombiner.class); … job.setMapOutputValueClass(TermFrequencyWritable.class); job.setOutputValueClass(MapWritable.class); job.setOutputFormatClass(SequenceFileOutputFormat.class); There's more... We can improve this indexing program by performing optimizations such as filtering stop words, substituting words with word stems, storing more information about the context of the word, and so on, making indexing a much more complex problem. Luckily, there exist several open source indexing frameworks that we can use for indexing purposes. The later recipes of this article will explore indexing using Apache Solr and Elasticsearch, which are based on the Apache Lucene indexing engine. The upcoming section introduces the usage of MapFileOutputFormat to store InvertedIndex in an indexed random accessible manner. Outputting a random accessible indexed InvertedIndex Apache Hadoop supports a file format called MapFile that can be used to store an index into the data stored in SequenceFiles. MapFile is very useful when we need to random access records stored in a large SequenceFile. You can use the MapFileOutputFormat format to output MapFiles, which would consist of a SequenceFile containing the actual data and another file containing the index into the SequenceFile. The chapter8/invertindex/MapFileOutInvertedIndexMR.java MapReduce program in the source folder of chapter8 utilizes MapFiles to store a secondary index into our inverted index. You can execute that program by using the following command. The third parameter (sample_lookup_term) should be a word that is present in your input dataset: $ hadoop jar hcb-c8-samples.jar      chapter8.invertindex.MapFileOutInvertedIndexMR      input_dir indexed_output_dir sample_lookup_term If you check indexed_output_dir, you will be able to see folders named as part-r-xxxxx with each containing a data and an index file. We can load these indexes to MapFileOutputFormat and perform random lookups for the data. An example of a simple lookup using this method is given in the MapFileOutInvertedIndexMR.java program as follows: MapFile.Reader[] indexReaders = MapFileOutputFormat.getReaders(new Path(args[1]), getConf());MapWritable value = new MapWritable();Text lookupKey = new Text(args[2]);// Performing the lookup for the values if the lookupKeyWritable map = MapFileOutputFormat.getEntry(indexReaders,new HashPartitioner<Text, MapWritable>(), lookupKey, value); In order to use this feature, you need to make sure to disable Hadoop from writing a _SUCCESS file in the output folder by setting the following property. The presence of the _SUCCESS file might cause an error when using MapFileOutputFormat to lookup the values in the index: job.getConfiguration().setBoolean     ("mapreduce.fileoutputcommitter.marksuccessfuljobs", false); Data preprocessing using Hadoop streaming and Python Data preprocessing is an important and often required component in data analytics. Data preprocessing becomes even more important when consuming unstructured text data generated from multiple different sources. Data preprocessing steps include operations such as cleaning the data, extracting important features from data, removing duplicate items from the datasets, converting data formats, and many more. Hadoop MapReduce provides an ideal environment to perform these tasks in parallel when processing massive datasets. 
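Many such cleanup steps can be expressed as a simple Mapper with the number of reduce tasks set to zero in the driver. The following is a minimal sketch of a data-cleansing Mapper written in Java; the class name and the particular cleanup rule (lowercasing the text and stripping non-alphanumeric characters) are illustrative assumptions, not code from the accompanying samples.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: lowercases each input line, replaces any
// non-alphanumeric characters with spaces, and emits the cleaned line
// keyed by its byte offset in the input file.
public class CleanupMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private final Text cleaned = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().toLowerCase().replaceAll("[^a-z0-9\\s]", " ");
    cleaned.set(line.trim());
    context.write(key, cleaned);
  }
}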
Apart from using Java MapReduce programs or Pig scripts or Hive scripts to preprocess the data, Hadoop also contains several other tools and features that are useful in performing these data preprocessing operations. One such feature is InputFormats, which provides us with the ability to support custom data formats by implementing custom InputFormats. Another feature is the Hadoop Streaming support, which allows us to use our favorite scripting languages to perform the actual data cleansing and extraction, while Hadoop will parallelize the computation to hundreds of compute and storage resources. In this recipe, we are going to use Hadoop Streaming with a Python script-based Mapper to perform data extraction and format conversion.
Getting ready
Check whether Python is already installed on the Hadoop worker nodes. If not, install Python on all the Hadoop worker nodes.
How to do it...
The following steps show how to clean and extract data from the 20news dataset and store the data as a tab-separated file:
Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz:
$ wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz
$ tar -xzf 20news-19997.tar.gz
Upload the extracted data to HDFS. In order to save compute time and resources, you can use only a subset of the dataset:
$ hdfs dfs -mkdir 20news-all
$ hdfs dfs -put <extracted_folder> 20news-all
Extract the resource package and locate the MailPreProcessor.py Python script.
Locate the hadoop-streaming.jar JAR file of the Hadoop installation in your machine. Run the following Hadoop Streaming command using that JAR. /usr/lib/hadoop-mapreduce/ is the hadoop-streaming JAR file's location for the BigTop-based Hadoop installations:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input 20news-all/*/* -output 20news-cleaned -mapper MailPreProcessor.py -file MailPreProcessor.py
Inspect the results using the following command:
$ hdfs dfs -cat 20news-cleaned/part-* | more
How it works...
Hadoop uses the default TextInputFormat as the input specification for the previous computation. Usage of the TextInputFormat generates a Map task for each file in the input dataset and generates a Map input record for each line. Hadoop Streaming provides the input to the Map application through the standard input:
line = sys.stdin.readline();
while line:
    ....
    if (doneHeaders):
        list.append( line )
    elif line.find( "Message-ID:" ) != -1:
        messageID = line[ len("Message-ID:"):]
    ....
    elif line == "":
        doneHeaders = True
    line = sys.stdin.readline();
The preceding Python code reads the input lines from the standard input until it reaches the end of the file. We parse the headers of the newsgroup file till we encounter the empty line that demarcates the headers from the message contents. The message content will be read into a list line by line:
value = ' '.join( list )
value = fromAddress + "\t" ... "\t" + value
print '%s\t%s' % (messageID, value)
The preceding code segment merges the message content into a single string and constructs the output value of the streaming application as a tab-delimited set of selected headers, followed by the message content. The output key is the Message-ID header extracted from the input file. The output is written to the standard output by using a tab to delimit the key and the value.
There's more...
We can generate the output of the preceding computation in the Hadoop SequenceFile format by specifying SequenceFileOutputFormat as the OutputFormat of the streaming computations: $ hadoop jar /usr/lib/Hadoop-mapreduce/hadoop-streaming.jar -input 20news-all/*/* -output 20news-cleaned -mapper MailPreProcessor.py -file MailPreProcessor.py -outputformat          org.apache.hadoop.mapred.SequenceFileOutputFormat -file MailPreProcessor.py It is a good practice to store the data as SequenceFiles (or other Hadoop binary file formats such as Avro) after the first pass of the input data because SequenceFiles takes up less space and supports compression. You can use hdfs dfs -text <path_to_sequencefile> to output the contents of a SequenceFile to text: $ hdfs dfs –text 20news-seq/part-* | more However, for the preceding command to work, any Writable classes that are used in the SequenceFile should be available in the Hadoop classpath. Loading large datasets to an Apache HBase data store – importtsv and bulkload The Apache HBase data store is very useful when storing large-scale data in a semi-structured manner, so that it can be used for further processing using Hadoop MapReduce programs or to provide a random access data storage for client applications. In this recipe, we are going to import a large text dataset to HBase using the importtsv and bulkload tools. Getting ready Install and deploy Apache HBase in your Hadoop cluster. Make sure Python is installed in your Hadoop compute nodes. How to do it… The following steps show you how to load the TSV (tab-separated value) converted 20news dataset in to an HBase table: Follow the Data preprocessing using Hadoop streaming and Python recipe to perform the preprocessing of data for this recipe. We assume that the output of the following step 4 of that recipe is stored in an HDFS folder named "20news-cleaned": $ hadoop jar    /usr/lib/hadoop-mapreduce/hadoop-streaming.jar    -input 20news-all/*/*    -output 20news-cleaned    -mapper MailPreProcessor.py -file MailPreProcessor.py Start the HBase shell: $ hbase shell Create a table named 20news-data by executing the following command in the HBase shell. Older versions of the importtsv (used in the next step) command can handle only a single column family. Hence, we are using only a single column family when creating the HBase table: hbase(main):001:0> create '20news-data','h' Execute the following command to import the preprocessed data to the HBase table created earlier: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg 20news-data 20news-cleaned Start the HBase Shell and use the count and scan commands of the HBase shell to verify the contents of the table: hbase(main):010:0> count '20news-data'           12xxx row(s) in 0.0250 seconds hbase(main):010:0> scan '20news-data', {LIMIT => 10} ROW                                       COLUMN+CELL                                                                           <[email protected] column=h:c1,       timestamp=1354028803355, value= [email protected]   (Chris Katopis)> <[email protected] column=h:c2,     timestamp=1354028803355, value= sci.electronics   ...... 
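Besides the HBase shell, you can verify the imported rows programmatically. The following is a minimal sketch that uses the HBase Java client to fetch a single row from the 20news-data table; the row key string is a placeholder that you would replace with one of the Message-ID values actually present in your dataset.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VerifyImport {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // '20news-data' is the table created earlier; the row key below is a
    // placeholder for one of the imported Message-ID values.
    HTable table = new HTable(conf, "20news-data");
    try {
      Get get = new Get(Bytes.toBytes("<message-id-from-your-dataset>"));
      Result result = table.get(get);
      byte[] from = result.getValue(Bytes.toBytes("h"), Bytes.toBytes("from"));
      System.out.println("from = " + (from == null ? "<missing>" : Bytes.toString(from)));
    } finally {
      table.close();
    }
  }
}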
The following are the steps to load the 20news dataset to an HBase table using the bulkload feature:
Follow steps 1 to 3, but create the table with a different name:
hbase(main):001:0> create '20news-bulk','h'
Use the following command to generate the HBase bulkload datafiles:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg -Dimporttsv.bulk.output=20news-bulk-source 20news-bulk 20news-cleaned
List the files to verify that the bulkload datafiles are generated:
$ hadoop fs -ls 20news-bulk-source
......
drwxr-xr-x   - thilina supergroup          0 2014-04-27 10:06 /user/thilina/20news-bulk-source/h
$ hadoop fs -ls 20news-bulk-source/h
-rw-r--r--   1 thilina supergroup      19110 2014-04-27 10:06 /user/thilina/20news-bulk-source/h/4796511868534757870
The following command loads the data to the HBase table by moving the output files to the correct location:
$ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles 20news-bulk-source 20news-bulk
......
14/04/27 10:10:00 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://127.0.0.1:9000/user/thilina/20news-bulk-source/h/4796511868534757870 first= <[email protected]> last= <stephens.736002130@ngis>
......
Start the HBase shell and use the count and scan commands of the HBase shell to verify the contents of the table:
hbase(main):010:0> count '20news-bulk'
hbase(main):010:0> scan '20news-bulk', {LIMIT => 10}
How it works...
The MailPreProcessor.py Python script extracts a selected set of data fields from the newsgroup message and outputs them as a tab-separated dataset:
value = fromAddress + "\t" + newsgroup + "\t" + subject + "\t" + value
print '%s\t%s' % (messageID, value)
We import the tab-separated dataset generated by the Streaming MapReduce computations to HBase using the importtsv tool. The importtsv tool requires the data to have no tab characters other than the tab characters that separate the data fields. Hence, we remove any tab characters that may be present in the input data by using the following snippet of the Python script:
line = line.strip()
line = re.sub('\t', ' ', line)
The importtsv tool supports the loading of data into HBase directly using Put operations as well as by generating the HBase internal HFiles. The following command loads the data to HBase directly using the Put operations. Our generated dataset contains a Key and four fields in the values. We specify the data field to table column name mapping for the dataset using the -Dimporttsv.columns parameter. This mapping consists of listing the respective table column names in the order of the tab-separated data fields in the input dataset:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<data field to table column mappings> <HBase tablename> <HDFS input directory>
We can use the following command to generate HBase HFiles for the dataset. These HFiles can be directly loaded to HBase without going through the HBase APIs, thereby reducing the amount of CPU and network resources needed:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<field to column mappings> -Dimporttsv.bulk.output=<path for hfile output> <HBase tablename> <HDFS input directory>
These generated HFiles can be loaded into HBase tables by simply moving the files to the right location.
This moving can be performed by using the completebulkload command:
$ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <HDFS path for hfiles> <table name>
There's more...
You can also use the importtsv tool with datasets that use a data-field separator character other than the tab character by specifying the '-Dimporttsv.separator' parameter. The following is an example of using a comma as the separator character to import a comma-separated dataset into an HBase table:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.columns=<data field to table column mappings> <HBase tablename> <HDFS input directory>
Look out for Bad Lines in the MapReduce job console output or in the Hadoop monitoring console. One reason for Bad Lines is the presence of unwanted delimiter characters. The Python script we used in the data-cleaning step removes any extra tabs in the message:
14/03/27 00:38:10 INFO mapred.JobClient:   ImportTsv
14/03/27 00:38:10 INFO mapred.JobClient:     Bad Lines=2
Data de-duplication using HBase
HBase supports the storing of multiple versions of column values for each record. When querying, HBase returns the latest version of the values, unless we specifically mention a time period. This feature of HBase can be used to perform automatic de-duplication by making sure we use the same RowKey for duplicate values. In our 20news example, we use MessageID as the RowKey for the records, ensuring that duplicate messages will appear as different versions of the same data record. HBase allows us to configure the maximum or minimum number of versions per column family. Setting the maximum number of versions to a low value will reduce the data usage by discarding the old versions. Refer to http://hbase.apache.org/book/schema.versions.html for more information on setting the maximum or minimum number of versions.
Summary
In this article, we have learned about getting started with Hadoop, benchmarking Hadoop MapReduce, optimizing Hadoop YARN, unit testing, generating an inverted index, data preprocessing, and loading large datasets to an Apache HBase data store.
Resources for Article:
Further resources on this subject:
Hive in Hadoop [article]
Evolution of Hadoop [article]
Learning Data Analytics with R and Hadoop [article]
Writing Consumers

Packt
04 Mar 2015
20 min read
This article by Nishant Garg, the author of the book Learning Apache Kafka Second Edition, focuses on the details of Writing Consumers. Consumers are the applications that consume the messages published by Kafka producers and process the data extracted from them. Like producers, consumers can also be different in nature, such as applications doing real-time or near real-time analysis, applications with NoSQL or data warehousing solutions, backend services, consumers for Hadoop, or other subscriber-based solutions. These consumers can also be implemented in different languages such as Java, C, and Python. (For more resources related to this topic, see here.) In this article, we will focus on the following topics: The Kafka Consumer API Java-based Kafka consumers Java-based Kafka consumers consuming partitioned messages At the end of the article, we will explore some of the important properties that can be set for a Kafka consumer. So, let's start. The preceding diagram explains the high-level working of the Kafka consumer when consuming the messages. The consumer subscribes to the message consumption from a specific topic on the Kafka broker. The consumer then issues a fetch request to the lead broker to consume the message partition by specifying the message offset (the beginning position of the message offset). Therefore, the Kafka consumer works in the pull model and always pulls all available messages after its current position in the Kafka log (the Kafka internal data representation). While subscribing, the consumer connects to any of the live nodes and requests metadata about the leaders for the partitions of a topic. This allows the consumer to communicate directly with the lead broker receiving the messages. Kafka topics are divided into a set of ordered partitions and each partition is consumed by one consumer only. Once a partition is consumed, the consumer changes the message offset to the next partition to be consumed. This represents the states about what has been consumed and also provides the flexibility of deliberately rewinding back to an old offset and re-consuming the partition. In the next few sections, we will discuss the API provided by Kafka for writing Java-based custom consumers. All the Kafka classes referred to in this article are actually written in Scala. Kafka consumer APIs Kafka provides two types of API for Java consumers: High-level API Low-level API The high-level consumer API The high-level consumer API is used when only data is needed and the handling of message offsets is not required. This API hides broker details from the consumer and allows effortless communication with the Kafka cluster by providing an abstraction over the low-level implementation. The high-level consumer stores the last offset (the position within the message partition where the consumer left off consuming the message), read from a specific partition in Zookeeper. This offset is stored based on the consumer group name provided to Kafka at the beginning of the process. The consumer group name is unique and global across the Kafka cluster and any new consumers with an in-use consumer group name may cause ambiguous behavior in the system. When a new process is started with the existing consumer group name, Kafka triggers a rebalance between the new and existing process threads for the consumer group. After the rebalance, some messages that are intended for a new process may go to an old process, causing unexpected results. 
To avoid this ambiguous behavior, any existing consumers should be shut down before starting new consumers for an existing consumer group name. The following classes are used to write Java-based basic consumers using the high-level consumer API for a Kafka cluster:

- ConsumerConnector: Kafka provides the ConsumerConnector interface, which is implemented by the ZookeeperConsumerConnector class (kafka.javaapi.consumer.ZookeeperConsumerConnector). This class is responsible for all the interaction a consumer has with ZooKeeper.
- KafkaStream: Objects of the kafka.consumer.KafkaStream class are returned by the createMessageStreams call from the ConsumerConnector implementation. A list of KafkaStream objects is returned for each topic, and each stream can further create an iterator over the messages it carries. The following is the Scala-based class declaration:

  class KafkaStream[K,V](private val queue: BlockingQueue[FetchedDataChunk],
                         consumerTimeoutMs: Int,
                         private val keyDecoder: Decoder[K],
                         private val valueDecoder: Decoder[V],
                         val clientId: String)

  Here, the parameters K and V specify the types of the partition key and message value, respectively. In the create call from the ConsumerConnector class, clients can specify the number of desired streams, where each stream object is used for single-threaded processing. These stream objects may represent the merging of multiple unique partitions.
- ConsumerConfig: The kafka.consumer.ConsumerConfig class encapsulates the property values required for establishing the connection with ZooKeeper, such as the ZooKeeper URL, ZooKeeper session timeout, and ZooKeeper sync time. It also contains the property values required by the consumer, such as the group ID.

A high-level API-based working consumer example is discussed after the next section.

The low-level consumer API

The high-level API does not allow consumers to control their interactions with brokers. Also known as the "simple consumer API", the low-level consumer API is stateless and provides fine-grained control over the communication between the Kafka broker and the consumer. It allows consumers to set the message offset with every request raised to the broker and maintains the metadata at the consumer's end. This API can be used by both online and offline consumers such as Hadoop. These types of consumers can also perform multiple reads of the same message or manage transactions to ensure a message is consumed only once.

Compared to the high-level consumer API, developers need to put in extra effort to gain this low-level control within consumers: keeping track of offsets, figuring out the lead broker for the topic and partition, handling lead broker changes, and so on. In the low-level consumer API, consumers first query the live broker to find out the details about the lead broker. Information about the live broker can be passed to the consumers either using a properties file or from the command line. The topicsMetadata() method of the kafka.javaapi.TopicMetadataResponse class is used to find out metadata about the topic of interest from the lead broker. For message partition reading, the kafka.api.OffsetRequest class defines two constants, EarliestTime and LatestTime, to find the beginning of the data in the logs and the stream of new messages.
These constants also help consumers track which messages have already been read. The main class used within the low-level consumer API is the SimpleConsumer class (kafka.javaapi.consumer.SimpleConsumer). The simple consumer class provides a connection to the lead broker for fetching messages from the topic, as well as methods to get the topic metadata and the list of offsets. A few more important classes for building the different request objects are FetchRequest (kafka.api.FetchRequest), OffsetRequest (kafka.javaapi.OffsetRequest), OffsetFetchRequest (kafka.javaapi.OffsetFetchRequest), OffsetCommitRequest (kafka.javaapi.OffsetCommitRequest), and TopicMetadataRequest (kafka.javaapi.TopicMetadataRequest).

All the examples in this article are based on the high-level consumer API. For examples based on the low-level consumer API, refer to https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example.

Simple Java consumers

Now we will start writing a single-threaded simple Java consumer developed using the high-level consumer API for consuming the messages from a topic. This SimpleHLConsumer class is used to fetch messages from a specific topic and consume them, assuming that there is a single partition within the topic.

Importing classes

As a first step, we need to import the following classes:

import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

Defining properties

As the next step, we need to define properties for making a connection with ZooKeeper and pass these properties to the Kafka consumer using the following code:

Properties props = new Properties();
props.put("zookeeper.connect", "localhost:2181");
props.put("group.id", "testgroup");
props.put("zookeeper.session.timeout.ms", "500");
props.put("zookeeper.sync.time.ms", "250");
props.put("auto.commit.interval.ms", "1000");
new ConsumerConfig(props);

Now let us look at the major properties mentioned in the code:

- zookeeper.connect: This property specifies the ZooKeeper <node:port> connection detail that is used to find the running ZooKeeper instance in the cluster. In the Kafka cluster, ZooKeeper is used to store offsets of messages consumed for a specific topic and partition by this consumer group.
- group.id: This property specifies the name of the consumer group shared by all the consumers within the group. This is also the name used by ZooKeeper to store the offsets.
- zookeeper.session.timeout.ms: This property specifies the ZooKeeper session timeout in milliseconds and represents the amount of time Kafka will wait for ZooKeeper to respond to a request before giving up and continuing to consume messages.
- zookeeper.sync.time.ms: This property specifies the ZooKeeper sync time in milliseconds between the ZooKeeper leader and its followers.
- auto.commit.interval.ms: This property defines the frequency in milliseconds at which consumed offsets get committed to ZooKeeper.

Reading messages from a topic and printing them

As a final step, we need to read the messages using the following code:

Map<String, Integer> topicMap = new HashMap<String, Integer>();
// 1 represents the single thread
topicMap.put(topic, new Integer(1));

Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
    consumer.createMessageStreams(topicMap);

// Get the list of message streams for each topic, using the default decoder.
List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap.get(topic);

for (final KafkaStream<byte[], byte[]> stream : streamList) {
  ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();
  while (consumerIte.hasNext())
    System.out.println("Message from Single Topic :: "
        + new String(consumerIte.next().message()));
}

So the complete program will look like the following code:

package kafka.examples.ch5;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class SimpleHLConsumer {

  private final ConsumerConnector consumer;
  private final String topic;

  public SimpleHLConsumer(String zookeeper, String groupId, String topic) {
    consumer = kafka.consumer.Consumer
        .createJavaConsumerConnector(createConsumerConfig(zookeeper, groupId));
    this.topic = topic;
  }

  private static ConsumerConfig createConsumerConfig(String zookeeper,
      String groupId) {
    Properties props = new Properties();
    props.put("zookeeper.connect", zookeeper);
    props.put("group.id", groupId);
    props.put("zookeeper.session.timeout.ms", "500");
    props.put("zookeeper.sync.time.ms", "250");
    props.put("auto.commit.interval.ms", "1000");
    return new ConsumerConfig(props);
  }

  public void testConsumer() {
    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    // Define single thread for topic
    topicMap.put(topic, new Integer(1));
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
        consumer.createMessageStreams(topicMap);
    List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap
        .get(topic);
    for (final KafkaStream<byte[], byte[]> stream : streamList) {
      ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();
      while (consumerIte.hasNext())
        System.out.println("Message from Single Topic :: "
            + new String(consumerIte.next().message()));
    }
    if (consumer != null)
      consumer.shutdown();
  }

  public static void main(String[] args) {
    String zooKeeper = args[0];
    String groupId = args[1];
    String topic = args[2];
    SimpleHLConsumer simpleHLConsumer = new SimpleHLConsumer(
        zooKeeper, groupId, topic);
    simpleHLConsumer.testConsumer();
  }
}

Before running this, make sure you have created the topic kafkatopic from the command line:

[root@localhost kafka_2.9.2-0.8.1.1]# bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic kafkatopic

Before compiling and running a Java-based Kafka program in the console, make sure you download the slf4j-1.7.7.tar.gz file from http://www.slf4j.org/download.html and copy slf4j-log4j12-1.7.7.jar contained within slf4j-1.7.7.tar.gz to the /opt/kafka_2.9.2-0.8.1.1/libs directory.
Also add all the libraries available in /opt/kafka_2.9.2-0.8.1.1/libs to the classpath using the following commands:

[root@localhost kafka_2.9.2-0.8.1.1]# export KAFKA_LIB=/opt/kafka_2.9.2-0.8.1.1/libs
[root@localhost kafka_2.9.2-0.8.1.1]# export CLASSPATH=.:$KAFKA_LIB/jopt-simple-3.2.jar:$KAFKA_LIB/kafka_2.9.2-0.8.1.1.jar:$KAFKA_LIB/log4j-1.2.15.jar:$KAFKA_LIB/metrics-core-2.2.0.jar:$KAFKA_LIB/scala-library-2.9.2.jar:$KAFKA_LIB/slf4j-api-1.7.2.jar:$KAFKA_LIB/slf4j-log4j12-1.7.7.jar:$KAFKA_LIB/snappy-java-1.0.5.jar:$KAFKA_LIB/zkclient-0.3.jar:$KAFKA_LIB/zookeeper-3.3.4.jar

Multithreaded Java consumers

The previous example is a very basic example of a consumer that consumes messages from a single broker with no explicit partitioning of messages within the topic. Let's jump to the next level and write another program that consumes messages from multiple partitions, connecting to single or multiple topics.

A multithreaded, high-level, consumer-API-based design is usually based on the number of partitions in the topic and follows a one-to-one mapping between threads and the partitions within the topic. For example, if four partitions are defined for a topic, as a best practice only four threads should be initiated within the consumer application to read the data; otherwise, some conflicting behavior may occur, such as threads never receiving a message or a thread receiving messages from multiple partitions. Also, receiving messages from multiple partitions does not guarantee that the messages will be placed in order. For example, a thread may receive two messages from the first partition and three from the second partition, then three more from the first partition, followed by more from the first partition again, even if the second partition has data available.

Let's move further on.

Importing classes

As a first step, we need to import the following classes:

import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

Defining properties

As the next step, we need to define properties for making a connection with ZooKeeper and pass these properties to the Kafka consumer using the following code:

Properties props = new Properties();
props.put("zookeeper.connect", "localhost:2181");
props.put("group.id", "testgroup");
props.put("zookeeper.session.timeout.ms", "500");
props.put("zookeeper.sync.time.ms", "250");
props.put("auto.commit.interval.ms", "1000");
new ConsumerConfig(props);

The preceding properties have already been discussed in the previous example. For more details on Kafka consumer properties, refer to the last section of this article.
Reading the messages from threads and printing them

The only difference from the previous section is that we first create a thread pool and get the Kafka streams associated with each thread in the pool, as shown in the following code:

// Define thread count for each topic
topicMap.put(topic, new Integer(threadCount));

// Here we have used a single topic, but we can also add
// multiple topics to the topic count map
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
    consumer.createMessageStreams(topicMap);

List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap.get(topic);

// Launching the thread pool
executor = Executors.newFixedThreadPool(threadCount);

The complete program listing for the multithreaded Kafka consumer based on the Kafka high-level consumer API is as follows:

package kafka.examples.ch5;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class MultiThreadHLConsumer {

  private ExecutorService executor;
  private final ConsumerConnector consumer;
  private final String topic;

  public MultiThreadHLConsumer(String zookeeper, String groupId, String topic) {
    consumer = kafka.consumer.Consumer
        .createJavaConsumerConnector(createConsumerConfig(zookeeper, groupId));
    this.topic = topic;
  }

  private static ConsumerConfig createConsumerConfig(String zookeeper,
      String groupId) {
    Properties props = new Properties();
    props.put("zookeeper.connect", zookeeper);
    props.put("group.id", groupId);
    props.put("zookeeper.session.timeout.ms", "500");
    props.put("zookeeper.sync.time.ms", "250");
    props.put("auto.commit.interval.ms", "1000");
    return new ConsumerConfig(props);
  }

  public void shutdown() {
    if (consumer != null)
      consumer.shutdown();
    if (executor != null)
      executor.shutdown();
  }

  public void testMultiThreadConsumer(int threadCount) {
    Map<String, Integer> topicMap = new HashMap<String, Integer>();

    // Define thread count for each topic
    topicMap.put(topic, new Integer(threadCount));

    // Here we have used a single topic, but we can also add
    // multiple topics to the topic count map
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
        consumer.createMessageStreams(topicMap);

    List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap
        .get(topic);

    // Launching the thread pool
    executor = Executors.newFixedThreadPool(threadCount);

    // Creating an object for message consumption
    int count = 0;
    for (final KafkaStream<byte[], byte[]> stream : streamList) {
      final int threadNumber = count;
      executor.submit(new Runnable() {
        public void run() {
          ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();
          while (consumerIte.hasNext())
            System.out.println("Thread Number " + threadNumber + ": "
                + new String(consumerIte.next().message()));
          System.out.println("Shutting down Thread Number: " + threadNumber);
        }
      });
      count++;
    }
  }

  public static void main(String[] args) {
    String zooKeeper = args[0];
    String groupId = args[1];
    String topic = args[2];
    int threadCount = Integer.parseInt(args[3]);
    MultiThreadHLConsumer multiThreadHLConsumer =
        new MultiThreadHLConsumer(zooKeeper, groupId, topic);
    multiThreadHLConsumer.testMultiThreadConsumer(threadCount);
    try {
      Thread.sleep(10000);
    } catch (InterruptedException ie) {
    }
    multiThreadHLConsumer.shutdown();
  }
}

Compile the preceding program, and before running it, read the following tip.

Before we run this program, we need to make sure our cluster is running as a multi-broker cluster (comprising either single or multiple nodes). Once your multi-broker cluster is up, create a topic with four partitions and set the replication factor to 2 before running this program, using the following command:

[root@localhost kafka-0.8]# bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic kafkatopic --partitions 4 --replication-factor 2

The Kafka consumer property list

The following is a list of a few important properties that can be configured for high-level, consumer-API-based Kafka consumers. The Scala class kafka.consumer.ConsumerConfig provides implementation-level details for consumer configurations. For a complete list, visit http://kafka.apache.org/documentation.html#consumerconfigs.

- group.id (no default): This property defines a unique identity for the set of consumers within the same consumer group.
- consumer.id (default: null): This property is specified for the Kafka consumer and generated automatically if not defined.
- zookeeper.connect (no default): This property specifies the ZooKeeper connection string, <hostname:port/chroot/path>. Kafka uses ZooKeeper to store offsets of messages consumed for a specific topic and partition by the consumer group. /chroot/path defines the data location in a global ZooKeeper namespace.
- client.id (default: ${group.id}): The client.id value is specified by the Kafka client with each request and is used to identify the client making the requests.
- zookeeper.session.timeout.ms (default: 6000): This property defines the time (in milliseconds) a Kafka consumer waits for a ZooKeeper pulse before it is declared dead and a rebalance is initiated.
- zookeeper.connection.timeout.ms (default: 6000): This value defines the maximum waiting time (in milliseconds) for the client to establish a connection with ZooKeeper.
- zookeeper.sync.time.ms (default: 2000): This property defines the time (in milliseconds) it takes to sync a ZooKeeper follower with the ZooKeeper leader.
- auto.commit.enable (default: true): This property enables a periodic commit to ZooKeeper of the message offsets already fetched by the consumer. In the event of consumer failures, these committed offsets are used as the starting position by the new consumers.
- auto.commit.interval.ms (default: 60 * 1000): This property defines the frequency (in milliseconds) at which the consumed offsets get committed to ZooKeeper.
- auto.offset.reset (default: largest): This property defines what to do when no initial offset is available in ZooKeeper or the offset is out of range. Possible values are largest (reset to the largest offset), smallest (reset to the smallest offset), and anything else (throw an exception).
- consumer.timeout.ms (default: -1): This property throws a timeout exception to the consumer if no message is available for consumption after the specified interval.

Summary

In this article, we have learned how to write basic consumers and learned about some advanced levels of Java consumers that consume messages from partitions.
Resources for Article: Further resources on this subject: Introducing Kafka? [article] Introduction To Apache Zookeeper [article] Creating Apache Jmeter™ Test Workbench [article]

Deployment Scenarios

Packt
04 Mar 2015
10 min read
This article by Andrea Gazzarini, author of the book Apache Solr Essentials, contains information on the various ways in which you can deploy Solr, including key features and pros and cons for each scenario. Solr has a wide range of deployment alternatives, from monolithic to distributed indexes and from standalone to clustered instances. We will organize this article by deployment scenario, with a growing level of complexity. This article will cover the following topics:

- Sharding
- Replication: master, slave, and repeaters

Standalone instance

All the examples use a standalone instance of Solr, that is, one or more cores managed by a Solr deployment hosted in a standalone servlet container (for example, Jetty, Tomcat, and so on). This kind of deployment is useful for development because, as you learned, it is very easy to start and debug. Besides, it can also be suitable for a production context if you don't have strict non-functional requirements and have a small or medium amount of data. I have used a standalone instance to provide autocomplete services for small and medium intranet systems. Anyway, the main features of this kind of deployment are simplicity and maintainability; one simple node acts as both an indexer and a searcher. The following diagram depicts a standalone instance with two cores.

Shards

When a monolithic index becomes too large for a single node, or when additions, deletions, or queries take too long to execute, the index can be split into multiple pieces called shards. The previous sentence highlights a logical and theoretical evolution path of a Solr index; however, the following advice is (in general) valid for all the scenarios we will describe: it is strongly recommended that you perform a preliminary analysis of your data and its estimated growth factor in order to decide from the beginning on the right configuration that suits your requirements. Although it is possible to split an existing index into shards (https://lucene.apache.org/core/4_10_3/misc/org/apache/lucene/index/PKIndexSplitter.html), things definitely become easier if you start directly with a distributed index (if you need it, of course).

The index is split so that each shard contains a disjoint subset of the entire index, and Solr will query and merge results across those shards. The following diagram illustrates a Solr deployment with 3 nodes; this deployment consists of two cores (C1 and C2) divided into three shards (S1, S2, and S3).

When using shards, only query requests are distributed. This means that it's up to the indexer to add and distribute the data across nodes, and to subsequently forward a change request (that is, a delete, replace, or commit) for a given document to the appropriate shard (the shard that owns the document). The Solr Wiki recommends a simple, hash-based algorithm to determine the shard where a given document should be indexed:

documentId.hashCode() % numServers

Using this approach is also useful in order to know in advance where to send delete or update requests for a given document. On the opposite side, a searcher client can send a query request to any node, but it has to specify an additional shards parameter that declares the target shards to be queried.
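As a rough illustration of the hash-based routing just described, an indexing client could pick the target core with a small helper like the one below. The helper class, the shard URL list, and the sign handling applied to hashCode() are assumptions added for this sketch and are not part of Solr itself:

// Hypothetical routing helper (not part of Solr): maps a document ID to one
// of the known shard base URLs using the hash-based rule discussed above.
public class ShardRouter {

    private final String[] shardUrls;

    public ShardRouter(String[] shardUrls) {
        this.shardUrls = shardUrls;
    }

    public String shardFor(String documentId) {
        // Mask the sign bit so that a negative hashCode() never produces
        // a negative array index.
        int index = (documentId.hashCode() & Integer.MAX_VALUE) % shardUrls.length;
        return shardUrls[index];
    }
}

A client built around such a helper, constructed for example with new ShardRouter(new String[]{"http://localhost:8080/solr/c1", "http://localhost:8081/solr/c2"}), would send add, delete, or update requests for a document to the URL returned by shardFor(documentId). Query requests, on the other hand, can go to any node, as described next.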
In the following example, assuming that two shards are hosted in two servers listening on ports 8080 and 8081, the same request sent to either node will produce the same result:

http://localhost:8080/solr/c1/query?q=*:*&shards=localhost:8080/solr/c1,localhost:8081/solr/c2
http://localhost:8081/solr/c2/query?q=*:*&shards=localhost:8080/solr/c1,localhost:8081/solr/c2

When sending a query request, a client can optionally include a pseudofield associated with the [shard] transformer. In this case, each returned document will carry additional information indicating the owning shard. This is an example of such a request:

http://localhost:8080/solr/c1/query?q=*:*&shards=localhost:8080/solr/c1,localhost:8081/solr/c2&fl=*,src_shard:[shard]

Here is the corresponding response (note the pseudofield aliased as src_shard):

<result name="response" numFound="192" start="0">
  <doc>
    <str name="id">9920</str>
    <str name="brand">Fender</str>
    <str name="model">Jazz Bass</str>
    <arr name="artist">
      <str>Marcus Miller</str>
    </arr>
    <str name="series">Marcus Miller signature</str>
    <str name="src_shard">localhost:8080/solr/shard1</str>
  </doc>
  …
  <doc>
    <str name="id">4392</str>
    <str name="brand">Music Man</str>
    <str name="model">Sting Ray</str>
    <arr name="artist"><str>Tony Levin</str></arr>
    <str name="series">5 strings DeLuxe</str>
    <str name="src_shard">localhost:8081/solr/shard2</str>
  </doc>
</result>

The following are a few things to keep in mind when using this deployment scenario:

- The schema must have a uniqueKey field. This field must be declared as stored and indexed and is supposed to be unique across all shards.
- Inverse Document Frequency (IDF) calculations cannot be distributed; IDF is computed per shard.
- Joins between documents belonging to different shards are not supported.
- If a shard receives both index and query requests, the index may change during a query execution, thus compromising the outgoing results (for example, a matching document may have been deleted).

Master/slaves scenario

In a master/slaves scenario, there are two types of Solr servers: an indexer (the master) and one or more searchers (the slaves). The master is the server that manages the index. It receives update requests and applies those changes. A searcher, on the other hand, is a Solr server that exposes search services to external clients. The index, in terms of data files, is replicated from the indexer to the searchers over HTTP by means of a built-in RequestHandler that must be configured on both the indexer side and the searcher side (within the solrconfig.xml configuration file). On the indexer (master), a replication configuration looks like this:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

The replication mechanism can be configured to be triggered after one of the following events:

- Commit: A commit has been applied
- Optimize: The index has been optimized
- Startup: The Solr instance has started

In the preceding example, we want the index to be replicated after the startup and optimize commands. Using the confFiles parameter, we can also indicate a set of configuration files (schema.xml and stopwords.txt, in the example) that must be replicated together with the index. Remember that changes to those files alone don't trigger any replication.
Only a change in the index, in conjunction with one of the events we defined in the replicateAfter parameter, will mark the index (and the configuration files) as replicable. On the searcher side, the configuration looks like the following:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://<localhost>:<port>/solrmaster</str>
    <str name="pollInterval">00:00:10</str>
  </lst>
</requestHandler>

You can see that a searcher periodically polls the master (the pollInterval parameter) to check whether a newer version of the index is available. If it is, the searcher will start the replication mechanism by issuing a request to the master, which is completely unaware of the searchers. The replicability status of the index is actually indicated by a version number. If the searcher has the same version as the master, the index is the same. If the versions are different, a newer version of the index is available on the master, and replication can start.

Other than separating responsibilities, this deployment configuration allows us to have a so-called diamond architecture, consisting of one indexer and several searchers. When the replication is triggered, each searcher in the ring will receive a whole copy of the index. This allows the following:

- Load balancing of the incoming (query) requests.
- An increase in the availability of the whole system. In the event of a server crash, the other searchers will continue to serve the incoming requests.

The following diagram illustrates a master/slave deployment scenario with one indexer, three searchers, and two cores.

If the searchers are spread across several geographically dislocated data centers, an additional role called a repeater can be configured in each data center in order to rationalize the flow of replication traffic between nodes. A repeater is simply a node that acts as both a master and a slave: it is a slave of the main master and, at the same time, it acts as the master of the searchers within the same data center, as shown in the corresponding diagram.

Shards with replication

This scenario combines shards and replication in order to have a scalable system with high throughput and availability. There is one indexer and one or more searchers for each shard, allowing load balancing between (query) shard requests. The following diagram illustrates a scenario with two cores, three shards, one indexer, and (due to problems with available space) only one searcher for each shard.

The drawback of this approach is undoubtedly the overall growing complexity of the system, which requires more effort in terms of maintainability, manageability, and system administration. In addition, each searcher is an independent node, and we don't have a central administration console where a system administrator can get a quick overview of system health.

Summary

In this article, we described various ways in which you can deploy Solr. Each deployment scenario has specific features, advantages, and drawbacks that make it ideal for one context and a bad choice for another. A good thing is that the different scenarios are not strictly exclusive; they follow an incremental approach. In an ideal context, things should start immediately with the perfect scenario that fits your needs. However, unless your requirements are clear right from the start, you can begin with a simple configuration and then change it, depending on how your application evolves.
Resources for Article: Further resources on this subject: Tuning Solr JVM and Container [article] Boost Your search [article] In the Cloud [article]


Elasticsearch Administration

Packt
03 Mar 2015
28 min read
In this article by Rafał Kuć and Marek Rogoziński, authors of the book Mastering Elasticsearch, Second Edition, we will talk more about the Elasticsearch configuration and the new features introduced in Elasticsearch 1.0 and higher. By the end of this article, you will have learned about:

- Configuring the discovery and recovery modules
- Using the Cat API, which allows a human-readable insight into the cluster status
- The backup and restore functionality
- Federated search

Discovery and recovery modules

When starting your Elasticsearch node, one of the first things that Elasticsearch does is look for a master node that has the same cluster name and is visible in the network. If a master node is found, the starting node joins an already formed cluster. If no master is found, then the node itself is selected as a master (of course, if the configuration allows such behavior). The process of forming a cluster and finding nodes is called discovery. The module responsible for discovery has two main purposes: electing a master and discovering new nodes within a cluster.

After the cluster is formed, a process called recovery is started. During the recovery process, Elasticsearch reads the metadata and the indices from the gateway and prepares the shards that are stored there to be used. After the recovery of the primary shards is done, Elasticsearch should be ready for work and should continue with the recovery of all the replicas (if they are present). In this section, we will take a deeper look at these two modules and discuss the configuration possibilities Elasticsearch gives us and what the consequences of changing them are. Note that the information provided in the Discovery and recovery modules section is an extension of what we already wrote in Elasticsearch Server Second Edition, published by Packt Publishing.

Discovery configuration

As we have already mentioned multiple times, Elasticsearch was designed to work in a distributed environment. This is the main difference when comparing Elasticsearch to other open source search and analytics solutions available. With such assumptions, Elasticsearch is very easy to set up in a distributed environment, and we are not forced to set up additional software to make it work like this. By default, Elasticsearch assumes that the cluster is automatically formed by the nodes that declare the same cluster.name setting and can communicate with each other using multicast requests. This allows us to have several independent clusters in the same network. There are a few implementations of the discovery module that we can use, so let's see what the options are.

Zen discovery

Zen discovery is the default mechanism responsible for discovery in Elasticsearch and is available out of the box. The default Zen discovery configuration uses multicast to find other nodes. This is a very convenient solution: just start a new Elasticsearch node and everything works; the node will be joined to the cluster if it has the same cluster name and is visible to the other nodes in that cluster. This discovery method is perfectly suited for development, because you don't need to care about the configuration; however, it is not advised for production environments. Relying only on the cluster name is handy but can also lead to potential problems and mistakes, such as the accidental joining of nodes. Sometimes, multicast is not available for various reasons, or you don't want to use it for the reasons just mentioned.
In the case of bigger clusters, multicast discovery may generate too much unnecessary traffic, and this is another valid reason why it shouldn't be used in production. For these cases, Zen discovery allows us to use the unicast mode. When using unicast Zen discovery, a node that is not a part of the cluster will send a ping request to all the addresses specified in the configuration. By doing this, it informs all the specified nodes that it is ready to be a part of the cluster and can be either joined to an existing cluster or form a new one. Of course, after the node joins the cluster, it gets the cluster topology information, but the initial connection is only made to the specified list of hosts. Remember that even when using unicast Zen discovery, the Elasticsearch node still needs to have the same cluster name as the other nodes.

If you want to know more about the differences between the multicast and unicast ping methods, refer to these URLs: http://en.wikipedia.org/wiki/Multicast and http://en.wikipedia.org/wiki/Unicast. If you still want to use multicast Zen discovery, let's look at its configuration properties.

Multicast Zen discovery configuration

The multicast part of the Zen discovery module exposes the following settings:

- discovery.zen.ping.multicast.address (the default: all available interfaces): This is the interface used for the communication, given as an address or interface name.
- discovery.zen.ping.multicast.port (the default: 54328): This is the port used for communication.
- discovery.zen.ping.multicast.group (the default: 224.2.2.4): This is the multicast address to send messages to.
- discovery.zen.ping.multicast.buffer_size (the default: 2048): This is the size of the buffer used for multicast messages.
- discovery.zen.ping.multicast.ttl (the default: 3): This is the time for which a multicast message lives. Every time a packet crosses a router, the TTL is decreased. This allows you to limit the area in which the transmission can be received. Note that routers can have threshold values that are compared against the TTL, so the TTL value may not exactly match the number of routers a packet can cross.
- discovery.zen.ping.multicast.enabled (the default: true): Setting this property to false turns off multicast. You should disable multicast if you are planning to use the unicast discovery method.

The unicast Zen discovery configuration

The unicast part of Zen discovery provides the following configuration options:

- discovery.zen.ping.unicast.hosts: This is the initial list of nodes in the cluster. The list can be defined as a list or as an array of hosts. Every host can be given as a name (or an IP address) and can have a port or port range added. For example, the value of this property can look like this: ["master1", "master2:8181", "master3[80000-81000]"]. Basically, the hosts list for unicast discovery doesn't need to be a complete list of the Elasticsearch nodes in your cluster, because once the node is connected to one of the mentioned nodes, it will be informed about all the others that form the cluster.
- discovery.zen.ping.unicast.concurrent_connects (the default: 10): This is the maximum number of concurrent connections unicast discovery will use. If you have a lot of nodes that the initial connection should be made to, it is advised that you increase the default value.
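Putting the multicast and unicast settings together, a minimal elasticsearch.yml sketch for a node that relies purely on unicast discovery might look like the following; the cluster name, host names, and port are placeholder values chosen for this example, not values required by Elasticsearch:

cluster.name: mycluster
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master1:9300", "master2:9300", "master3:9300"]

With such a configuration, a starting node pings only the three listed hosts; once it joins the cluster through any one of them, it learns about the rest of the topology, as described above.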
Master node

One of the main purposes of discovery, apart from connecting to other nodes, is to choose a master node: a node that will take care of and manage all the other nodes. This process is called master election and is a part of the discovery module. No matter how many master eligible nodes there are, each cluster will only have a single master node active at a given time. If there is more than one master eligible node present in the cluster, one of them can be elected as the master when the original master fails and is removed from the cluster.

Configuring master and data nodes

By default, Elasticsearch allows every node to be both a master node and a data node. However, in certain situations, you may want to have worker nodes that only hold the data or process the queries, and master nodes that are only used to manage the cluster. One of these situations is handling a massive amount of data, where data nodes should be as performant as possible and there shouldn't be any delay in the master nodes' responses.

Configuring data-only nodes

To set a node to only hold data, we need to instruct Elasticsearch that we don't want such a node to be a master node. In order to do this, we add the following properties to the elasticsearch.yml configuration file:

node.master: false
node.data: true

Configuring master-only nodes

To set a node not to hold data and only to be a master node, we need to instruct Elasticsearch that we don't want such a node to hold data. In order to do that, we add the following properties to the elasticsearch.yml configuration file:

node.master: true
node.data: false

Configuring the query processing-only nodes

For large enough deployments, it is also wise to have nodes that are only responsible for aggregating query results from other nodes. Such nodes should be configured as nonmaster and nondata, so they should have the following properties in the elasticsearch.yml configuration file:

node.master: false
node.data: false

Please note that the node.master and the node.data properties are set to true by default, but we tend to include them for configuration clarity.

The master election configuration

We already wrote about the master election configuration in Elasticsearch Server Second Edition, but this topic is very important, so we decided to refresh our knowledge about it. Imagine that you have a cluster built of 10 nodes. Everything is working fine until, one day, your network fails and three of your nodes are disconnected from the cluster, but they can still see each other. Because of Zen discovery and the master election process, the nodes that got disconnected elect a new master, and you end up with two clusters with the same name and two master nodes. Such a situation is called a split-brain, and you must avoid it as much as possible. When a split-brain happens, you end up with two (or more) clusters that won't join each other until the network (or any other) problems are fixed. If you index your data during this time, you may end up with data loss and unrecoverable situations when the nodes get joined together after the network split.

In order to prevent split-brain situations, or at least minimize the possibility of their occurrence, Elasticsearch provides the discovery.zen.minimum_master_nodes property. This property defines the minimum number of master eligible nodes that should be connected to each other in order to form a cluster.
So now, let's get back to our cluster. If we set the discovery.zen.minimum_master_nodes property to 50 percent of the total nodes available plus one (which is six, in our case), we would end up with a single cluster. Why is that? Before the network failure, we would have 10 nodes, which is more than six, and these nodes would form a cluster. After the disconnection of the three nodes, we would still have the first cluster up and running. However, because only three nodes disconnected and three is less than six, these three nodes wouldn't be allowed to elect a new master and they would wait for reconnection with the original cluster.

Zen discovery fault detection and configuration

Elasticsearch runs two detection processes while it is working. The first process sends ping requests from the current master node to all the other nodes in the cluster to check whether they are operational. The second process is the reverse of that: each of the nodes sends ping requests to the master in order to verify that it is still up and running and performing its duties. However, if we have a slow network or our nodes are in different hosting locations, the default configuration may not be sufficient. Because of this, the Elasticsearch discovery module exposes three properties that we can change:

- discovery.zen.fd.ping_interval: This defaults to 1s and specifies how often the node will send ping requests to the target node.
- discovery.zen.fd.ping_timeout: This defaults to 30s and specifies how long the node will wait for the sent ping request to be responded to. If your nodes are 100 percent utilized or your network is slow, you may consider increasing this value.
- discovery.zen.fd.ping_retries: This defaults to 3 and specifies the number of ping request retries before the target node is considered not operational. You can increase this value if your network has a high number of lost packets (or you can fix your network).

There is one more thing that we would like to mention. The master node is the only node that can change the state of the cluster. To achieve a proper sequence of cluster state updates, the Elasticsearch master node processes cluster state update requests one at a time, makes the changes locally, and sends the request to all the other nodes so that they can synchronize their state. The master node waits a given amount of time for the nodes to respond; if that time passes or all the nodes have responded with their acknowledgment information, it proceeds with processing the next cluster state update request. To change how long the master node waits for the other nodes to respond, modify the default 30-second timeout by setting the discovery.zen.publish_timeout property. Increasing this value may be needed for huge clusters working in an overloaded network.

The Amazon EC2 discovery

Amazon, in addition to selling goods, offers a few popular services such as storage or computing power in a pay-as-you-go model. The so-called Amazon Elastic Compute Cloud (EC2) provides server instances and, of course, these can be used to install and run Elasticsearch clusters (among many other things, as they are normal Linux machines). This is convenient: you pay for the instances that are needed in order to handle the current traffic or to speed up calculations, and you shut down unnecessary instances when the traffic is lower. Elasticsearch works well on EC2, but due to the nature of the environment, some features may work slightly differently.
One of the features that works differently is discovery, because Amazon EC2 doesn't support multicast discovery. Of course, we can switch to unicast discovery, but sometimes we want to be able to discover nodes automatically and, with unicast, we need to provide at least the initial list of hosts. However, there is an alternative: we can use the Amazon EC2 plugin, a plugin that combines the multicast and unicast discovery methods using the Amazon EC2 API.

Make sure that, during the setup of the EC2 instances, you set up communication between them (on ports 9200 and 9300 by default). This is crucial in order to have the Elasticsearch nodes communicate with each other and is thus required for the cluster to function. Of course, this communication depends on the network.bind_host and network.publish_host (or network.host) settings.

The EC2 plugin installation

The installation of this plugin is as simple as for most plugins. In order to install it, we should run the following command:

bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.0

The EC2 plugin's generic configuration

This plugin provides several configuration settings that we need to supply in order for the EC2 discovery to work:

- cloud.aws.access_key: The Amazon access key, one of the credential values you can find in the Amazon configuration panel
- cloud.aws.secret_key: The Amazon secret key, which, similar to the previously mentioned access key, can be found in the EC2 configuration panel

The last thing is to inform Elasticsearch that we want to use the new discovery type by setting the discovery.type property to the ec2 value, and to turn off multicast.

Optional EC2 discovery configuration options

The previously mentioned settings are sufficient to run the EC2 discovery, but in order to control the EC2 discovery plugin behavior, Elasticsearch exposes the following additional settings:

- cloud.aws.region: This is the region that will be used to connect with the Amazon EC2 web services. You should choose the region that is adequate for the region where your instances reside, for example, eu-west-1 for Ireland. The possible values are eu-west, sa-east, us-east, us-west-1, us-west-2, and ap-southeast-1.
- cloud.aws.ec2.endpoint: If you are using EC2 API services, instead of defining a region, you can provide an address of the AWS endpoint, for example, ec2.eu-west-1.amazonaws.com.
- cloud.aws.protocol: This is the protocol that should be used by the plugin to connect to the Amazon Web Services endpoints. By default, Elasticsearch will use the HTTPS protocol (which means setting the value of the property to https). We can also change this behavior and set the property to http for the plugin to use HTTP without encryption. We are also allowed to overwrite the cloud.aws.protocol setting for each service by using the cloud.aws.ec2.protocol and cloud.aws.s3.protocol properties (the possible values are the same: https and http).
- cloud.aws.proxy_host: Elasticsearch allows us to define a proxy that will be used to connect to the AWS endpoints. The cloud.aws.proxy_host property should be set to the address of the proxy that should be used.
- cloud.aws.proxy_port: The second property related to the AWS endpoints proxy allows us to specify the port on which the proxy is listening. The cloud.aws.proxy_port property should be set to the port on which the proxy listens.
- discovery.ec2.ping_timeout (the default: 3s): This is the time to wait for the response to a ping message sent to another node. After this time, the nonresponsive node will be considered dead and will be removed from the cluster. Increasing this value makes sense when dealing with network issues or when we have a lot of EC2 nodes.
The EC2 nodes scanning configuration

The last group of settings we want to mention allows us to configure a very important thing when building a cluster working inside the EC2 environment: the ability to filter the available Elasticsearch nodes in our Amazon Elastic Compute Cloud network. The Elasticsearch EC2 plugin exposes the following properties that can help us configure its behavior:

- discovery.ec2.host_type: This allows us to choose the host type that will be used to communicate with the other nodes in the cluster. The values we can use are private_ip (the default one; the private IP address will be used for communication), public_ip (the public IP address will be used for communication), private_dns (the private hostname will be used for communication), and public_dns (the public hostname will be used for communication).
- discovery.ec2.groups: This is a comma-separated list of security groups. Only nodes that fall within these groups can be discovered and included in the cluster.
- discovery.ec2.availability_zones: This is an array or comma-separated list of availability zones. Only nodes within the specified availability zones will be discovered and included in the cluster.
- discovery.ec2.any_group (this defaults to true): Setting this property to false will force the EC2 discovery plugin to discover only those nodes that reside in an Amazon instance that falls into all of the defined security groups. The default value requires only a single group to be matched.
- discovery.ec2.tag: This is a prefix for a group of EC2-related settings. When you launch your Amazon EC2 instances, you can define tags, which can describe the purpose of the instance, such as the customer name or environment type. You can then use these defined tags to limit the discovered nodes. Let's say you define a tag named environment with a value of qa. In the configuration, you can now specify discovery.ec2.tag.environment: qa, and only nodes running on instances with this tag will be considered for discovery.
- cloud.node.auto_attributes: When this is set to true, Elasticsearch will add EC2-related node attributes (such as the availability zone or group) to the node properties and will allow us to use them when adjusting the Elasticsearch shard allocation and configuring the shard placement.

Other discovery implementations

Zen discovery and EC2 discovery are not the only discovery types that are available. There are two more discovery types that are developed and maintained by the Elasticsearch team:

- Azure discovery: https://github.com/elasticsearch/elasticsearch-cloud-azure
- Google Compute Engine discovery: https://github.com/elasticsearch/elasticsearch-cloud-gce

In addition to these, there are a few discovery implementations provided by the community, such as the ZooKeeper discovery for older versions of Elasticsearch (https://github.com/sonian/elasticsearch-zookeeper).

The gateway and recovery configuration

The gateway module allows us to store all the data that is needed for Elasticsearch to work properly. This means that not only is the data in the Apache Lucene indices stored, but also all the metadata (for example, the index allocation settings), along with the mappings configuration for each index. Whenever the cluster state is changed, for example, when the allocation properties are changed, the cluster state will be persisted by using the gateway module.
When the cluster is started up, its state will be loaded using the gateway module and applied. One should remember that when configuring different nodes with different gateway types, indices will use the gateway type configuration present on the given node. If an index state should not be stored using the gateway module, one should explicitly set the index gateway type to none.

The gateway recovery process

Let's state explicitly that the recovery process is used by Elasticsearch to load the data stored with the use of the gateway module in order for Elasticsearch to work. Whenever a full cluster restart occurs, the gateway process kicks in to load all the relevant information we've mentioned: the metadata, the mappings, and of course, all the indices. When the recovery process starts, the primary shards are initialized first; then, depending on the replica state, the replicas are initialized using the gateway data, or the data is copied from the primary shards if the replicas are out of sync.

Elasticsearch allows us to configure when the cluster data should be recovered using the gateway module. We can tell Elasticsearch to wait for a certain number of master eligible or data nodes to be present in the cluster before starting the recovery process. However, one should remember that until the cluster is recovered, no operations performed on it will be allowed. This is done in order to avoid modification conflicts.

Configuration properties

Before we continue with the configuration, we would like to say one more thing. As you know, Elasticsearch nodes can play different roles: they can have the role of data nodes (the ones that hold data), they can have a master role, or they can be used only for request handling, which means not holding data and not being master eligible. Keeping all this in mind, let's now look at the gateway configuration properties that we are allowed to modify:

- gateway.recover_after_nodes: This is an integer number that specifies how many nodes should be present in the cluster for the recovery to happen. For example, when set to 5, at least 5 nodes (whether they are data or master eligible nodes) must be present for the recovery process to start.
- gateway.recover_after_data_nodes: This is an integer number that allows us to set how many data nodes should be present in the cluster for the recovery process to start.
- gateway.recover_after_master_nodes: This is another gateway configuration option that allows us to set how many master eligible nodes should be present in the cluster for the recovery to start.
- gateway.recover_after_time: This allows us to set how much time to wait before the recovery process starts after the conditions defined by the preceding properties are met. If we set this property to 5m, we tell Elasticsearch to start the recovery process 5 minutes after all the defined conditions are met. The default value for this property is 5m, starting from Elasticsearch 1.3.0.

Let's imagine that we have six nodes in our cluster, out of which four are data eligible. We also have an index built of three shards, which are spread across the cluster. The last two nodes are master eligible and don't hold data. What we would like is for the recovery process to be delayed for 3 minutes after the four data nodes are present.
Our gateway configuration could look like this:

gateway.recover_after_data_nodes: 4
gateway.recover_after_time: 3m

Expectations on nodes

In addition to the already mentioned properties, we can also specify properties that will force the recovery process of Elasticsearch. These properties are:

- gateway.expected_nodes: This is the number of nodes expected to be present in the cluster for the recovery to start immediately. If you don't need the recovery to be delayed, it is advised that you set this property to the number of nodes (or at least most of them) with which the cluster will be formed, because that will guarantee that the latest cluster state will be recovered.
- gateway.expected_data_nodes: This is the number of expected data eligible nodes to be present in the cluster for the recovery process to start immediately.
- gateway.expected_master_nodes: This is the number of expected master eligible nodes to be present in the cluster for the recovery process to start immediately.

Now, let's get back to our previous example. We know that when all six nodes are connected and are in the cluster, we want the recovery to start. So, in addition to the preceding configuration, we would add the following property:

gateway.expected_nodes: 6

So the whole configuration would look like this:

gateway.recover_after_data_nodes: 4
gateway.recover_after_time: 3m
gateway.expected_nodes: 6

The preceding configuration says that the recovery process will be delayed for 3 minutes once four data nodes join the cluster and will begin immediately after six nodes are in the cluster (no matter whether they are data nodes or master eligible nodes).

The local gateway

With the release of Elasticsearch 0.20 (and some of the 0.19 releases), all the gateway types apart from the default local gateway type were deprecated. It is advised that you do not use them, because they will be removed in future versions of Elasticsearch. This has not happened yet, but if you want to avoid full data reindexation, you should only use the local gateway type, and this is why we won't discuss all the other types.

The local gateway type uses the local storage available on a node to store the metadata, mappings, and indices. In order to use this gateway type and the local storage available on the node, there needs to be enough disk space to hold the data with no memory caching. The persistence to the local gateway is different from the other (currently present but deprecated) gateways: the writes to this gateway are done in a synchronous manner in order to ensure that no data will be lost during the write process. In order to set the type of gateway that should be used, one should use the gateway.type property, which is set to local by default.

There is one additional thing regarding the local gateway of Elasticsearch that we haven't talked about yet: dangling indices. When a node joins a cluster, all the shards and indices that are present on the node but not present in the cluster will be included in the cluster state. Such indices are called dangling indices, and we are allowed to choose how Elasticsearch should treat them. Elasticsearch exposes the gateway.local.auto_import_dangling property, which can take the value of yes (the default value, which results in importing all dangling indices into the cluster), close (which results in importing the dangling indices into the cluster state but keeps them closed by default), or no (which results in removing the dangling indices).
When setting the gateway.local.auto_import_dangling property to no, we can also set the gateway.local.dangling_timeout property (defaults to 2h) to specify how long Elasticsearch will wait before deleting the dangling indices. The dangling indices feature can be nice when we restart old Elasticsearch nodes, and we don't want old indices to be included in the cluster.

Low-level recovery configuration

We discussed that we can use the gateway to configure the behavior of the Elasticsearch recovery process, but in addition to that, Elasticsearch allows us to configure the recovery process itself. We decided that it would be good to mention those properties here, in the section dedicated to gateway and recovery.

Cluster-level recovery configuration

The recovery configuration is specified mostly on the cluster level and allows us to set general rules for the recovery module to work with. These settings are:

indices.recovery.concurrent_streams: This defaults to 3 and specifies the number of concurrent streams that are allowed to be opened in order to recover a shard from its source. The higher the value of this property, the more pressure will be put on the networking layer; however, the recovery may be faster, depending on your network usage and throughput.

indices.recovery.max_bytes_per_sec: By default, this is set to 20MB and specifies the maximum amount of data that can be transferred per second during shard recovery. In order to disable data transfer limiting, one should set this property to 0. Similar to the number of concurrent streams, this property allows us to control the network usage of the recovery process. Setting this property to higher values may result in higher network utilization and a faster recovery process.

indices.recovery.compress: This is set to true by default and allows us to define whether Elasticsearch should compress the data that is transferred during the recovery process. Setting this to false may lower the pressure on the CPU, but it will also result in more data being transferred over the network.

indices.recovery.file_chunk_size: This is the chunk size used to copy the shard data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true.

indices.recovery.translog_ops: This defaults to 1000 and specifies how many transaction log lines should be transferred between shards in a single request during the recovery process.

indices.recovery.translog_size: This is the chunk size used to copy the shard transaction log data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true.

In the versions prior to Elasticsearch 0.90.0, there was the indices.recovery.max_size_per_sec property that could be used, but it was deprecated, and it is suggested that you use the indices.recovery.max_bytes_per_sec property instead. However, if you are using an Elasticsearch version older than 0.90.0, it may be worth remembering this. All the previously mentioned settings can be updated using the Cluster Update API, or they can be set in the elasticsearch.yml file.

Index-level recovery settings

In addition to the values mentioned previously, there is a single property that can be set on a per-index basis. It can be set both in the elasticsearch.yml file and using the indices Update Settings API, and it is called index.recovery.initial_shards. 
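As a quick, hedged illustration of the previous paragraph, the dynamic cluster-level settings could be changed at runtime through the Cluster Update Settings API, while the index-level property can simply be placed in elasticsearch.yml. The endpoint is real, but the values used here are only examples and should be tuned for your own cluster:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.recovery.concurrent_streams": 5,
    "indices.recovery.max_bytes_per_sec": "40mb"
  }
}'

And in elasticsearch.yml, for example:

index.recovery.initial_shards: quorum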
In general, Elasticsearch will only recover a particular shard when there is a quorum of shards present and if that quorum can be allocated. A quorum is 50 percent of the shards for the given index plus one. By using the index.recovery.initial_shards property, we can change what Elasticsearch will take as a quorum. This property can be set to one of the following values:

quorum: 50 percent of the shards for the given index plus one need to be present and be allocable. This is the default value.

quorum-1: 50 percent of the shards for a given index need to be present and be allocable.

full: All of the shards for the given index need to be present and be allocable.

full-1: All but one of the shards for the given index need to be present and be allocable.

integer value: Any integer, such as 1, 2, or 5, specifies the number of shards that need to be present and allocable. For example, setting this value to 2 means that at least two shards need to be present and allocable for the recovery to start.

It is good to know about this property, but the default value will be sufficient for most deployments.

Summary

In this article, we focused more on the Elasticsearch configuration and new features that were introduced in Elasticsearch 1.0. We configured discovery and recovery, and we used the human-friendly Cat API. In addition to that, we used the backup and restore functionality, which allowed easy backup and recovery of our indices. Finally, we looked at what federated search is and how to search and index data to multiple clusters, while still using all the functionalities of Elasticsearch and being connected to a single node. If you want to dig deeper, buy the book Mastering Elasticsearch, Second Edition and read it in a simple step-by-step fashion to enhance your Elasticsearch knowledge further.

Resources for Article:
Further resources on this subject:
Downloading and Setting Up ElasticSearch [Article]
Indexing the Data [Article]
Driving Visual Analyses with Automobile Data (Python) [Article]

SciPy for Signal Processing

Packt
03 Mar 2015
14 min read
In this article by Sergio J. Rojas G. and Erik A Christensen, authors of the book Learning SciPy for Numerical and Scientific Computing - Second Edition, we will focus on the usage of some most commonly used routines that are included in SciPy modules—scipy.signal, scipy.ndimage, and scipy.fftpack, which are used for signal processing, multidimensional image processing, and computing Fourier transforms, respectively. We define a signal as data that measures either a time-varying or spatially varying phenomena. Sound or electrocardiograms are excellent examples of time-varying quantities, while images embody the quintessential spatially varying cases. Moving images are treated with the techniques of both types of signals, obviously. The field of signal processing treats four aspects of this kind of data: its acquisition, quality improvement, compression, and feature extraction. SciPy has many routines to treat effectively tasks in any of the four fields. All these are included in two low-level modules (scipy.signal being the main module, with an emphasis on time-varying data, and scipy.ndimage, for images). Many of the routines in these two modules are based on Discrete Fourier Transform of the data. SciPy has an extensive package of applications and definitions of these background algorithms, scipy.fftpack, which we will start covering first. (For more resources related to this topic, see here.) Discrete Fourier Transforms The Discrete Fourier Transform (DFT from now on) transforms any signal from its time/space domain into a related signal in the frequency domain. This allows us not only to be able to analyze the different frequencies of the data, but also for faster filtering operations, when used properly. It is possible to turn a signal in the frequency domain back to its time/spatial domain; thanks to the Inverse Fourier Transform. We will not go into detail of the mathematics behind these operators, since we assume familiarity at some level with this theory. We will focus on syntax and applications instead. The basic routines in the scipy.fftpack module compute the DFT and its inverse, for discrete signals in any dimension, which are fft and ifft (one dimension), fft2 and ifft2 (two dimensions), and fftn and ifftn (any number of dimensions). All of these routines assume that the data is complex valued. If we know beforehand that a particular dataset is actually real valued, and should offer real-valued frequencies, we use rfft and irfft instead, for a faster algorithm. All these routines are designed so that composition with their inverses always yields the identity. The syntax is the same in all cases, as follows: fft(x[, n, axis, overwrite_x]) The first parameter, x, is always the signal in any array-like form. Note that fft performs one-dimensional transforms. This means in particular, that if x happens to be two-dimensional, for example, fft will output another two-dimensional array where each row is the transform of each row of the original. We can change it to columns instead, with the optional parameter, axis. The rest of parameters are also optional; n indicates the length of the transform, and overwrite_x gets rid of the original data to save memory and resources. We usually play with the integer n when we need to pad the signal with zeros, or truncate it. For higher dimension, n is substituted by shape (a tuple), and axis by axes (another tuple). 
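Before we look at a two-dimensional example, a minimal sketch can illustrate the identity property just mentioned; the signal values here are arbitrary:

>>> import numpy
>>> from scipy.fftpack import fft, ifft
>>> x = numpy.array([1.0, 2.0, 1.0, -1.0, 1.5])
>>> y = fft(x)                        # transform to the frequency domain
>>> numpy.allclose(ifft(y), x)        # composing with the inverse recovers x
True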
To better understand the output, it is often useful to shift the zero frequencies to the center of the output arrays with fftshift. The inverse of this operation, ifftshift, is also included in the module. The following code shows some of these routines in action, when applied to a checkerboard image: >>> import numpy >>> from scipy.fftpack import fft,fft2, fftshift >>> import matplotlib.pyplot as plt >>> B=numpy.ones((4,4)); W=numpy.zeros((4,4)) >>> signal = numpy.bmat("B,W;W,B") >>> onedimfft = fft(signal,n=16) >>> twodimfft = fft2(signal,shape=(16,16)) >>> plt.figure() >>> plt.gray() >>> plt.subplot(121,aspect='equal') >>> plt.pcolormesh(onedimfft.real) >>> plt.colorbar(orientation='horizontal') >>> plt.subplot(122,aspect='equal') >>> plt.pcolormesh(fftshift(twodimfft.real)) >>> plt.colorbar(orientation='horizontal') >>> plt.show() Note how the first four rows of the one-dimensional transform are equal (and so are the last four), while the two-dimensional transform (once shifted) presents a peak at the origin, and nice symmetries in the frequency domain. In the following screenshot (obtained from the preceding code), the left-hand side image is fft and the right-hand side image is fft2 of a 2 x 2 checkerboard signal: The scipy.fftpack module also offers the Discrete Cosine Transform with its inverse (dct, idct) as well as many differential and pseudo-differential operators defined in terms of all these transforms: diff (for derivative/integral), hilbert and ihilbert (for the Hilbert transform), tilbert and itilbert (for the h-Tilbert transform of periodic sequences), and so on. Signal construction To aid in the construction of signals with predetermined properties, the scipy.signal module has a nice collection of the most frequent one-dimensional waveforms in the literature: chirp and sweep_poly (for the frequency-swept cosine generator), gausspulse (a Gaussian modulated sinusoid) and sawtooth and square (for the waveforms with those names). They all take as their main parameter a one-dimensional ndarray representing the times at which the signal is to be evaluated. Other parameters control the design of the signal, according to frequency or time constraints. Let's take a look into the following code snippet, which illustrates the use of these one dimensional waveforms that we just discussed: >>> import numpy >>> from scipy.signal import chirp, sawtooth, square, gausspulse >>> import matplotlib.pyplot as plt >>> t=numpy.linspace(-1,1,1000) >>> plt.subplot(221); plt.ylim([-2,2]) >>> plt.plot(t,chirp(t,f0=100,t1=0.5,f1=200))   # plot a chirp >>> plt.subplot(222); plt.ylim([-2,2]) >>> plt.plot(t,gausspulse(t,fc=10,bw=0.5))     # Gauss pulse >>> plt.subplot(223); plt.ylim([-2,2]) >>> t*=3*numpy.pi >>> plt.plot(t,sawtooth(t))                     # sawtooth >>> plt.subplot(224); plt.ylim([-2,2]) >>> plt.plot(t,square(t))                       # Square wave >>> plt.show() Generated by this code, the following diagram shows waveforms for chirp (upper-left), gausspulse (upper-right), sawtooth (lower-left), and square (lower-right): The usual method of creating signals is to import them from the file. This is possible by using purely NumPy routines, for example fromfile: fromfile(file, dtype=float, count=-1, sep='') The file argument may point to either a file or a string, the count argument is used to determine the number of items to read, and sep indicates what constitutes a separator in the original file/string. 
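As a small, hedged illustration of fromfile (the signal.dat filename is just a placeholder), we can dump an array to disk with tofile and read it back:

>>> import numpy
>>> data = numpy.sin(numpy.linspace(0, 2*numpy.pi, 100))
>>> data.tofile('signal.dat')                          # raw binary dump
>>> recovered = numpy.fromfile('signal.dat', dtype=float)
>>> numpy.allclose(recovered, data)
True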
For images, we have the versatile routine, imread in either the scipy.ndimage or scipy.misc module: imread(fname, flatten=False) The fname argument is a string containing the location of an image. The routine infers the type of file, and reads the data into an array, accordingly. In case the flatten argument is turned to True, the image is converted to gray scale. Note that, in order to work, the Python Imaging Library (PIL) needs to be installed. It is also possible to load .wav files for analysis, with the read and write routines from the wavfile submodule in the scipy.io module. For instance, given any audio file with this format, say audio.wav, the command, rate,data = scipy.io.wavfile.read("audio.wav"), assigns an integer value to the rate variable, indicating the sample rate of the file (in samples per second), and a NumPy ndarray to the data variable, containing the numerical values assigned to the different notes. If we wish to write some one-dimensional ndarray data into an audio file of this kind, with the sample rate given by the rate variable, we may do so by issuing the following command: >>> scipy.io.wavfile.write("filename.wav",rate,data) Filters A filter is an operation on signals that either removes features or extracts some component. SciPy has a very complete set of known filters, as well as the tools to allow construction of new ones. The complete list of filters in SciPy is long, and we encourage the reader to explore the help documents of the scipy.signal and scipy.ndimage modules for the complete picture. We will introduce in these pages, as an exposition, some of the most used filters in the treatment of audio or image processing. We start by creating a signal worth filtering: >>> from numpy import sin, cos, pi, linspace >>> f=lambda t: cos(pi*t) + 0.2*sin(5*pi*t+0.1) + 0.2*sin(30*pi*t)    + 0.1*sin(32*pi*t+0.1) + 0.1*sin(47* pi*t+0.8) >>> t=linspace(0,4,400); signal=f(t) We first test the classical smoothing filter of Wiener and Kolmogorov, wiener. We present in a plot, the original signal (in black) and the corresponding filtered data, with a choice of a Wiener window of the size 55 samples (in blue). Next, we compare the result of applying the median filter, medfilt, with a kernel of the same size as before (in red): >>> from scipy.signal import wiener, medfilt >>> import matplotlib.pylab as plt >>> plt.plot(t,signal,'k') >>> plt.plot(t,wiener(signal,mysize=55),'r',linewidth=3) >>> plt.plot(t,medfilt(signal,kernel_size=55),'b',linewidth=3) >>> plt.show() This gives us the following graph showing the comparison of smoothing filters (wiener is the one that has its starting point just below 0.5 and medfilt has its starting point just above 0.5): Most of the filters in the scipy.signal module can be adapted to work in arrays of any dimension. But in the particular case of images, we prefer to use the implementations in the scipy.ndimage module, since they are coded with these objects in mind. For instance, to perform a median filter on an image for smoothing, we use scipy.ndimage.median_filter. Let's see an example. 
We will start by loading Lena to the array and corrupting the image with Gaussian noise (zero mean and standard deviation of 16): >>> from scipy.stats import norm     # Gaussian distribution >>> import matplotlib.pyplot as plt >>> import scipy.misc >>> import scipy.ndimage >>> plt.gray() >>> lena=scipy.misc.lena().astype(float) >>> plt.subplot(221); >>> plt.imshow(lena) >>> lena+=norm(loc=0,scale=16).rvs(lena.shape) >>> plt.subplot(222); >>> plt.imshow(lena) >>> denoised_lena = scipy.ndimage.median_filter(lena,3) >>> plt.subplot(224); >>> plt.imshow(denoised_lena) The set of filters for images come in two flavors—statistical and morphological. For example, among the filters of statistical nature, we have the Sobel algorithm oriented to detection of edges (singularities along curves). Its syntax is as follows: sobel(image, axis=-1, output=None, mode='reflect', cval=0.0) The optional parameter, axis, indicates the dimension in which the computations are performed. By default, this is always the last axis (-1). The mode parameter, which is one of the strings 'reflect', 'constant', 'nearest', 'mirror', or 'wrap', indicates how to handle the border of the image, in case there is insufficient data to perform the computations there. In case the mode is 'constant', we may indicate the value to use in the border, with the cval parameter. Let's look into the following code snippet, which illustrates the use of the sobel filter: >>> from scipy.ndimage.filters import sobel >>> import numpy >>> lena=scipy.misc.lena() >>> sblX=sobel(lena,axis=0); sblY=sobel(lena,axis=1) >>> sbl=numpy.hypot(sblX,sblY) >>> plt.subplot(223); >>> plt.imshow(sbl) >>> plt.show() The following screenshot illustrates Lena (upper-left) and noisy Lena (upper-right) with the preceding two filters in action—edge map with sobel (lower-left) and median filter (lower-right): Morphology We also have the possibility of creating and applying filters to images based on mathematical morphology, both to binary and gray-scale images. The four basic morphological operations are opening (binary_opening), closing (binary_closing), dilation (binary_dilation), and erosion (binary_erosion). Note that the syntax for each of these filters is very simple, since we only need two ingredients—the signal to filter and the structuring element to perform the morphological operation. Let's take a look into the general syntax for these morphological operations: binary_operation(signal, structuring_element) We may use combinations of these four basic morphological operations to create more complex filters for removal of holes, hit-or-miss transforms (to find the location of specific patterns in binary images), denoising, edge detection, and many more. The SciPy module also allows for creating some common filters using the preceding syntax. 
For instance, for the location of the letter e in a text, we could use the following command instead: >>> binary_hit_or_miss(text, letterE) For comparative purposes, let's use this command in the following code snippet: >>> import numpy >>> import scipy.ndimage >>> import matplotlib.pylab as plt >>> from scipy.ndimage.morphology import binary_hit_or_miss >>> text = scipy.ndimage.imread('CHAP_05_input_textImage.png') >>> letterE = text[37:53,275:291] >>> HitorMiss = binary_hit_or_miss(text, structure1=letterE,    origin1=1) >>> eLocation = numpy.where(HitorMiss==True) >>> x=eLocation[1]; y=eLocation[0] >>> plt.imshow(text, cmap=plt.cm.gray, interpolation='nearest') >>> plt.autoscale(False) >>> plt.plot(x,y,'wo',markersize=10) >>> plt.axis('off') >>> plt.show() The output for the preceding lines of code is generated as follows: For gray-scale images, we may use a structuring element (structuring_element) or a footprint. The syntax is, therefore, a little different: grey_operation(signal, [structuring_element, footprint, size, ...]) If we desire to use a completely flat and rectangular structuring element (all ones), then it is enough to indicate the size as a tuple. For instance, to perform gray-scale dilation of a flat element of size (15,15) on our classical image of Lena, we issue the following command: >>> grey_dilation(lena, size=(15,15)) The last kind of morphological operations coded in the scipy.ndimage module perform distance and feature transforms. Distance transforms create a map that assigns to each pixel, the distance to the nearest object. Feature transforms provide with the index of the closest background element instead. These operations are used to decompose images into different labels. We may even choose different metrics such as Euclidean distance, chessboard distance, and taxicab distance. The syntax for the distance transform (distance_transform) using a brute force algorithm is as follows: distance_transform_bf(signal, metric='euclidean', sampling=None, return_distances=True, return_indices=False,                      distances=None, indices=None) We indicate the metric with the strings such as 'euclidean', 'taxicab', or 'chessboard'. If we desire to provide the feature transform instead, we switch return_distances to False and return_indices to True. Similar routines are available with more sophisticated algorithms—distance_transform_cdt (using chamfering for taxicab and chessboard distances). For Euclidean distance, we also have distance_transform_edt. All these use the same syntax. Summary In this article, we explored signal processing (any dimensional) including the treatment of signals in frequency space, by means of their Discrete Fourier Transforms. These correspond to the fftpack, signal, and ndimage modules. Resources for Article: Further resources on this subject: Signal Processing Techniques [article] SciPy for Computational Geometry [article] Move Further with NumPy Modules [article]

Performance Considerations

Packt
03 Mar 2015
13 min read
In this article by Dayong Du, the author of Apache Hive Essentials, we will look at the different performance considerations when using Hive. Although Hive is built to deal with big data, we still cannot ignore the importance of performance. Most of the time, a better Hive query can rely on the smart query optimizer to find the best execution strategy as well as the default setting best practice from vendor packages. However, as experienced users, we should learn more about the theory and practice of performance tuning in Hive, especially when working in a performance-based project or environment. We will start from utilities available in Hive to find potential issues causing poor performance. Then, we introduce the best practices of performance considerations in the areas of queries and job. (For more resources related to this topic, see here.) Performance utilities Hive provides the EXPLAIN and ANALYZE statements that can be used as utilities to check and identify the performance of queries. The EXPLAIN statement Hive provides an EXPLAIN command to return a query execution plan without running the query. We can use an EXPLAIN command for queries if we have a doubt or a concern about performance. The EXPLAIN command will help to see the difference between two or more queries for the same purpose. The syntax for EXPLAIN is as follows: EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query The following keywords can be used: EXTENDED: This provides additional information for the operators in the plan, such as file pathname and abstract syntax tree. DEPENDENCY: This provides a JSON format output that contains a list of tables and partitions that the query depends on. It is available since HIVE 0.10.0. AUTHORIZATION: This lists all entities needed to be authorized including input and output to run the Hive query and authorization failures, if any. It is available since HIVE 0.14.0. A typical query plan contains the following three sections. We will also have a look at an example later: Abstract syntax tree (AST): Hive uses a pacer generator called ANTLR (see http://www.antlr.org/) to automatically generate a tree of syntax for HQL. We can usually ignore this most of the time. Stage dependencies: This lists all dependencies and number of stages used to run the query. Stage plans: It contains important information, such as operators and sort orders, for running the job. The following is what a typical query plan looks like. From the following example, we can see that the AST section is not shown since the EXTENDED keyword is not used with EXPLAIN. In the STAGE DEPENDENCIES section, both Stage-0 and Stage-1 are independent root stages. In the STAGE PLANS section, Stage-1 has one map and reduce referred to by Map Operator Tree and Reduce Operator Tree. Inside each Map/Reduce Operator Tree section, all operators corresponding to Hive query keywords as well as expressions and aggregations are listed. The Stage-0 stage does not have map and reduce. It is just a Fetch operation. jdbc:hive2://> EXPLAIN SELECT sex_age.sex, count(*). . . . . . .> FROM employee_partitioned. . . . . . 
.> WHERE year=2014 GROUP BY sex_age.sex LIMIT 2;+-----------------------------------------------------------------------------+| Explain |+-----------------------------------------------------------------------------+| STAGE DEPENDENCIES: || Stage-1 is a root stage || Stage-0 is a root stage || || STAGE PLANS: || Stage: Stage-1 || Map Reduce || Map Operator Tree: || TableScan || alias: employee_partitioned || Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL || Column stats: NONE || Select Operator || expressions: sex_age (type: struct<sex:string,age:int>) || outputColumnNames: sex_age || Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL || Column stats: NONE || Group By Operator || aggregations: count() || keys: sex_age.sex (type: string) || mode: hash || outputColumnNames: _col0, _col1 || Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL || Column stats: NONE || Reduce Output Operator || key expressions: _col0 (type: string) || sort order: + || Map-reduce partition columns: _col0 (type: string) || Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL|| Column stats: NONE || value expressions: _col1 (type: bigint) || Reduce Operator Tree: || Group By Operator || aggregations: count(VALUE._col0) || keys: KEY._col0 (type: string) || mode: mergepartial || outputColumnNames: _col0, _col1 || Statistics: Num rows: 0 Data size: 0 Basic stats: NONE || Column stats: NONE || Select Operator || expressions: _col0 (type: string), _col1 (type: bigint) || outputColumnNames: _col0, _col1 || Statistics: Num rows: 0 Data size: 0 Basic stats: NONE || Column stats: NONE || Limit || Number of rows: 2 || Statistics: Num rows: 0 Data size: 0 Basic stats: NONE || Column stats: NONE || File Output Operator || compressed: false || Statistics: Num rows: 0 Data size: 0 Basic stats: NONE || Column stats: NONE || table: || input format: org.apache.hadoop.mapred.TextInputFormat || output format:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|| serde:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe|| || Stage: Stage-0 || Fetch Operator || limit: 2 |+-----------------------------------------------------------------------------+53 rows selected (0.26 seconds) The ANALYZE statement Hive statistics are a collection of data that describe more details, such as the number of rows, number of files, and raw data size, on the objects in the Hive database. Statistics is a metadata of Hive data. Hive supports statistics at the table, partition, and column level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which is an optimizer to pick the query plan with the lowest cost in terms of system resources required to complete the query. The statistics are gathered through the ANALYZE statement since Hive 0.10.0 on tables, partitions, and columns as given in the following examples: jdbc:hive2://> ANALYZE TABLE employee COMPUTE STATISTICS;No rows affected (27.979 seconds)jdbc:hive2://> ANALYZE TABLE employee_partitioned. . . . . . .> PARTITION(year=2014, month=12) COMPUTE STATISTICS;No rows affected (45.054 seconds)jdbc:hive2://> ANALYZE TABLE employee_id COMPUTE STATISTICS. . . . . . .> FOR COLUMNS employee_id;No rows affected (41.074 seconds) Once the statistics are built, we can check the statistics by the DESCRIBE EXTENDED/FORMATTED statement. 
From the table/partition output, we can find the statistics information inside the parameters, such as parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}). The following is an example: jdbc:hive2://> DESCRIBE EXTENDED employee_partitioned. . . . . . .> PARTITION(year=2014, month=12);jdbc:hive2://> DESCRIBE EXTENDED employee;…parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}).jdbc:hive2://> DESCRIBE FORMATTED employee.name;+--------+---------+---+---+---------+--------------+-----------+-----------+|col_name|data_type|min|max|num_nulls|distinct_count|avg_col_len|max_col_len|+--------+---------+---+---+---------+--------------+-----------+-----------+| name | string | | | 0 | 5 | 5.6 | 7 |+--------+---------+---+---+---------+--------------+-----------+-----------++---------+----------+-----------------+|num_trues|num_falses| comment |+---------+----------+-----------------+| | |from deserializer|+---------+----------+-----------------+3 rows selected (0.116 seconds) Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are automatically computed by default if we enable the following setting: jdbc:hive2://> SET hive.stats.autogather=ture; Hive logs Logs provide useful information to find out how a Hive query/job runs. By checking the Hive logs, we can identify runtime problems and issues that may cause bad performance. There are two types of logs available in Hive: system log and job log. The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties. The following three lines for Hive log can be found: hive.root.logger=WARN,DRFAhive.log.dir=/tmp/${user.name}hive.log.file=hive.log To modify the status, we can either modify the preceding lines in hive-log4j.properties (applies to all users) or set from the Hive CLI (only applies to the current user and current session) as follows: hive --hiveconf hive.root.logger=DEBUG,console The job log contains Hive query information and is saved at the same place, /tmp/${user.name}, by default as one file for each Hive user session. We can override it in hive-site.xml with the hive.querylog.location property. If a Hive query generates MapReduce jobs, those logs can also be viewed through the Hadoop JobTracker Web UI. Job and query optimization Job and query optimization covers experience and skills to improve performance in the area of job-running mode, JVM reuse, job parallel running, and query optimizations in JOIN. Local mode Hadoop can run in standalone, pseudo-distributed, and fully distributed mode. Most of the time, we need to configure Hadoop to run in fully distributed mode. When the data to process is small, it is an overhead to start distributed data processing since the launching time of the fully distributed mode takes more time than the job processing time. 
Since Hive 0.7.0, Hive supports automatic conversion of a job to run in local mode with the following settings: jdbc:hive2://> SET hive.exec.mode.local.auto=true; --default falsejdbc:hive2://> SET hive.exec.mode.local.auto.inputbytes.max=50000000;jdbc:hive2://> SET hive.exec.mode.local.auto.input.files.max=5;--default 4 A job must satisfy the following conditions to run in the local mode: The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max The total number of reduce tasks required is 1 or 0 JVM reuse By default, Hadoop launches a new JVM for each map or reduce job and runs the map or reduce task in parallel. When the map or reduce job is a lightweight job running only for a few seconds, the JVM startup process could be a significant overhead. The MapReduce framework (version 1 only, not Yarn) has an option to reuse JVM by sharing the JVM to run mapper/reducer serially instead of parallel. JVM reuse applies to map or reduce tasks in the same job. Tasks from different jobs will always run in a separate JVM. To enable the reuse, we can set the maximum number of tasks for a single job for JVM reuse using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1: jdbc:hive2://> SET mapred.job.reuse.jvm.num.tasks=5; We can also set the value to –1 to indicate that all the tasks for a job will run in the same JVM. Parallel execution Hive queries commonly are translated into a number of stages that are executed by the default sequence. These stages are not always dependent on each other. Instead, they can run in parallel to save the overall job running time. We can enable this feature with the following settings: jdbc:hive2://> SET hive.exec.parallel=true; -- default falsejdbc:hive2://> SET hive.exec.parallel.thread.number=16;-- default 8, it defines the max number for running in parallel Parallel execution will increase the cluster utilization. If the utilization of a cluster is already very high, parallel execution will not help much in terms of overall performance. Join optimization Here, we'll briefly review the key settings for join improvement. Common join The common join is also called reduce side join. It is a basic join in Hive and works for most of the time. For common joins, we need to make sure the big table is on the right-most side or specified by hit, as follows: /*+ STREAMTABLE(stream_table_name) */. Map join Map join is used when one of the join tables is small enough to fit in the memory, so it is very fast but limited. Since Hive 0.7.0, Hive can convert map join automatically with the following settings: jdbc:hive2://> SET hive.auto.convert.join=true; --default falsejdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000;--default 25Mjdbc:hive2://> SET hive.auto.convert.join.noconditionaltask=true;--default false. Set to true so that map join hint is not needed jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask.size=10000000;--The default value controls the size of table to fit in memory Once autoconvert is enabled, Hive will automatically check if the smaller table file size is bigger than the value specified by hive.mapjoin.smalltable.filesize, and then Hive will convert the join to a common join. If the file size is smaller than this threshold, it will try to convert the common join into a map join. Once autoconvert join is enabled, there is no need to provide the map join hints in the query. 
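As a hedged sketch of what this looks like in practice (the employee_hr table and its columns are hypothetical), once autoconversion is enabled, a plain join is written without any hints and Hive decides at runtime whether the smaller side fits in memory:

jdbc:hive2://> SET hive.auto.convert.join=true;
jdbc:hive2://> SELECT e.name, h.start_date
. . . . . . .> FROM employee e
. . . . . . .> JOIN employee_hr h ON e.name = h.name;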
Bucket map join Bucket map join is a special type of map join applied on the bucket tables. To enable bucket map join, we need to enable the following settings: jdbc:hive2://> SET hive.auto.convert.join=true; --default falsejdbc:hive2://> SET hive.optimize.bucketmapjoin=true; --default false In bucket map join, all the join tables must be bucket tables and join on buckets columns. In addition, the buckets number in bigger tables must be a multiple of the bucket number in the small tables. Sort merge bucket (SMB) join SMB is the join performed on the bucket tables that have the same sorted, bucket, and join condition columns. It reads data from both bucket tables and performs common joins (map and reduce triggered) on the bucket tables. We need to enable the following properties to use SMB: jdbc:hive2://> SET hive.input.format=. . . . . . .> org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true; Sort merge bucket map (SMBM) join SMBM join is a special bucket join but triggers map-side join only. It can avoid caching all rows in the memory like map join does. To perform SMBM joins, the join tables must have the same bucket, sort, and join condition columns. To enable such joins, we need to enable the following settings: jdbc:hive2://> SET hive.auto.convert.join=true;jdbc:hive2://> SET hive.auto.convert.sortmerge.join=truejdbc:hive2://> SET hive.optimize.bucketmapjoin=true;jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;jdbc:hive2://> SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ; Skew join When working with data that has a highly uneven distribution, the data skew could happen in such a way that a small number of compute nodes must handle the bulk of the computation. The following setting informs Hive to optimize properly if data skew happens: jdbc:hive2://> SET hive.optimize.skewjoin=true;--If there is data skew in join, set it to true. Default is false. jdbc:hive2://> SET hive.skewjoin.key=100000;--This is the default value. If the number of key is bigger than--this, the new keys will send to the other unused reducers. Skew data could happen on the GROUP BY data too. To optimize it, we need to do the following settings to enable skew data optimization in the GROUP BY result: SET hive.groupby.skewindata=true; Once configured, Hive will first trigger an additional MapReduce job whose map output will randomly distribute to the reducer to avoid data skew. For more information about Hive join optimization, please refer to the Apache Hive wiki available at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization. Summary In this article, we first covered how to identify performance bottlenecks using the EXPLAIN and ANALYZE statements. Then, we discussed job and query optimization in Hive. Resources for Article: Further resources on this subject: Apache Maven and m2eclipse [Article] Apache Karaf – Provisioning and Clusters [Article] Introduction to Apache ZooKeeper [Article]

MapReduce functions

Packt
03 Mar 2015
11 min read
In this article by John Zablocki, author of the book Couchbase Essentials, you will be acquainted with MapReduce and how you'll use it to create secondary indexes for our documents. At its simplest, MapReduce is a programming pattern used to process large amounts of data that is typically distributed across several nodes in parallel. In the NoSQL world, MapReduce implementations may be found on many platforms from MongoDB to Hadoop, and of course, Couchbase. Even if you're new to the NoSQL landscape, it's quite possible that you've already worked with a form of MapReduce. The inspiration for MapReduce in distributed NoSQL systems was drawn from the functional programming concepts of map and reduce. While purely functional programming languages haven't quite reached mainstream status, languages such as Python, C#, and JavaScript all support map and reduce operations. (For more resources related to this topic, see here.)

Map functions

Consider the following Python snippet:

numbers = [1, 2, 3, 4, 5]
doubled = map(lambda n: n * 2, numbers)
# doubled == [2, 4, 6, 8, 10]

These two lines of code demonstrate a very simple use of a map() function. In the first line, the numbers variable is created as a list of integers. The second line applies a function to the list to create a new mapped list. In this case, the map() function is supplied as a Python lambda, which is just an inline, unnamed function. The body of the lambda multiplies each number by two. This map() function can be made slightly more complex by doubling only odd numbers, as shown in this code:

numbers = [1, 2, 3, 4, 5]

def double_odd(num):
    if num % 2 == 0:
        return num
    else:
        return num * 2

doubled = map(double_odd, numbers)
# doubled == [2, 2, 6, 4, 10]

Map functions are implemented differently in each language or platform that supports them, but all follow the same pattern. An iterable collection of objects is passed to a map function. Each item of the collection is then iterated over with the map function being applied to that iteration. The final result is a new collection where each of the original items is transformed by the map.

Reduce functions

Like maps, the reduce functions also work by applying a provided function to an iterable data structure. The key difference between the two is that the reduce function works to produce a single value from the input iterable. Using Python's built-in reduce() function, we can see how to produce a sum of integers, as follows:

numbers = [1, 2, 3, 4, 5]
sum = reduce(lambda x, y: x + y, numbers)
# sum == 15

You probably noticed that unlike our map operation, the reduce lambda has two parameters (x and y in this case). The argument passed to x will be the accumulated value of all applications of the function so far, and y will receive the next value to be added to the accumulation. Parenthetically, the order of operations can be seen as ((((1 + 2) + 3) + 4) + 5). Alternatively, the steps are shown in the following list:

x = 1, y = 2
x = 3, y = 3
x = 6, y = 4
x = 10, y = 5
x = 15

As this list demonstrates, the value of x is the cumulative sum of previous x and y values. As such, reduce functions are sometimes termed accumulate or fold functions. Regardless of their name, reduce functions serve the common purpose of combining pieces of a recursive data structure to produce a single value.

Couchbase MapReduce

Creating an index (or view) in Couchbase requires creating a map function written in JavaScript. 
When the view is created for the first time, the map function is applied to each document in the bucket containing the view. When you update a view, only new or modified documents are indexed. This behavior is known as incremental MapReduce. You can think of a basic map function in Couchbase as being similar to a SQL CREATE INDEX statement. Effectively, you are defining a column or a set of columns, to be indexed by the server. Of course, these are not columns, but rather properties of the documents to be indexed. Basic mapping To illustrate the process of creating a view, first imagine that we have a set of JSON documents as shown here: var books=[     { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow"     },     { "id": 2, "title": "The Godfather", "author": "Mario Puzzo"     },     { "id": 3, "title": "Wiseguy", "author": "Nicholas Pileggi"     } ]; Each document contains title and author properties. In Couchbase, to query these documents by either title or author, we'd first need to write a map function. Without considering how map functions are written in Couchbase, we're able to understand the process with vanilla JavaScript: books.map(function(book) {   return book.author; }); In the preceding snippet, we're making use of the built-in JavaScript array's map() function. Similar to the Python snippets we saw earlier, JavaScript's map() function takes a function as a parameter and returns a new array with mapped objects. In this case, we'll have an array with each book's author, as follows: ["Robert Ludlow", "Mario Puzzo", "Nicholas Pileggi"] At this point, we have a mapped collection that will be the basis for our author index. However, we haven't provided a means for the index to be able to refer back to its original document. If we were using a relational database, we'd have effectively created an index on the Title column with no way to get back to the row that contained it. With a slight modification to our map function, we are able to provide the key (the id property) of the document as well in our index: books.map(function(book) {   return [book.author, book.id]; }); In this slightly modified version, we're including the ID with the output of each author. In this way, the index has its document's key stored with its title. [["The Bourne Identity", 1], ["The Godfather", 2], ["Wiseguy", 3]] We'll soon see how this structure more closely resembles the values stored in a Couchbase index. Basic reducing Not every Couchbase index requires a reduce component. In fact, we'll see that Couchbase already comes with built-in reduce functions that will provide you with most of the reduce behavior you need. However, before relying on only those functions, it's important to understand why you'd use a reduce function in the first place. Returning to the preceding example of the map, let's imagine we have a few more documents in our set, as follows: var books=[     { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow"     },     { "id": 2, "title": "The Bourne Ultimatum", "author": "Robert Ludlow"     },     { "id": 3, "title": "The Godfather", "author": "Mario Puzzo"     },     { "id": 4, "title": "The Bourne Supremacy", "author": "Robert Ludlow"     },     { "id": 5, "title": "The Family", "author": "Mario Puzzo"     },  { "id": 6, "title": "Wiseguy", "author": "Nicholas Pileggi"     } ]; We'll still create our index using the same map function because it provides a way of accessing a book by its author. 
Now imagine that we want to know how many books an author has written, or (assuming we had more data) the average number of pages written by an author. These questions are not possible to answer with a map function alone. Each application of the map function knows nothing about the previous application. In other words, there is no way for you to compare or accumulate information about one author's book to another book by the same author. Fortunately, there is a solution to this problem. As you've probably guessed, it's the use of a reduce function. As a somewhat contrived example, consider this JavaScript: mapped = books.map(function (book) {     return ([book.id, book.author]); });   counts = {} reduced = mapped.reduce(function(prev, cur, idx, arr) { var key = cur[1];     if (! counts[key]) counts[key] = 0;     ++counts[key] }, null); This code doesn't quite accurately reflect the way you would count books with Couchbase but it illustrates the basic idea. You look for each occurrence of a key (author) and increment a counter when it is found. With Couchbase MapReduce, the mapped structure is supplied to the reduce() function in a better format. You won't need to keep track of items in a dictionary. Couchbase views At this point, you should have a general sense of what MapReduce is, where it came from, and how it will affect the creation of a Couchbase Server view. So without further ado, let's see how to write our first Couchbase view. In fact, there were two to choose from. The bucket we'll use is beer-sample. If you didn't install it, don't worry. You can add it by opening the Couchbase Console and navigating to the Settings tab. Here, you'll find the option to install the bucket, as shown next: First, you need to understand the document structures with which you're working. The following JSON object is a beer document (abbreviated for brevity): {  "name": "Sundog",  "type": "beer",  "brewery_id": "new_holland_brewing_company",  "description": "Sundog is an amber ale...",  "style": "American-Style Amber/Red Ale",  "category": "North American Ale" } As you can see, the beer documents have several properties. We're going to create an index to let us query these documents by name. In SQL, the query would look like this: SELECT Id FROM Beers WHERE Name = ? You might be wondering why the SQL example includes only the Id column in its projection. For now, just know that to query a document using a view with Couchbase, the property by which you're querying must be included in an index. To create that index, we'll write a map function. The simplest example of a map function to query beer documents by name is as follows: function(doc) {   emit(doc.name); } This body of the map function has only one line. It calls the built-in Couchbase emit() function. This function is used to signal that a value should be indexed. The output of this map function will be an array of names. The beer-sample bucket includes brewery data as well. These documents look like the following code (abbreviated for brevity): {   "name": "Thomas Hooker Brewing",   "city": "Bloomfield",   "state": "Connecticut",   "website": "http://www.hookerbeer.com/",   "type": "brewery" } If we reexamine our map function, we'll see an obvious problem; both the brewery and beer documents have a name property. When this map function is applied to the documents in the bucket, it will create an index with documents from either the brewery or beer documents. The problem is that Couchbase documents exist in a single container—the bucket. 
There is no namespace for a set of related documents. The solution has typically involved including a type or docType property on each document. The value of this property is used to distinguish one document from another. In the case of the beer-sample database, beer documents have type = "beer" and brewery documents have type = "brewery". Therefore, we are easily able to modify our map function to create an index only on beer documents: function(doc) {   if (doc.type == "beer") {     emit(doc.name);   } } The emit() function actually takes two arguments. The first, as we've seen, emits a value to be indexed. The second argument is an optional value and is used by the reduce function. Imagine that we want to count the number of beer types in a particular category. In SQL, we would write the following query: SELECT Category, COUNT(*) FROM Beers GROUP BY Category To achieve the same functionality with Couchbase Server, we'll need to use both map and reduce functions. First, let's write the map. It will create an index on the category property: function(doc) {   if (doc.type == "beer") {     emit(doc.category, 1);   } } The only real difference between our category index and our name index is that we're including an argument for the value parameter of the emit() function. What we'll do with that value is simply count them. This counting will be done in our reduce function: function(keys, values) {   return values.length; } In this example, the values parameter will be given to the reduce function as a list of all values associated with a particular key. In our case, for each beer category, there will be a list of ones (that is, [1, 1, 1, 1, 1, 1]). Couchbase also provides a built-in _count function. It can be used in place of the entire reduce function in the preceding example. Now that we've seen the basic requirements when creating an actual Couchbase view, it's time to add a view to our bucket. The easiest way to do so is to use the Couchbase Console. Summary In this article, you learned the purpose of secondary indexes in a key/value store. We dug deep into MapReduce, both in terms of its history in functional languages and as a tool for NoSQL and big data systems. Resources for Article: Further resources on this subject: Map Reduce? [article] Introduction to Mapreduce [article] Working with Apps Splunk [article]

Getting Started with PostgreSQL

Packt
03 Mar 2015
11 min read
In this article by Ibrar Ahmed, Asif Fayyaz, and Amjad Shahzad, authors of the book PostgreSQL Developer's Guide, we will come across the basic features and functions of PostgreSQL, such as writing queries using psql, data definition in tables, and data manipulation from tables. (For more resources related to this topic, see here.) PostgreSQL is widely considered to be one of the most stable database servers available today, with multiple features that include: A wide range of built-in types MVCC New SQL enhancements, including foreign keys, primary keys, and constraints Open source code, maintained by a team of developers Trigger and procedure support with multiple procedural languages Extensibility in the sense of adding new data types and the client language From the early releases of PostgreSQL (from version 6.0 that is), many changes have been made, with each new major version adding new and more advanced features. The current version is PostgreSQL 9.4 and is available from several sources and in various binary formats. Writing queries using psql Before proceeding, allow me to explain to you that throughout this article, we will use a warehouse database called warehouse_db. In this section, I will show you how you can create such a database, providing you with sample code for assistance. You will need to do the following: We are assuming here that you have successfully installed PostgreSQL and faced no issues. Now, you will need to connect with the default database that is created by the PostgreSQL installer. To do this, navigate to the default path of installation, which is /opt/PostgreSQL/9.4/bin from your command line, and execute the following command that will prompt for a postgres user password that you provided during the installation: /opt/PostgreSQL/9.4/bin$./psql -U postgres Password for user postgres: Using the following command, you can log in to the default database with the user postgres and you will be able to see the following on your command line: psql (9.4beta1) Type "help" for help postgres=# You can then create a new database called warehouse_db using the following statement in the terminal: postgres=# CREATE DATABASE warehouse_db; You can then connect with the warehouse_db database using the following command: postgres=# c warehouse_db You are now connected to the warehouse_db database as the user postgres, and you will have the following warehouse_db shell: warehouse_db=# Let's summarize what we have achieved so far. We are now able to connect with the default database postgres and created a warehouse_db database successfully. It's now time to actually write queries using psql and perform some Data Definition Language (DDL) and Data Manipulation Language (DML) operations, which we will cover in the following sections. In PostgreSQL, we can have multiple databases. Inside the databases, we can have multiple extensions and schemas. Inside each schema, we can have database objects such as tables, views, sequences, procedures, and functions. We are first going to create a schema named record and then we will create some tables in this schema. To create a schema named record in the warehouse_db database, use the following statement: warehouse_db=# CREATE SCHEMA record; Creating, altering, and truncating a table In this section, we will learn about creating a table, altering the table definition, and truncating the table. Creating tables Now, let's perform some DDL operations starting with creating tables. 
To create a table named warehouse_tbl, execute the following statements: warehouse_db=# CREATE TABLE warehouse_tbl ( warehouse_id INTEGER NOT NULL, warehouse_name TEXT NOT NULL, year_created INTEGER, street_address TEXT, city CHARACTER VARYING(100), state CHARACTER VARYING(2), zip CHARACTER VARYING(10), CONSTRAINT "PRIM_KEY" PRIMARY KEY (warehouse_id) ); The preceding statements created the table warehouse_tbl that has the primary key warehouse_id. Now, as you are familiar with the table creation syntax, let's create a sequence and use that in a table. You can create the hist_id_seq sequence using the following statement: warehouse_db=# CREATE SEQUENCE hist_id_seq; The preceding CREATE SEQUENCE command creates a new sequence number generator. This involves creating and initializing a new special single-row table with the name hist_id_seq. The user issuing the command will own the generator. You can now create the table that implements the hist_id_seq sequence using the following statement: warehouse_db=# CREATE TABLE history ( history_id INTEGER NOT NULL DEFAULT nextval('hist_id_seq'), date TIMESTAMP WITHOUT TIME ZONE, amount INTEGER, data TEXT, customer_id INTEGER, warehouse_id INTEGER, CONSTRAINT "PRM_KEY" PRIMARY KEY (history_id), CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id) REFERENCES warehouse_tbl(warehouse_id) ); The preceding query will create a history table in the warehouse_db database, and the history_id column uses the sequence as the default input value. In this section, we successfully learned how to create a table and also learned how to use a sequence inside the table creation syntax. Altering tables Now that we have learned how to create multiple tables, we can practice some ALTER TABLE commands by following this section. With the ALTER TABLE command, we can add, remove, or rename table columns. 
Firstly, with the help of the following example, we will be able to add the phone_no column in the previously created table warehouse_tbl: warehouse_db=# ALTER TABLE warehouse_tbl ADD COLUMN phone_no INTEGER; We can then verify that a column is added in the table by describing the table as follows: warehouse_db=# d warehouse_tbl            Table "public.warehouse_tbl"                  Column     |         Type         | Modifiers ----------------+------------------------+----------- warehouse_id  | integer               | not null warehouse_name | text                   | not null year_created   | integer               | street_address | text                   | city           | character varying(100) | state           | character varying(2)   | zip             | character varying(10) | phone_no       | integer               | Indexes: "PRIM_KEY" PRIMARY KEY, btree (warehouse_id) Referenced by: TABLE "history" CONSTRAINT "FORN_KEY"FOREIGN KEY  (warehouse_id) REFERENCES warehouse_tbl(warehouse_id) TABLE  "history" CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id)  REFERENCES warehouse_tbl(warehouse_id) To drop a column from a table, we can use the following statement: warehouse_db=# ALTER TABLE warehouse_tbl DROP COLUMN phone_no; We can then finally verify that the column has been removed from the table by describing the table again as follows: warehouse_db=# d warehouse_tbl            Table "public.warehouse_tbl"                  Column     |         Type         | Modifiers ----------------+------------------------+----------- warehouse_id   | integer               | not null warehouse_name | text                   | not null year_created   | integer               | street_address | text                   | city           | character varying(100) | state           | character varying(2)   | zip             | character varying(10) | Indexes: "PRIM_KEY" PRIMARY KEY, btree (warehouse_id) Referenced by: TABLE "history" CONSTRAINT "FORN_KEY" FOREIGN KEY  (warehouse_id) REFERENCES warehouse_tbl(warehouse_id) TABLE  "history" CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id)  REFERENCES warehouse_tbl(warehouse_id) Truncating tables The TRUNCATE command is used to remove all rows from a table without providing any criteria. In the case of the DELETE command, the user has to provide the delete criteria using the WHERE clause. To truncate data from the table, we can use the following statement: warehouse_db=# TRUNCATE TABLE warehouse_tbl; We can then verify that the warehouse_tbl table has been truncated by performing a SELECT COUNT(*) query on it using the following statement: warehouse_db=# SELECT COUNT(*) FROM warehouse_tbl; count -------      0 (1 row) Inserting, updating, and deleting data from tables In this section, we will play around with data and learn how to insert, update, and delete data from a table. Inserting data So far, we have learned how to create and alter a table. Now it's time to play around with some data. 
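One more point worth noting before we move on: the history table created earlier references warehouse_tbl through a foreign key, and PostgreSQL will normally refuse to truncate a table that other tables reference in this way. In such a case, either both tables can be listed in one statement or the CASCADE option can be used; the following lines are only a sketch:

warehouse_db=# TRUNCATE TABLE warehouse_tbl, history;
warehouse_db=# TRUNCATE TABLE warehouse_tbl CASCADE;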
Let's start by inserting records in the warehouse_tbl table using the following command snippet:

warehouse_db=# INSERT INTO warehouse_tbl
  (warehouse_id, warehouse_name, year_created, street_address, city, state, zip)
  VALUES (1, 'Mark Corp', 2009, '207-F Main Service Road East', 'New London', 'CT', '4321');

We can then verify that the record has been inserted by performing a SELECT query on the warehouse_tbl table as follows:

warehouse_db=# SELECT warehouse_id, warehouse_name, street_address FROM warehouse_tbl;
 warehouse_id | warehouse_name |        street_address
--------------+----------------+------------------------------
            1 | Mark Corp      | 207-F Main Service Road East
(1 row)

Updating data

Once we have inserted data in our table, we should know how to update it. This can be done using the following statement:

warehouse_db=# UPDATE warehouse_tbl SET year_created=2010 WHERE year_created=2009;

To verify that the record has been updated, let's perform a SELECT query on the warehouse_tbl table as follows:

warehouse_db=# SELECT warehouse_id, year_created FROM warehouse_tbl;
 warehouse_id | year_created
--------------+--------------
            1 |         2010
(1 row)

Deleting data

To delete data from a table, we can use the DELETE command. Let's add a few records to the table and then delete data on the basis of certain conditions:

warehouse_db=# INSERT INTO warehouse_tbl
  (warehouse_id, warehouse_name, year_created, street_address, city, state, zip)
  VALUES (2, 'Bill & Co', 2014, 'Lilly Road', 'New London', 'CT', '4321');

warehouse_db=# INSERT INTO warehouse_tbl
  (warehouse_id, warehouse_name, year_created, street_address, city, state, zip)
  VALUES (3, 'West point', 2013, 'Down Town', 'New London', 'CT', '4321');

We can then delete data from the warehouse_tbl table, where warehouse_name is Bill & Co, by executing the following statement:

warehouse_db=# DELETE FROM warehouse_tbl WHERE warehouse_name='Bill & Co';

To verify that the record has been deleted, we will execute the following SELECT query:

warehouse_db=# SELECT warehouse_id, warehouse_name FROM warehouse_tbl WHERE warehouse_name='Bill & Co';
 warehouse_id | warehouse_name
--------------+----------------
(0 rows)

The DELETE command is used to drop a row from a table, whereas the DROP command is used to drop a complete table. The TRUNCATE command is used to empty the whole table.

Summary

In this article, we learned how to utilize the SQL language for a collection of everyday DBMS exercises in an easy-to-use, practical way. We also figured out how to build a complete database using DDL (CREATE, ALTER, and TRUNCATE) and DML (INSERT, UPDATE, and DELETE) statements.

Resources for Article: Further resources on this subject: Indexes [Article] Improving proximity filtering with KNN [Article] Using Unrestricted Languages [Article]
PostgreSQL as an Extensible RDBMS

Packt
03 Mar 2015
18 min read
This article by Usama Dar, the author of the book PostgreSQL Server Programming - Second Edition, explains the process of creating a new operator, overloading it, optimizing it, creating index access methods, and much more. PostgreSQL is an extensible database. I hope you've learned this much by now. It is extensible by virtue of the design that it has. As discussed before, PostgreSQL uses a catalog-driven design. In fact, PostgreSQL is more catalog-driven than most of the traditional relational databases. The key benefit here is that the catalogs can be changed or added to, in order to modify or extend the database functionality. PostgreSQL also supports dynamic loading, that is, a user-written code can be provided as a shared library, and PostgreSQL will load it as required. (For more resources related to this topic, see here.) Extensibility is critical for many businesses, which have needs that are specific to that business or industry. Sometimes, the tools provided by the traditional database systems do not fulfill those needs. People in those businesses know best how to solve their particular problems, but they are not experts in database internals. It is often not possible for them to cook up their own database kernel or modify the core or customize it according to their needs. A truly extensible database will then allow you to do the following: Solve domain-specific problems in a seamless way, like a native solution Build complete features without modifying the core database engine Extend the database without interrupting availability PostgreSQL not only allows you to do all of the preceding things, but also does these, and more with utmost ease. In terms of extensibility, you can do the following things in a PostgreSQL database: Create your own data types Create your own functions Create your own aggregates Create your own operators Create your own index access methods (operator classes) Create your own server programming language Create foreign data wrappers (SQL/MED) and foreign tables What can't be extended? Although PostgreSQL is an extensible platform, there are certain things that you can't do or change without explicitly doing a fork, as follows: You can't change or plug in a new storage engine. If you are coming from the MySQL world, this might annoy you a little. However, PostgreSQL's storage engine is tightly coupled with its executor and the rest of the system, which has its own benefits. You can't plug in your own planner/parser. One can argue for and against the ability to do that, but at the moment, the planner, parser, optimizer, and so on are baked into the system and there is no possibility of replacing them. There has been some talk on this topic, and if you are of the curious kind, you can read some of the discussion at http://bit.ly/1yRMkK7. We will now briefly discuss some more of the extensibility capabilities of PostgreSQL. We will not dive deep into the topics, but we will point you to the appropriate link where more information can be found. Creating a new operator Now, let's take look at how we can add a new operator in PostgreSQL. Adding new operators is not too different from adding new functions. In fact, an operator is syntactically just a different way to use an existing function. For example, the + operator calls a built-in function called numeric_add and passes it the two arguments. When you define a new operator, you must define the data types that the operator expects as arguments and define which function is to be called. 
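The operator examples that follow reuse a fib function (a Fibonacci calculator) that the book defines in an earlier chapter and that is not shown in this excerpt. As a rough idea of what such a function could look like, here is a minimal PL/pgSQL sketch; this is an assumption for illustration only, not necessarily the author's exact definition:

CREATE OR REPLACE FUNCTION fib(n integer)
RETURNS integer AS $$
DECLARE
  a integer := 0;  -- fib(0)
  b integer := 1;  -- fib(1)
  t integer;
BEGIN
  -- Iterate n times; after the loop, a holds fib(n), so fib(12) returns 144
  FOR i IN 1..n LOOP
    t := a + b;
    a := b;
    b := t;
  END LOOP;
  RETURN a;
END;
$$ LANGUAGE 'plpgsql' IMMUTABLE;

Any function with a matching signature would work just as well; the point is simply that the operator defined next delegates its work to it.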
Let's take a look at how to define a simple operator. You have to use the CREATE OPERATOR command to create an operator. Let's use that function to create a new Fibonacci operator, ##, which will have an integer on its left-hand side: CREATE OPERATOR ## (PROCEDURE=fib, LEFTARG=integer); Now, you can use this operator in your SQL to calculate a Fibonacci number: testdb=# SELECT 12##;?column?----------144(1 row) Note that we defined that the operator will have an integer on the left-hand side. If you try to put a value on the right-hand side of the operator, you will get an error: postgres=# SELECT ##12;ERROR: operator does not exist: ## integer at character 8HINT: No operator matches the given name and argument type(s). Youmight need to add explicit type casts.STATEMENT: select ##12;ERROR: operator does not exist: ## integerLINE 1: select ##12;^HINT: No operator matches the given name and argument type(s). Youmight need to add explicit type casts. Overloading an operator Operators can be overloaded in the same way as functions. This means, that an operator can have the same name as an existing operator but with a different set of argument types. More than one operator can have the same name, but two operators can't share the same name if they accept the same types and positions of the arguments. As long as there is a function that accepts the same kind and number of arguments that an operator defines, it can be overloaded. Let's override the ## operator we defined in the last example, and also add the ability to provide an integer on the right-hand side of the operator: CREATE OPERATOR ## (PROCEDURE=fib, RIGHTARG=integer); Now, running the same SQL, which resulted in an error last time, should succeed, as shown here: testdb=# SELECT ##12;?column?----------144(1 row) You can drop the operator using the DROP OPERATOR command. You can read more about creating and overloading new operators in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/sql-createoperator.html and http://www.postgresql.org/docs/current/static/xoper.html. There are several optional clauses in the operator definition that can optimize the execution time of the operators by providing information about operator behavior. For example, you can specify the commutator and the negator of an operator that help the planner use the operators in index scans. You can read more about these optional clauses at http://www.postgresql.org/docs/current/static/xoper-optimization.html. Since this article is just an introduction to the additional extensibility capabilities of PostgreSQL, we will just introduce a couple of optimization options; any serious production quality operator definitions should include these optimization clauses, if applicable. Optimizing operators The optional clauses tell the PostgreSQL server about how the operators behave. These options can result in considerable speedups in the execution of queries that use the operator. However, if you provide these options incorrectly, it can result in a slowdown of the queries. Let's take a look at two optimization clauses called commutator and negator. COMMUTATOR This clause defines the commuter of the operator. An operator A is a commutator of operator B if it fulfils the following condition: x A y = y B x. It is important to provide this information for the operators that will be used in indexes and joins. As an example, the commutator for > is <, and the commutator of = is = itself. 
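Before looking at those clauses, a quick sanity check (a sketch, assuming the fib function and the two ## definitions above) confirms that the postfix and prefix forms now coexist:

testdb=# SELECT (12##), (##12);

Both values should come back as 144, matching the results shown earlier.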
This helps the optimizer to flip the operator in order to use an index. For example, consider the following query: SELECT * FROM employee WHERE new_salary > salary; If the index is defined on the salary column, then PostgreSQL can rewrite the preceding query as shown: SELECT * from employee WHERE salary < new_salary This allows PostgreSQL to use a range scan on the index column salary. For a user-defined operator, the optimizer can only do this flip around if the commutator of a user-defined operator is defined: CREATE OPERATOR > (LEFTARG=integer, RIGHTARG=integer, PROCEDURE=comp, COMMUTATOR = <) NEGATOR The negator clause defines the negator of the operator. For example, <> is a negator of =. Consider the following query: SELECT * FROM employee WHERE NOT (dept = 10); Since <> is defined as a negator of =, the optimizer can simplify the preceding query as follows: SELECT * FROM employee WHERE dept <> 10; You can even verify that using the EXPLAIN command: postgres=# EXPLAIN SELECT * FROM employee WHERE NOTdept = 'WATER MGMNT';QUERY PLAN---------------------------------------------------------Foreign Scan on employee (cost=0.00..1.10 rows=1 width=160)Filter: ((dept)::text <> 'WATER MGMNT'::text)Foreign File: /Users/usamadar/testdata.csvForeign File Size: 197(4 rows) Creating index access methods Let's discuss how to index new data types or user-defined types and operators. In PostgreSQL, an index is more of a framework that can be extended or customized for using different strategies. In order to create new index access methods, we have to create an operator class. Let's take a look at a simple example. Let's consider a scenario where you have to store some special data such as an ID or a social security number in the database. The number may contain non-numeric characters, so it is defined as a text type: CREATE TABLE test_ssn (ssn text);INSERT INTO test_ssn VALUES ('222-11-020878');INSERT INTO test_ssn VALUES ('111-11-020978'); Let's assume that the correct order for this data is such that it should be sorted on the last six digits and not the ASCII value of the string. The fact that these numbers need a unique sort order presents a challenge when it comes to indexing the data. This is where PostgreSQL operator classes are useful. An operator allows a user to create a custom indexing strategy. Creating an indexing strategy is about creating your own operators and using them alongside a normal B-tree. Let's start by writing a function that changes the order of digits in the value and also gets rid of the non-numeric characters in the string to be able to compare them better: CREATE OR REPLACE FUNCTION fix_ssn(text)RETURNS text AS $$BEGINRETURN substring($1,8) || replace(substring($1,1,7),'-','');END;$$LANGUAGE 'plpgsql' IMMUTABLE; Let's run the function and verify that it works: testdb=# SELECT fix_ssn(ssn) FROM test_ssn;fix_ssn-------------0208782221102097811111(2 rows) Before an index can be used with a new strategy, we may have to define some more functions depending on the type of index. 
In our case, we are planning to use a simple B-tree, so we need a comparison function: CREATE OR REPLACE FUNCTION ssn_compareTo(text, text)RETURNS int AS$$BEGINIF fix_ssn($1) < fix_ssn($2)THENRETURN -1;ELSIF fix_ssn($1) > fix_ssn($2)THENRETURN +1;ELSERETURN 0;END IF;END;$$ LANGUAGE 'plpgsql' IMMUTABLE; It's now time to create our operator class: CREATE OPERATOR CLASS ssn_opsFOR TYPE text USING btreeASOPERATOR 1 < ,OPERATOR 2 <= ,OPERATOR 3 = ,OPERATOR 4 >= ,OPERATOR 5 > ,FUNCTION 1 ssn_compareTo(text, text); You can also overload the comparison operators if you need to compare the values in a special way, and use the functions in the compareTo function as well as provide them in the CREATE OPERATOR CLASS command. We will now create our first index using our brand new operator class: CREATE INDEX idx_ssn ON test_ssn (ssn ssn_ops); We can check whether the optimizer is willing to use our special index, as follows: testdb=# SET enable_seqscan=off;testdb=# EXPLAIN SELECT * FROM test_ssn WHERE ssn = '02087822211';QUERY PLAN------------------------------------------------------------------Index Only Scan using idx_ssn on test_ssn (cost=0.13..8.14 rows=1width=32)Index Cond: (ssn = '02087822211'::text)(2 rows) Therefore, we can confirm that the optimizer is able to use our new index. You can read about index access methods in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/xindex.html. Creating user-defined aggregates User-defined aggregate functions are probably a unique PostgreSQL feature, yet they are quite obscure and perhaps not many people know how to create them. However, once you are able to create this function, you will wonder how you have lived for so long without using this feature. This functionality can be incredibly useful, because it allows you to perform custom aggregates inside the database, instead of querying all the data from the client and doing a custom aggregate in your application code, that is, the number of hits on your website per minute from a specific country. PostgreSQL has a very simple process for defining aggregates. Aggregates can be defined using any functions and in any languages that are installed in the database. Here are the basic steps to building an aggregate function in PostgreSQL: Define a start function that will take in the values of a result set; this function can be defined in any PL language you want. Define an end function that will do something with the final output of the start function. This can be in any PL language you want. Define the aggregate using the CREATE AGGREGATE command, providing the start and end functions you just created. Let's steal an example from the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Aggregate_Median. In this example, we will calculate the statistical median of a set of data. For this purpose, we will define start and end aggregate functions. Let's define the end function first, which takes an array as a parameter and calculates the median. 
We are assuming here that our start function will pass an array to the following end function: CREATE FUNCTION _final_median(anyarray) RETURNS float8 AS $$WITH q AS(SELECT valFROM unnest($1) valWHERE VAL IS NOT NULLORDER BY 1),cnt AS(SELECT COUNT(*) AS c FROM q)SELECT AVG(val)::float8FROM(SELECT val FROM qLIMIT 2 - MOD((SELECT c FROM cnt), 2)OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1,0)) q2;$$ LANGUAGE sql IMMUTABLE; Now, we create the aggregate as shown in the following code: CREATE AGGREGATE median(anyelement) (SFUNC=array_append,STYPE=anyarray,FINALFUNC=_final_median,INITCOND='{}'); The array_append start function is already defined in PostgreSQL. This function appends an element to the end of an array. In our example, the start function takes all the column values and creates an intermediate array. This array is passed on to the end function, which calculates the median. Now, let's create a table and some test data to run our function: testdb=# CREATE TABLE median_test(t integer);CREATE TABLEtestdb=# INSERT INTO median_test SELECT generate_series(1,10);INSERT 0 10 The generate_series function is a set returning function that generates a series of values, from start to stop with a step size of one. Now, we are all set to test the function: testdb=# SELECT median(t) FROM median_test;median--------5.5(1 row) The mechanics of the preceding example are quite easy to understand. When you run the aggregate, the start function is used to append all the table data from column t into an array using the append_array PostgreSQL built-in. This array is passed on to the final function, _final_median, which calculates the median of the array and returns the result in the same data type as the input parameter. This process is done transparently to the user of the function who simply has a convenient aggregate function available to them. You can read more about the user-defined aggregates in the PostgreSQL documentation in much more detail at http://www.postgresql.org/docs/current/static/xaggr.html. Using foreign data wrappers PostgreSQL foreign data wrappers (FDW) are an implementation of SQL Management of External Data (SQL/MED), which is a standard added to SQL in 2013. FDWs are drivers that allow PostgreSQL database users to read and write data to other external data sources, such as other relational databases, NoSQL data sources, files, JSON, LDAP, and even Twitter. You can query the foreign data sources using SQL and create joins across different systems or even across different data sources. There are several different types of data wrappers developed by different developers and not all of them are production quality. You can see a select list of wrappers on the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Foreign_data_wrappers. Another list of FDWs can be found on PGXN at http://pgxn.org/tag/fdw/. Let's take look at a small example of using file_fdw to access data in a CSV file. First, you need to install the file_fdw extension. If you compiled PostgreSQL from the source, you will need to install the file_fdw contrib module that is distributed with the source. You can do this by going into the contrib/file_fdw folder and running make and make install. If you used an installer or a package for your platform, this module might have been installed automatically. 
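If you are not sure whether the module is present in your installation, one quick check (a small sketch) is to query the pg_available_extensions catalog view before trying to create the extension:

postgres=# SELECT name, default_version, installed_version
           FROM pg_available_extensions
           WHERE name = 'file_fdw';

A row with an empty installed_version means the module is available but the extension has not yet been created in the current database.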
Once the file_fdw module is installed, you will need to create the extension in the database:

postgres=# CREATE EXTENSION file_fdw;
CREATE EXTENSION

Let's now create a sample CSV file that uses the pipe, |, as a separator and contains some employee data:

$ cat testdata.csv
AARON, ELVIA J|WATER RATE TAKER|WATER MGMNT|81000.00|73862.00
AARON, JEFFERY M|POLICE OFFICER|POLICE|74628.00|74628.00
AARON, KIMBERLEI R|CHIEF CONTRACT EXPEDITER|FLEET MANAGEMNT|77280.00|70174.00

Now, we should create a foreign server, which is pretty much a formality here because the file is on the same server. A foreign server normally contains the connection information that a foreign data wrapper uses to access an external data resource. The server needs to be unique within the database:

CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;

The next step is to create a foreign table that encapsulates our CSV file:

CREATE FOREIGN TABLE employee (
  emp_name VARCHAR,
  job_title VARCHAR,
  dept VARCHAR,
  salary NUMERIC,
  sal_after_tax NUMERIC
) SERVER file_server
OPTIONS (format 'csv', header 'false', filename '/home/pgbook/14/testdata.csv', delimiter '|', null '');

The CREATE FOREIGN TABLE command creates a foreign table, and the specifications of the file are provided in the OPTIONS section of the preceding code. You provide the format and indicate whether the first line of the file is a header (header 'false'); in our case, there is no file header. We then provide the name and path of the file and the delimiter used in the file, which in our case is the pipe symbol |. In this example, we also specify that null values should be represented as an empty string. Let's run a SQL command on our foreign table:

postgres=# SELECT * FROM employee;
-[ RECORD 1 ]-+-------------------------
emp_name      | AARON, ELVIA J
job_title     | WATER RATE TAKER
dept          | WATER MGMNT
salary        | 81000.00
sal_after_tax | 73862.00
-[ RECORD 2 ]-+-------------------------
emp_name      | AARON, JEFFERY M
job_title     | POLICE OFFICER
dept          | POLICE
salary        | 74628.00
sal_after_tax | 74628.00
-[ RECORD 3 ]-+-------------------------
emp_name      | AARON, KIMBERLEI R
job_title     | CHIEF CONTRACT EXPEDITER
dept          | FLEET MANAGEMNT
salary        | 77280.00
sal_after_tax | 70174.00

Great, it looks like our data has been successfully loaded from the file. You can also use the \d meta command to see the structure of the employee table:

postgres=# \d employee
        Foreign table "public.employee"
    Column     |       Type        | Modifiers | FDW Options
---------------+-------------------+-----------+-------------
 emp_name      | character varying |           |
 job_title     | character varying |           |
 dept          | character varying |           |
 salary        | numeric           |           |
 sal_after_tax | numeric           |           |
Server: file_server
FDW Options: (format 'csv', header 'false', filename '/home/pg_book/14/testdata.csv', delimiter '|', "null" '')

You can run EXPLAIN on the query to understand what is going on when you run a query on the foreign table:

postgres=# EXPLAIN SELECT * FROM employee WHERE salary > 5000;
                         QUERY PLAN
---------------------------------------------------------
 Foreign Scan on employee  (cost=0.00..1.10 rows=1 width=160)
   Filter: (salary > 5000::numeric)
   Foreign File: /home/pgbook/14/testdata.csv
   Foreign File Size: 197
(4 rows)

The ALTER FOREIGN TABLE command can be used to modify these options (a short sketch follows below). More information about file_fdw is available at http://www.postgresql.org/docs/current/static/file-fdw.html. You can take a look at the CREATE SERVER and CREATE FOREIGN TABLE commands in the PostgreSQL documentation for more information on the many options available.
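For instance, if the file were later regenerated with a comma separator, or moved to a different path, a sketch of the change would look like this (both new values are purely illustrative, and employee is the foreign table defined above):

postgres=# ALTER FOREIGN TABLE employee OPTIONS (SET delimiter ',');
postgres=# ALTER FOREIGN TABLE employee OPTIONS (SET filename '/home/pgbook/14/testdata_new.csv');

SET replaces the value of an existing option; ADD and DROP are also available for options that have not been set yet or are no longer needed.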
Each of the foreign data wrappers comes with its own documentation about how to use the wrapper. Make sure that an extension is stable enough before it is used in production. The PostgreSQL core development group does not support most of the FDW extensions. If you want to create your own data wrappers, you can find the documentation at http://www.postgresql.org/docs/current/static/fdwhandler.html, which is an excellent starting point. The best way to learn, however, is to read the code of other available extensions.

Summary

In this article, we looked at some of the ways in which PostgreSQL can be extended. These include the ability to add new operators and new index access methods, and to create your own aggregates. You can access foreign data sources, such as other databases, files, and web services, using PostgreSQL foreign data wrappers. These wrappers are provided as extensions and should be used with caution, as most of them are not officially supported. Even though PostgreSQL is very extensible, you can't plug in a new storage engine or change the parser/planner and executor interfaces. These components are very tightly coupled with each other and are, therefore, highly optimized and mature.

Resources for Article: Further resources on this subject: Load balancing MSSQL [Article] Advanced SOQL Statements [Article] Running a PostgreSQL Database Server [Article]
Introducing Splunk

Packt
03 Mar 2015
14 min read
In this article by Betsy Page Sigman, author of the book Splunk Essentials, Splunk, whose "name was inspired by the process of exploring caves, or splunking, helps analysts, operators, programmers, and many others explore data from their organizations by obtaining, analyzing, and reporting on it. This multinational company, cofounded by Michael Baum, Rob Das, and Erik Swan, has a core product called "Splunk Enterprise. This manages searches, inserts, deletes, and filters, and analyzes big data that is generated by machines, as well as other types of data. "They also have a free version that has most of the capabilities of Splunk Enterprise and is an excellent learning tool. (For more resources related to this topic, see here.) Understanding events, event types, and fields in Splunk An understanding of events and event types is important before going further. Events In Splunk, an event is not just one of" the many local user meetings that are set up between developers to help each other out (although those can be very useful), "but also refers to a record of one activity that is recorded in a log file. Each event usually has: A timestamp indicating the date and exact time the event was created Information about what happened on the system that is being tracked Event types An event type is a way to allow "users to categorize similar events. It is field-defined by the user. You can define an event type in several ways, and the easiest way is by using the SplunkWeb interface. One common reason for setting up an event type is to examine why a system has failed. Logins are often problematic for systems, and a search for failed logins can help pinpoint problems. For an interesting example of how to save "a search on failed logins as an event type, visit http://docs.splunk.com/Documentation/Splunk/6.1.3/Knowledge/ClassifyAndGroupSimilarEvents#Save_a_search_as_a_new_event_type. Why are events and event types so important in Splunk? Because without events, there would be nothing to search, of course. And event types allow us to make meaningful searches easily and quickly according to our needs, as we'll see later. Sourcetypes Sourcetypes are also "important to understand, as they help define the rules for an event. A sourcetype is one of the default fields that Splunk assigns to data as it comes into the system. It determines what type of data it is so that Splunk can format it appropriately as it indexes it. This also allows the user who wants to search the "data to easily categorize it. Some of the common sourcetypes are listed as follows: access_combined, for "NCSA combined format HTTP web server logs apache_error, for standard "Apache web server error logs cisco_syslog, for the "standard syslog produced by Cisco network devices (including PIX firewalls, routers, and ACS), usually via remote syslog to a central log host websphere_core, a core file" export from WebSphere (Source: http://docs.splunk.com/Documentation/Splunk/latest/Data/Whysourcetypesmatter) Fields Each event in Splunk is" associated with a number of fields. The core fields of host, course, sourcetype, and timestamp are key to Splunk. These fields are extracted from events at multiple points in the data processing pipeline that Splunk uses, and each of these fields includes a name and a value. The name describes the field (such as the userid) and the value says what that field's value is (susansmith, for example). Some of these fields are default fields that are given because of where the event came from or what it is. 
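To make this concrete, a search that filters on a field name/value pair might look like the following sketch, which combines the userid example just mentioned with one of the common sourcetypes listed earlier (both names are purely illustrative and assume such a field has been extracted for that sourcetype):

sourcetype=access_combined userid=susansmith

This would return only the events whose userid field has the value susansmith.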
When data is processed by Splunk, and when it is indexed or searched, it uses these fields. For indexing, the default fields added include those of host, source, and sourcetype. When searching, Splunk is able to select from a bevy of fields that can either be defined by the user or are very basic, such as action results in a purchase (for a website event). Fields are essential for doing the basic work of Splunk – that is, indexing and searching. Getting data into Splunk It's time to spring into action" now and input some data into Splunk. Adding data is "simple, easy, and quick. In this section, we will use some data and tutorials created by Splunk to learn how to add data: Firstly, to obtain your data, visit the tutorial data at http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk that is readily available on Splunk. Here, download the folder tutorialdata.zip. Note that this will be a fresh dataset that has been collected over the last 7 days. Download it but don't extract the data from it just yet. You then need to log in to Splunk, using admin as the username and then by using your password. Once logged in, you will notice that toward the upper-right corner of your screen is the button Add Data, as shown in the following screenshot. Click "on this button: Button to Add Data Once you have "clicked on this button, you'll see a screen" similar to the "following screenshot: Add Data to Splunk by Choosing a Data Type or Data Source Notice here the "different types of data that you can select, as "well as the different data sources. Since the data we're going to use is a file, under "Or Choose a Data Source, click on From files and directories. Once you have clicked on this, you can then click on the radio button next to Skip preview, as indicated in the following screenshot, since you don't need to preview the data" now. You then need to click on "Continue: Preview data You can download the tutorial files at: http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk As shown in the next screenshot, click on Upload and index a file, find the tutorialdata.zip file you just downloaded (it is probably in your Downloads folder), and then click on More settings, filling it in as shown in the following screenshot. (Note that you will need to select Segment in path under Host and type 1 under Segment Number.) Click on Save when you are done: Can specify source, additional settings, and source type Following this, you "should see a screen similar to the following" screenshot. Click on Start Searching, we will look at the data now: You should see this if your data has been successfully indexed into Splunk. You will now" see a screen similar to the following" screenshot. Notice that the number of events you have will be different, as will the time of the earliest event. At this point, click on Data Summary: The Search screen You should see the Data Summary screen like in the following screenshot. However, note that the Hosts shown here will not be the same as the ones you get. Take a quick look at what is on the Sources tab and the Sourcetypes tab. Then find the most recent data (in this case 127.0.0.1) and click on it. Data Summary, where you can see Hosts, Sources, and Sourcetypes After" clicking on the most recent data, which in "this case is bps-T341s, look at the events contained there. Later, when we use streaming data, we can see how the events at the top of this list change rapidly. 
Here, you will see a listing of events, similar to those shown in the "following screenshot: Events lists for the host value You can click on the Splunk logo in the upper-left corner "of the web page to return to the home page. Under Administrator at the "top-right of the page, click on Logout. Searching Twitter data We will start here by doing a simple search of our Twitter index, which is automatically created by the app once you have enabled Twitter input (as explained previously). In our earlier searches, we used the default index (which the tutorial data was downloaded to), so we didn't have to specify the index we wanted to use. Here, we will use just the Twitter index, so we need to specify that in the search. A simple search Imagine that we wanted to search for tweets containing the word coffee. We could use the code presented here and place it in the search bar: index=twitter text=*coffee* The preceding code searches only your Twitter index and finds all the places where the word coffee is mentioned. You have to put asterisks there, otherwise you will only get the tweets with just "coffee". (Note that the text field is not case sensitive, so tweets with either "coffee" or "Coffee" will be included in the search results.) The asterisks are included before and after the text "coffee" because otherwise we would only get events where just "coffee" was tweeted – a rather rare occurrence, we expect. In fact, when we search our indexed Twitter data without the asterisks around coffee, we got no results. Examining the Twitter event Before going further, it is useful to stop and closely examine the events that are collected as part of the search. The sample tweet shown in the following screenshot shows the large number of fields that are part of each tweet. The > was clicked to expand the event: A Twitter event There are several items to look closely at here: _time: Splunk assigns a timestamp for every event. This is done in UTC (Coordinated Universal Time) time format. contributors: The value for this field is null, as are the values of many Twitter fields. Retweeted_status: Notice the {+} here; in the following event list, you will see there are a number of fields associated with this, which can be seen when the + is selected and the list is expanded. This is the case wherever you see a {+} in a list of fields: Various retweet fields In addition to those shown previously, there are many other fields associated with a tweet. The 140 character (maximum) text field that most people consider to be the tweet is actually a small part of the actual data collected. The implied AND If you want to search on more than one term, there is no need to add AND as it is already implied. If, for example, you want to search for all tweets that include both the text "coffee" and the text "morning", then use: index=twitter text=*coffee* text=*morning* If you don't specify text= for the second term and just put *morning*, Splunk assumes that you want to search for *morning* in any field. Therefore, you could get that word in another field in an event. This isn't very likely in this case, although coffee could conceivably be part of a user's name, such as "coffeelover". But if you were searching for other text strings, such as a computer term like log or error, such terms could be found in a number of fields. So specifying the field you are interested in would be very important. The need to specify OR Unlike AND, you must always specify the word OR. 
For example, to obtain all events that mention either coffee or morning, enter: index=twitter text=*coffee* OR text=*morning* Finding other words used Sometimes you might want to find out what other words are used in tweets about coffee. You can do that with the following search: index=twitter text=*coffee* | makemv text | mvexpand text | top 30 text This search first searches for the word "coffee" in a text field, then creates a multivalued field from the tweet, and then expands it so that each word is treated as a separate piece of text. Then it takes the top 30 words that it finds. You might be asking yourself how you would use this kind of information. This type of analysis would be of interest to a marketer, who might want to use words that appear to be associated with coffee in composing the script for an advertisement. The following screenshot shows the results that appear (1 of 2 pages). From this search, we can see that the words love, good, and cold might be words worth considering: Search of top 30 text fields found with *coffee* When you do a search like this, you will notice that there are a lot of filler words (a, to, for, and so on) that appear. You can do two things to remedy this. You can increase the limit for top words so that you can see more of the words that come up, or you can rerun the search using the following code. "Coffee" (with a capital C) is listed (on the unshown second page) separately here from "coffee". The reason for this is that while the search is not case sensitive (thus both "coffee" and "Coffee" are picked up when you search on "coffee"), the process of putting the text fields through the makemv and the mvexpand processes ends up distinguishing on the basis of case. We could rerun the search, excluding some of the filler words, using the code shown here: index=twitter text=*coffee* | makemv text | mvexpand text |search NOT text="RT" AND NOT text="a" AND NOT text="to" ANDNOT text="the" | top 30 text Using a lookup table Sometimes it is useful to use a lookup file to avoid having to use repetitive code. It would help us to have a list of all the small words that might be found often in a tweet just by the nature of each word's frequent use in language, so that we might eliminate them from our quest to find words that would be relevant for use in the creation of advertising. If we had a file of such small words, we could use a command indicating not to use any of these more common, irrelevant words when listing the top 30 words associated with our search topic of interest. Thus, for our search for words associated with the text "coffee", we would be interested in words like " dark", "flavorful", and "strong", but not words like "a", "the", and "then". We can do this using a lookup command. There are three types of lookup commands, which are presented in the following table: Command Description lookup Matches a value of one field with a value of another, based on a .csv file with the two fields. Consider a lookup table named lutable that contains fields for machine_name and owner. Consider what happens when the following code snippet is used after a preceding search (indicated by . . . |): . . . | lookup lutable owner Splunk will use the lookup table to match the owner's name with its machine_name and add the machine_name to each event. inputlookup All fields in the .csv file are returned as results. If the following code snippet is used, both machine_name and owner would be searched: . . . 
| inputlookup lutable outputlookup This code outputs search results to a lookup table. The following code outputs results from the preceding research directly into a table it creates: . . . | outputlookup newtable.csv saves The command we will use here is inputlookup, because we want to reference a .csv file we can create that will include words that we want to filter out as we seek to find possible advertising words associated with coffee. Let's call the .csv file filtered_words.csv, and give it just a single text field, containing words like "is", "the", and "then". Let's rewrite the search to look like the following code: index=twitter text=*coffee*| makemv text | mvexpand text| search NOT [inputlookup filtered_words | fields text ]| top 30 text Using the preceding code, Splunk will search our Twitter index for *coffee*, and then expand the text field so that individual words are separated out. Then it will look for words that do NOT match any of the words in our filtered_words.csv file, and finally output the top 30 most frequently found words among those. As you can see, the lookup table can be very useful. To learn more about Splunk lookup tables, go to http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchReference/Lookup. Summary In this article, we have learned more about how to use Splunk to create reports, dashboards. Splunk Enterprise Software, or Splunk, is an extremely powerful tool for searching, exploring, and visualizing data of all types. Splunk is becoming increasingly popular, as more and more businesses, both large and small, discover its ease and usefulness. Analysts, managers, students, and others can quickly learn how to use the data from their systems, networks, web traffic, and social media to make attractive and informative reports. This is a straightforward, practical, and quick introduction to Splunk that should have you making reports and gaining insights from your data in no time. Resources for Article: Further resources on this subject: Lookups [article] Working with Apps in Splunk [article] Loading data, creating an app, and adding dashboards and reports in Splunk [article]
A Quick Start Guide to Flume

Packt
02 Mar 2015
15 min read
In this article by Steve Hoffman, the author of the book, Apache Flume: Distributed Log Collection for Hadoop - Second Edition, we will learn about the basics that are required to be known before we start working with Apache Flume. This article will help you get started with Flume. So, let's start with the first step: downloading and configuring Flume. (For more resources related to this topic, see here.) Downloading Flume Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available along with the checksum and GPG signature files used to verify the archives. Instructions to verify the download are on the website, so I won't cover them here. Checking the checksum file contents against the actual checksum verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea and it is recommended by Apache that you do so. If you choose not to, I won't tell. The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source and the compiled Flume components (jars, javadocs, and so on), but also all the dependent Java libraries. The binary package contains the same Maven POM file as the source archive, so you can always recompile the code even if you start with the binary distribution. Go ahead, download and verify the binary distribution to save us some time in getting started. Flume in Hadoop distributions Flume is available with some Hadoop distributions. The distributions supposedly provide bundles of Hadoop's core components and satellite projects (such as Flume) in a way that ensures things such as version compatibility and additional bug fixes are taken into account. These distributions aren't better or worse; they're just different. There are benefits to using a distribution. Someone else has already done the work of pulling together all the version-compatible components. Today, this is less of an issue since the Apache BigTop project started (http://bigtop.apache.org/). Nevertheless, having prebuilt standard OS packages, such as RPMs and DEBs, ease installation as well as provide startup/shutdown scripts. Each distribution has different levels of free and paid options, including paid professional services if you really get into a situation you just can't handle. There are downsides, of course. The version of Flume bundled in a distribution will often lag quite a bit behind the Apache releases. If there is a new or bleeding-edge feature you are interested in using, you'll either be waiting for your distribution's provider to backport it for you, or you'll be stuck patching it yourself. Furthermore, while the distribution providers do a fair amount of testing, such as any general-purpose platform, you will most likely encounter something that their testing didn't cover, in which case, you are still on the hook to come up with a workaround or dive into the code, fix it, and hopefully, submit that patch back to the open source community (where, at a future point, it'll make it into an update of your distribution or the next version). So, things move slower in a Hadoop distribution world. You can see that as good or bad. 
Usually, large companies don't like the instability of bleeding-edge technology or making changes often, as change can be the most common cause of unplanned outages. You'd be hard pressed to find such a company using the bleeding-edge Linux kernel rather than something like Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu LTS, or any of the other distributions whose target is stability and compatibility. If you are a startup building the next Internet fad, you might need that bleeding-edge feature to get a leg up on the established competition. If you are considering a distribution, do the research and see what you are getting (or not getting) with each. Remember that each of these offerings is hoping that you'll eventually want and/or need their Enterprise offering, which usually doesn't come cheap. Do your homework. Here's a short, nondefinitive list of some of the more established players. For more information, refer to the following links: Cloudera: http://cloudera.com/ Hortonworks: http://hortonworks.com/ MapR: http://mapr.com/ An overview of the Flume configuration file Now that we've downloaded Flume, let's spend some time going over how to configure an agent. A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you will need to additionally pass an agent identifier (called a name) so that it knows which configurations to use. In my examples where I'm only specifying one agent, I'm going to use the name agent. By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except in development situations perhaps). If you use the Cloudera distribution, the passing of this flag is currently not possible. I've opened a ticket to fix that at https://issues.cloudera.org/browse/DISTRO-648. If this is important to you, please vote it up. Each agent is configured, starting with three parameters: agent.sources=<list of sources>agent.channels=<list of channels>agent.sinks=<list of sinks> Each source, channel, and sink also has a unique name within the context of that agent. For example, if I'm going to transport my Apache access logs, I might define a channel named access. The configurations for this channel would all start with the agent.channels.access prefix. Each configuration item has a type property that tells Flume what kind of source, channel, or sink it is. In this case, we are going to use an in-memory channel whose type is memory. The complete configuration for the channel named access in the agent named agent would be: agent.channels.access.type=memory Any arguments to a source, channel, or sink are added as additional properties using the same prefix. The memory channel has a capacity parameter to indicate the maximum number of Flume events it can hold. Let's say we didn't want to use the default value of 100; our configuration would now look like this: agent.channels.access.type=memoryagent.channels.access.capacity=200 Finally, we need to add the access channel name to the agent.channels property so that the agent knows to load it: agent.channels=access Let's look at a complete example using the canonical "Hello, World!" example. 
Starting up with "Hello, World!" No technical article would be complete without a "Hello, World!" example. Here is the configuration file we'll be using: agent.sources=s1agent.channels=c1agent.sinks=k agent.sources.s1.type=netcatagent.sources.s1.channels=c1agent.sources.s1.bind=0.0.0.0agent.sources.s1.port=1234 agent.channels.c1.type=memory agent.sinks.k1.type=loggeragent.sinks.k1.channel=c1 Here, I've defined one agent (called agent) who has a source named s1, a channel named c1, and a sink named k1. The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for a bind address (the Java convention to specify listen on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to, in this case, c1. It is plural, because you can configure a source to write to more than one channel; we just aren't doing that in this simple example. The channel named c1 is a memory channel with a default configuration. The sink named k1 is of the logger type. This is a sink that is mostly used for debugging and testing. It will log all events at the INFO level using Log4j, which it receives from the configured channel, in this case, c1. Here, the channel keyword is singular because a sink can only be fed data from one channel. Using this configuration, let's run the agent and connect to it using the Linux netcat utility to send an event. First, explode the .tar archive of the binary distribution we downloaded earlier: $ tar -zxf apache-flume-1.5.2-bin.tar.gz$ cd apache-flume-1.5.2-bin Next, let's briefly look at the help. Run the flume-ng command with the help command: $ ./bin/flume-ng helpUsage: ./bin/flume-ng <command> [options]... commands:help                 display this help textagent                run a Flume agentavro-client           run an avro Flume clientversion               show Flume version info global options:--conf,-c <conf>     use configs in <conf> directory--classpath,-C <cp>   append to the classpath--dryrun,-d          do not actually start Flume, just print the command--plugins-path <dirs> colon-separated list of plugins.d directories. See the                       plugins.d section in the user guide for more details.                       Default: $FLUME_HOME/plugins.d-Dproperty=value     sets a Java system property value-Xproperty=value     sets a Java -X option agent options:--conf-file,-f <file> specify a config file (required)--name,-n <name>     the name of this agent (required)--help,-h             display help text avro-client options:--rpcProps,-P <file>   RPC client properties file with server connection params--host,-H <host>       hostname to which events will be sent--port,-p <port>       port of the avro source--dirname <dir>       directory to stream to avro source--filename,-F <file>   text file to stream to avro source (default: std input)--headerFile,-R <file> File containing event headers as key/value pairs on each new line--help,-h             display help text Either --rpcProps or both --host and --port must be specified. Note that if <conf> directory is specified, then it is always included first in the classpath. As you can see, there are two ways with which you can invoke the command (other than the simple help and version commands). We will be using the agent command. 
The use of avro-client will be covered later. The agent command has two required parameters: a configuration file to use and the agent name (in case your configuration contains multiple agents). Let's take our sample configuration and open an editor (vi in my case, but use whatever you like): $ vi conf/hw.conf Next, place the contents of the preceding configuration into the editor, save, and exit back to the shell. Now you can start the agent: $ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console The -Dflume.root.logger property overrides the root logger in conf/log4j.properties to use the console appender. If we didn't override the root logger, everything would still work, but the output would go to the log/flume.log file instead of being based on the contents of the default configuration file. Of course, you can edit the conf/log4j.properties file and change the flume.root.logger property (or anything else you like). To change just the path or filename, you can set the flume.log.dir and flume.log.file properties in the configuration file or pass additional flags on the command line as follows: $ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console -Dflume.log.dir=/tmp -Dflume.log.file=flume-agent.log You might ask why you need to specify the -c parameter, as the -f parameter contains the complete relative path to the configuration. The reason for this is that the Log4j configuration file should be included on the class path. If you left the -c parameter off the command, you'll see this error: Warning: No configuration directory set! Use --conf <dir> to override.log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).log4j:WARN Please initialize the log4j system properly.log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info But you didn't do that so you should see these key log lines: 2014-10-05 15:39:06,109 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration foragents: [agent] This line tells you that your agent starts with the name agent. Usually you'd look for this line only to be sure you started the right configuration when you have multiple configurations defined in your configuration file. 2014-10-05 15:39:06,076 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloadingconfiguration file:conf/hw.conf This is another sanity check to make sure you are loading the correct file, in this case our hw.conf file. 2014-10-05 15:39:06,221 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)]Starting new configuration:{ sourceRunners:{s1=EventDrivenSourceRunner: { source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@442fbe47 counterGroup:{ name:null counters:{} } }}channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} } Once all the configurations have been parsed, you will see this message, which shows you everything that was configured. You can see s1, c1, and k1, and which Java classes are actually doing the work. As you probably guessed, netcat is a convenience for org.apache.flume.source.NetcatSource. We could have used the class name if we wanted. 
In fact, if I had my own custom source written, I would use its class name for the source's type parameter. You cannot define your own short names without patching the Flume distribution. 2014-10-05 15:39:06,427 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)] CreatedserverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345] Here, we see that our source is now listening on port 12345 for the input. So, let's send some data to it. Finally, open a second terminal. We'll use the nc command (you can use Telnet or anything else similar) to send the Hello World string and press the Return (Enter) key to mark the end of the event: % nc localhost 12345Hello WorldOK The OK message came from the agent after we pressed the Return key, signifying that it accepted the line of text as a single Flume event. If you look at the agent log, you will see the following: 2014-10-05 15:44:11,215 (SinkRunner-PollingRunner-DefaultSinkProcessor)[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64Hello World } This log message shows you that the Flume event contains no headers (NetcatSource doesn't add any itself). The body is shown in hexadecimal along with a string representation (for us humans to read, in this case, our Hello World message). If I send the following line and then press the Enter key, you'll get an OK message: The quick brown fox jumped over the lazy dog. You'll see this in the agent's log: 2014-10-05 15:44:57,232 (SinkRunner-PollingRunner-DefaultSinkProcessor)[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)]Event: { headers:{} body: 54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20The quick brown } The event appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to keep your screen from being filled with more than what you'd need in a debugging context. If you need to see the full contents for debugging, you should use a different sink, perhaps the file_roll sink, which would write to the local filesystem. Summary In this article, we covered how to download the Flume binary distribution. We created a simple configuration file that included one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and to send it event data. These events were written to an in-memory channel and then fed to a Log4j sink to become the output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out. Resources for Article: Further resources on this subject: About Cassandra [article] Introducing Kafka [article] Transformation [article]