
How-To Tutorials

6719 Articles

Getting started with Data Visualization in Tableau

Amarabha Banerjee
13 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is a book extract from Mastering Tableau, written by David Baldwin. Tableau has emerged as one of the most popular Business Intelligence solutions in recent times, thanks to its powerful and interactive data visualization capabilities. This book will empower you to become a master in Tableau by exploiting the many new features introduced in Tableau 10.0.[/box]

In today's post, we shall explore data visualization basics with Tableau and work through a real-world example using these techniques.

Tableau Software has a focused vision resulting in a small product line. The main product (and hence the center of the Tableau universe) is Tableau Desktop. Assuming you are a Tableau author, that's where almost all your time will be spent when working with Tableau. But of course you must be able to connect to data and output the results. Thus, as shown in the following figure, the Tableau universe encompasses data sources, Tableau Desktop, and output channels, which include the Tableau Server family and Tableau Reader.

Worksheet and dashboard creation

At the heart of Tableau are worksheets and dashboards. Worksheets contain individual visualizations and dashboards contain one or more worksheets. Additionally, worksheets and dashboards may be combined into stories to communicate specific insights to the end user via a presentation environment. Lastly, all worksheets, dashboards, and stories are organized in workbooks that can be accessed via Tableau Desktop, Server, or Reader. In this section, we will look at worksheet and dashboard creation with the intent of not only communicating the basics, but also providing some insight that may prove helpful even to more seasoned Tableau authors.

Worksheet creation

At the most fundamental level, a visualization in Tableau is created by placing one or more fields on one or more shelves. To state this as a pseudo-equation:

Field(s) + shelf(s) = Viz

As an example, note that the visualization created in the following screenshot is generated by placing the Sales field on the Text shelf. Although the result is quite simple – a single number – it does qualify as a view. In other words, a field (Sales) placed on a shelf (Text) has generated a viz.

Exercise – fundamentals of visualizations

Let's explore the basics of creating a visualization via an exercise:

1. Navigate to https://public.tableau.com/profile/david.baldwin#!/ to locate and download the workbook associated with this chapter.
2. In the workbook, find the tab labeled Fundamentals of Visualizations.
3. Locate Region within the Dimensions portion of the Data pane.
4. Drag Region to the Color shelf; that is, Region + Color shelf = what is shown in the following screenshot.
5. Click on the Color shelf and then on Edit Colors… to adjust colors as desired.
6. Next, move Region to the Size, Label/Text, Detail, Columns, and Rows shelves. After placing Region on each shelf, click on the shelf to access additional options.
7. Lastly, choose other fields to drop on various shelves to continue exploring Tableau's behavior.

As you continue exploring Tableau's behavior by dragging and dropping different fields onto different shelves, you will notice that Tableau responds with default behaviors. These defaults, however, can be overridden, which we will explore in the following section.

Dashboard creation

Although, as stated previously, a dashboard contains one or more worksheets, dashboards are much more than static presentations.
They are an essential part of Tableau's interactivity. In this section, we will populate a dashboard with worksheets and then deploy actions for interactivity.

Exercise – building a dashboard

1. In the workbook associated with this chapter, navigate to the tab entitled Building a Dashboard.
2. Within the Dashboard pane located on the left-hand portion of the screen, double-click on each of the following worksheets (in the order in which they are listed) to add them to the dashboard: US Sales, Customer Segment, Scatter Plot, and Customers.
3. In the lower right-hand corner of the dashboard, click in the blank area below Profit Ratio to select the vertical container. After clicking in the blank area, you should see a blue border around the filter and the legends. This indicates that the vertical container is selected.
4. As shown in the following screenshot, select the vertical container handle and drag it to the left-hand side of the Customers worksheet. Note the gray shading, which communicates where the container will be placed. The gray shading (provided by Tableau when dragging elements such as worksheets and containers onto a dashboard) helpfully communicates where the element will be placed. Take your time and observe carefully when placing an element on a dashboard or the results may be unexpected.
5. Format the dashboard as desired. The following tips may prove helpful:
  1. Adjust the sizes of the elements on the screen by hovering over the edges between each element and then clicking and dragging as desired.
  2. Note that the Sales and Profit legends in the following screenshot are floating elements. Make an element float by right-clicking on the element handle and selecting Floating. (See the previous screenshot and note that the handle is located immediately above Region, in the upper right-hand corner.)
  3. Create horizontal and vertical containers by dragging those objects from the bottom portion of the Dashboard pane.
  4. Drag the edges of containers to adjust the size of each worksheet.
  5. Display the dashboard title via Dashboard | Show Title….

If you enjoyed our post, be sure to check out Mastering Tableau, which covers many more useful data visualization and data analysis techniques.


How to build a music recommendation system with the PageRank algorithm

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book Mastering Spark for Data Science written by Andrew Morgan and Antoine Amend. In this book, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms to scale them linearly.[/box]

In today's tutorial, we will learn to build a recommender with the PageRank algorithm.

The PageRank algorithm

Instead of recommending a specific song, we will recommend playlists. A playlist would consist of a list of all our songs ranked by relevance, from most to least relevant. Let's begin with the assumption that people listen to music in a similar way to how they browse articles on the web, that is, following a logical path from link to link, but occasionally switching direction, or teleporting, and browsing to a totally different website. Continuing with the analogy, while listening to music one can either carry on listening to music of a similar style (and hence follow their most expected journey), or skip to a random song in a totally different genre. It turns out that this is exactly how Google ranks websites by popularity using the PageRank algorithm. For more details on the PageRank algorithm, visit http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf.

The popularity of a website is measured by the number of links it points to (and is referred from). In our music use case, popularity is built as the number of hashes a given song shares with all its neighbors. Instead of popularity, we introduce the concept of song commonality.

Building a graph of frequency co-occurrence

We start by reading our hash values back from Cassandra and re-establishing the list of song IDs for each distinct hash. Once we have this, we can count the number of hashes for each song using a simple reduceByKey function, and because the audio library is relatively small, we collect and broadcast it to our Spark executors:

val hashSongsRDD = sc.cassandraTable[HashSongsPair]("gzet", "hashes")

val songHashRDD = hashSongsRDD flatMap { hash =>
  hash.songs map { song =>
    ((hash, song), 1)
  }
}

val songTfRDD = songHashRDD map { case ((hash, songId), count) =>
  (songId, count)
} reduceByKey(_+_)

val songTfB = sc.broadcast(songTfRDD.collectAsMap())

Next, we build a co-occurrence matrix by getting the cross product of every song sharing the same hash value and counting how many times the same tuple is observed. Finally, we wrap the song IDs and the normalized (using the term frequency we just broadcast) frequency count inside an Edge class from GraphX:

implicit class Crossable[X](xs: Traversable[X]) {
  def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y)
}

val crossSongRDD = songHashRDD.keys
  .groupByKey()
  .values
  .flatMap { songIds =>
    (songIds cross songIds).filter { case (from, to) =>
      from != to
    }.map(_ -> 1)
  }
  .reduceByKey(_+_)
  .map { case ((from, to), count) =>
    // normalize by the term frequency broadcast above;
    // minSimilarityB is a broadcast minimum-similarity threshold defined earlier
    val weight = count.toDouble / songTfB.value.getOrElse(from, 1)
    Edge(from, to, weight)
  }
  .filter { edge =>
    edge.attr > minSimilarityB.value
  }

val graph = Graph.fromEdges(crossSongRDD, 0L)

We are only keeping edges with a weight (meaning a hash co-occurrence) greater than a predefined threshold in order to build our hash frequency graph.

Running PageRank

Contrary to what one would normally expect when running PageRank, our graph is undirected. It turns out that for our recommender, the lack of direction does not matter, since we are simply trying to find similarities between Led Zeppelin and Spirit.
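Before we tune the Spark version further, it can help to see the same ranking idea at toy scale. The following is a minimal, hypothetical sketch (not the book's GraphX code) that builds a small undirected co-occurrence graph with the networkx library and ranks songs with its PageRank implementation; the song names and shared-hash weights are invented purely for illustration, and the 15% teleport probability mirrors the one used below:

# Toy illustration of "song commonality" via PageRank -- not the Spark/GraphX
# pipeline above. Songs and edge weights are invented for demonstration only.
import networkx as nx

# Each weight stands for the normalized number of audio hashes two songs share.
co_occurrence = [
    ("Stairway to Heaven", "Taurus", 0.9),
    ("Stairway to Heaven", "Kashmir", 0.6),
    ("Kashmir", "Taurus", 0.4),
    ("Motherboard", "Kashmir", 0.05),  # barely similar to anything
]

graph = nx.Graph()                     # undirected, as in the recommender
graph.add_weighted_edges_from(co_occurrence)

# alpha = 1 - teleport probability, so 0.85 corresponds to a 15% teleport
ranks = nx.pagerank(graph, alpha=0.85, weight="weight")

for song, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{song:20s} {score:.3f}")

Songs that share many hashes with well-connected neighbors float to the top, while isolated songs sink, which is exactly the behavior discussed next.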
A possible way of introducing direction could be to look at the song publishing date. In order to find musical influences, we could certainly introduce a chronology from the oldest to the newest songs, giving directionality to our edges.

In the following pageRank call, we define a probability of 15% to skip, or teleport as it is known, to any random song, but this can obviously be tuned for different needs:

val prGraph = graph.pageRank(0.001, 0.15)

Finally, we extract the page-ranked vertices and save them as a playlist in Cassandra via an RDD of the Song case class:

case class Song(id: Long, name: String, commonality: Double)

val vertices = prGraph
  .vertices
  .mapPartitions { vertices =>
    // songIdsB is a broadcast map of song ID to song name built earlier
    val songIds = songIdsB.value
    vertices map { case (vId, pr) =>
      val songName = songIds.get(vId).get
      Song(vId, songName, pr)
    }
  }

vertices.saveAsCassandraTable("gzet", "playlist")

The reader may be pondering the exact purpose of PageRank here, and how it could be used as a recommender. In fact, our use of PageRank means that the highest-ranking songs will be the ones that share many frequencies with other songs. This could be due to a common arrangement, key theme, or melody; or maybe because a particular artist was a major influence on a musical trend. However, these songs should be, at least in theory, more popular (by virtue of the fact that they occur more often), meaning that they are more likely to have mass appeal.

On the other end of the spectrum, low-ranking songs are ones where we did not find any similarity with anything we know. Either these songs are so avant-garde that no one has explored these musical ideas before, or they are so bad that no one ever wanted to copy them! Maybe they were even composed by that up-and-coming artist you were listening to in your rebellious teenage years. Either way, the chance of a random user liking these songs is treated as negligible. Surprisingly, whether it is pure coincidence or whether this assumption really makes sense, the lowest-ranked song from this particular audio library is Daft Punk's Motherboard; it is a title that is quite original (a brilliant one, though) with a definitely unique sound.

To summarize, we have learnt how to build a complete recommendation system for a song playlist. You can check out the book Mastering Spark for Data Science to dive deeper into Spark and deliver other production-grade data science solutions. Read our post on how deep learning is revolutionizing the music industry, and here is how you can analyze big data using the PageRank algorithm.


How to maintain Apache Mesos

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by David Blomquist and Tomasz Janiszewski, titled Apache Mesos Cookbook. Throughout the course of the book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box]

In this article, we will learn about configuring logging options, setting up a monitoring ecosystem, and upgrading your Mesos cluster.

Logging and debugging

Here we will configure logging options that will allow us to debug the state of Mesos.

Getting ready

We will assume Mesos is available on localhost port 5050. The steps provided here will work for either masters or agents.

How to do it...

When Mesos is installed from pre-built packages, the logs are by default stored in /var/log/mesos/. When installing from a source build, storing logs is disabled by default. To change the log store location, we need to edit /etc/default/mesos and set the LOGS variable to the desired destination. For some reason, mesos-init-wrapper does not transfer the contents of /etc/mesos/log_dir to the --log_dir flag; that's why we need to set the log destination in the environment variable. Remember that only Mesos logs will be stored there. Logs from third-party applications (for example, ZooKeeper) will still be sent to STDERR.

Changing the default logging level can be done in one of two ways: by specifying the --logging_level flag, or by sending a request and changing the logging level at runtime for a specific period of time.

For example, to change the logging level to INFO, just put it in /etc/mesos/logging_level:

echo INFO > /etc/mesos/logging_level

The possible levels are INFO, WARNING, and ERROR.

To change the logging level to the most verbose setting for 15 minutes for debugging purposes, we need to send the following request to the logging/toggle endpoint:

curl -v -X POST "localhost:5050/logging/toggle?level=3&duration=15mins"

How it works...

Mesos uses the Google glog library for debugging, but third-party dependencies such as ZooKeeper have their own logging solutions. All configuration options are backed by glog and apply only to Mesos core code.

Monitoring

Now, we will set up monitoring for Mesos.

Getting ready

We must have a running monitoring ecosystem. Metrics storage could be a simple time-series database such as Graphite, InfluxDB, or Prometheus. In the following example, we are using Graphite, and our metrics are published with Diamond (http://diamond.readthedocs.io/en/latest/).

How to do it...

Monitoring is enabled by default. Mesos does not provide any way to automatically push metrics to the registry. However, it exposes them as JSON that can be periodically pulled and saved into the metrics registry:

1. Install Diamond using the following command:

pip install diamond

If additional packages are required to install it, run:

sudo apt-get install python-pip python-dev build-essential

pip (Pip Installs Packages) is a Python package manager used to install software written in Python.

2. Configure the metrics handler and interval. Open /etc/diamond/diamond.conf and ensure that there is a section for the Graphite configuration:

[handler_graphite]
class = handlers.GraphiteHandler
host = <graphite.host>
port = <graphite.port>

Remember to replace graphite.host and graphite.port with the real Graphite details.

3. Enable the default Mesos collector. Create the configuration files with diamond-setup -C MesosCollector. Check whether the configuration has proper values and edit them if needed.
The configuration can be found in /etc/diamond/collectors/MesosCollector.conf. On a master, this file should look like this:

enabled = True
host = localhost
port = 5050

While on an agent, the port could be different (5051), as follows:

enabled = True
host = localhost
port = 5051

How it works...

Mesos exposes metrics via the HTTP API. Diamond is a small process that periodically pulls metrics, parses them, and sends them to the metrics registry, in this case Graphite. The default implementation of the Mesos collector does not store all the available metrics, so it's recommended to write a custom handler that will collect all the interesting information.

See also...

Metrics can be read from the following endpoints:

http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/
http://mesos.apache.org/documentation/latest/endpoints/slave/monitor/statistics/
http://mesos.apache.org/documentation/latest/endpoints/slave/state/

Upgrading Mesos

In this recipe, you will learn how to upgrade your Mesos cluster.

How to do it...

The Mesos release cadence is at least one release per quarter. Minor releases are backward compatible, although there could be some small incompatibilities or the dropping of deprecated methods. The recommended method of upgrading is to apply all intermediate versions. For example, to upgrade from 0.27.2 to 1.0.0, we should apply 0.28.0, 0.28.1, 0.28.2, and finally 1.0.0.

If the agent's configuration changes, clearing the metadata directory is required. You can do this with the following command:

rm -rv {MESOS_DIR}/metadata

Here, {MESOS_DIR} should be replaced with the configured Mesos directory.

A rolling upgrade is the preferred method of upgrading clusters, starting with the masters and then the agents. To minimize the impact on running tasks, if an agent's configuration changes and it will become inaccessible, it should be switched to maintenance mode.

How it works...

Configuration changes may require clearing the metadata because the changes may not be backward compatible. For example, when an agent runs with different isolators, it shouldn't attach to already running processes that were started without those isolators. The Mesos architecture guarantees that executors that were not attached to the Mesos agent will commit suicide after a configurable amount of time (--executor_registration_timeout).

Maintenance mode allows you to declare the time window during which the agent will be inaccessible. When this occurs, Mesos will send a reverse offer to all the frameworks to drain that particular agent. The frameworks are responsible for shutting down their tasks and spawning them on another agent. Maintenance mode is applied even if the framework does not implement the HTTP API or explicitly declines it.

Using maintenance mode can prevent restarting tasks multiple times. Consider the following example with five agents and one task, X. We schedule a rolling upgrade of all the agents. Task X is deployed on agent 1. When agent 1 goes down, the task is moved to agent 2, then to 3, and so on. This approach is extremely inefficient because the task is restarted five times, but it only needs to be restarted twice. Maintenance mode enables the framework to optimally schedule the task to run on agent 5 when agent 1 goes down, and then return to agent 1 when agent 5 goes down. (The accompanying figure contrasts the worst-case scenario of a rolling upgrade without maintenance mode with the optimal solution using maintenance mode.)

We have learnt about running and maintaining Mesos.
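As a quick companion to the monitoring recipe above, the snippet below is a small, hypothetical sketch (not from the book) that pulls the /metrics/snapshot endpoint listed in the See also section and prints the master-level metrics, which is essentially what Diamond does on a schedule; the host, port, and metric-name prefix are assumptions for a default local master:

# Manual version of what Diamond does periodically: pull the JSON metrics
# snapshot and inspect it. Assumes a default master on localhost:5050.
import json
from urllib.request import urlopen

MESOS_MASTER = "http://localhost:5050"

with urlopen(f"{MESOS_MASTER}/metrics/snapshot", timeout=5) as response:
    snapshot = json.load(response)   # flat mapping of metric name -> value

# Print the master-level metrics; exact key names depend on the Mesos version.
for name in sorted(snapshot):
    if name.startswith("master/"):
        print(f"{name} = {snapshot[name]}")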
To know more about managing containers and understanding the scheduler API you may check out this book, Apache Mesos Cookbook.


Hypothesis testing with R

Richa Tripathi
13 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Learning Quantitative Finance with R written by Dr. Param Jeet and Prashant Vats. This book will help you understand the basics of R and how they can be applied in various Quantitative Finance scenarios.[/box]

Hypothesis testing is used to reject or retain a hypothesis based upon the measurement of an observed sample. So in today's tutorial we will discuss how to implement the various scenarios of hypothesis testing in R.

Lower tail test of population mean with known variance

The null hypothesis is given by H0: μ ≥ μ0, where μ0 is the hypothesized lower bound of the population mean.

Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $10. The average of a 30-day daily return sample is $9.9. Assume the population standard deviation is 1.1. Can we reject the null hypothesis at the .05 significance level?

Now let us calculate the test statistic z, which can be computed by the following code in R:

> xbar = 9.9
> mu0 = 10
> sig = 1.1
> n = 30
> z = (xbar-mu0)/(sig/sqrt(n))
> z

Here:

xbar: Sample mean
mu0: Hypothesized value
sig: Standard deviation of the population
n: Sample size
z: Test statistic

This gives the value of z, the test statistic:

[1] -0.4979296

Now let us find the critical value at the 0.05 significance level. It can be computed by the following code:

> alpha = .05
> z.alpha = qnorm(1-alpha)
> -z.alpha

This gives the following output:

[1] -1.644854

Since the value of the test statistic is greater than the critical value, we fail to reject the null hypothesis claim that the return is greater than $10.

In place of using the critical value test, we can use the pnorm function to compute the lower-tail p-value of the test statistic. This can be computed by the following code:

> pnorm(z)

This gives the following output:

[1] 0.3092668

Since the p-value is greater than 0.05, we fail to reject the null hypothesis.

Upper tail test of population mean with known variance

The null hypothesis is given by H0: μ ≤ μ0, where μ0 is the hypothesized upper bound of the population mean.

Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $5. The average of a 30-day daily return sample is $5.1. Assume the population standard deviation is 0.25. Can we reject the null hypothesis at the .05 significance level?

Now let us calculate the test statistic z, which can be computed by the following code in R:

> xbar = 5.1
> mu0 = 5
> sig = .25
> n = 30
> z = (xbar-mu0)/(sig/sqrt(n))
> z

Here:

xbar: Sample mean
mu0: Hypothesized value
sig: Standard deviation of the population
n: Sample size
z: Test statistic

It gives 2.19089 as the value of the test statistic. Now let us calculate the critical value at the .05 significance level, which is given by the following code:

> alpha = .05
> z.alpha = qnorm(1-alpha)
> z.alpha

This gives 1.644854, which is less than the value computed for the test statistic. Hence we reject the null hypothesis claim.

Also, the p-value of the test statistic is given as follows:

> pnorm(z, lower.tail=FALSE)

This gives 0.01422987, which is less than 0.05, and hence we reject the null hypothesis.

Two-tailed test of population mean with known variance

The null hypothesis is given by H0: μ = μ0, where μ0 is the hypothesized value of the population mean.

Let us assume a scenario where the mean of daily returns of a stock last year is $2. The average of a 30-day daily return sample is $1.5 this year.
Assume the population standard deviation is .1. Can we reject the null hypothesis that there is no significant difference between returns this year and last year at the .05 significance level?

Now let us calculate the test statistic z, which can be computed by the following code in R:

> xbar = 1.5
> mu0 = 2
> sig = .1
> n = 30
> z = (xbar-mu0)/(sig/sqrt(n))
> z

This gives the value of the test statistic as -27.38613. Now let us try to find the critical values for comparing the test statistic at the .05 significance level. These are given by the following code:

> alpha = .05
> z.half.alpha = qnorm(1-alpha/2)
> c(-z.half.alpha, z.half.alpha)

This gives the values -1.959964 and 1.959964. Since the value of the test statistic is not within the range (-1.959964, 1.959964), we reject the claim of the null hypothesis that there is no significant difference between returns this year and last year at the .05 significance level. The two-tailed p-value is given as follows:

> 2*pnorm(z)

This gives a value less than .05, so we reject the null hypothesis.

In all the preceding scenarios, the variance of the population is known and we use the normal distribution for hypothesis testing. However, in the next scenarios, we will not be given the variance of the population, so we will be using the t distribution for testing the hypothesis.

Lower tail test of population mean with unknown variance

The null hypothesis is given by H0: μ ≥ μ0, where μ0 is the hypothesized lower bound of the population mean.

Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $1. The average of a 30-day daily return sample is $.9. Assume the sample standard deviation is .1. Can we reject the null hypothesis at the .05 significance level?

In this scenario, we can compute the test statistic by executing the following code:

> xbar = .9
> mu0 = 1
> sig = .1
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

Here:

xbar: Sample mean
mu0: Hypothesized value
sig: Standard deviation of the sample
n: Sample size
t: Test statistic

This gives the value of the test statistic as -5.477226. Now let us compute the critical value at the .05 significance level. This is given by the following code:

> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> -t.alpha

We get the value -1.699127. Since the value of the test statistic is less than the critical value, we reject the null hypothesis claim.

Now, instead of the value of the test statistic, we can use the p-value associated with the test statistic, which is given as follows:

> pt(t, df=n-1)

This results in a value less than .05, so we can reject the null hypothesis claim.

Upper tail test of population mean with unknown variance

The null hypothesis is given by H0: μ ≤ μ0, where μ0 is the hypothesized upper bound of the population mean.

Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $3. The average of a 30-day daily return sample is $3.1. Assume the sample standard deviation is .2. Can we reject the null hypothesis at the .05 significance level?

Now let us calculate the test statistic t, which can be computed by the following code in R:

> xbar = 3.1
> mu0 = 3
> sig = .2
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

Here:

xbar: Sample mean
mu0: Hypothesized value
sig: Standard deviation of the sample
n: Sample size
t: Test statistic

This gives 2.738613 as the value of the test statistic. Now let us find the critical value associated with the .05 significance level for the test statistic.
It is given by the following code:

> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> t.alpha

Since the critical value 1.699127 is less than the value of the test statistic, we reject the null hypothesis claim. Also, the p-value associated with the test statistic is given as follows:

> pt(t, df=n-1, lower.tail=FALSE)

This is less than .05, hence the null hypothesis claim gets rejected.

Two-tailed test of population mean with unknown variance

The null hypothesis is given by H0: μ = μ0, where μ0 is the hypothesized value of the population mean.

Let us assume a scenario where the mean of daily returns of a stock last year is $2. The average of a 30-day daily return sample is $1.9 this year. Assume the sample standard deviation is .1. Can we reject the null hypothesis that there is no significant difference between returns this year and last year at the .05 significance level?

Now let us calculate the test statistic t, which can be computed by the following code in R:

> xbar = 1.9
> mu0 = 2
> sig = .1
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

This gives -5.477226 as the value of the test statistic. Now let us try to find the critical value range for comparison, which is given by the following code:

> alpha = .05
> t.half.alpha = qt(1-alpha/2, df=n-1)
> c(-t.half.alpha, t.half.alpha)

This gives the range (-2.04523, 2.04523). Since the value of the test statistic falls outside this range, we reject the claim of the null hypothesis.

We learned how to practically perform one-tailed and two-tailed hypothesis testing with known as well as unknown variance using R. If you enjoyed this excerpt, check out the book Learning Quantitative Finance with R to explore different methods to manage risks and trading using Machine Learning with R.
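For readers who want the same decision rules outside R, here is a brief, hypothetical Python sketch using scipy.stats (not part of the book) that reproduces the known-variance z-tests above; the sample numbers in the usage line are the ones from the first scenario:

# Known-variance one-sample z-test: lower-tail, upper-tail, or two-sided.
# Companion sketch to the R examples above; not from the book.
from math import sqrt
from scipy.stats import norm

def z_test(xbar, mu0, sigma, n, tail, alpha=0.05):
    """Return (z, p_value, reject_null) for a one-sample z-test."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    if tail == "lower":
        p = norm.cdf(z)
    elif tail == "upper":
        p = norm.sf(z)            # 1 - cdf, the upper tail
    else:                         # two-sided
        p = 2 * norm.sf(abs(z))
    return z, p, p < alpha

# First scenario: xbar = 9.9, mu0 = 10, sigma = 1.1, n = 30, lower-tail test.
print(z_test(9.9, 10, 1.1, 30, "lower"))  # z ~ -0.498, p ~ 0.309 -> fail to reject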


What the Tableau Data Handling Engine has to offer

Amarabha Banerjee
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is taken from the book Mastering Tableau, written by David Baldwin. This book will equip you with all the information needed to create effective dashboards and data visualization solutions using Tableau.[/box]

In today's tutorial, we shall explore the Tableau data-handling engine and a real-world example of how to use it.

Tableau's data-handling engine is usually not well comprehended by even advanced authors because it's not an overt part of day-to-day activities; however, for the author who wants to truly grasp how to ready data for Tableau, this understanding is indispensable. In this section, we will explore Tableau's data-handling engine and how it enables structured yet organic data mining processes in the enterprise.

To begin, let's clarify a term. The phrase Data-Handling Engine (DHE) in this context references how Tableau interfaces with and processes data. This interfacing and processing comprises three major parts: Connection, Metadata, and VizQL. Each part is described in detail in the following section. In other publications, Tableau's DHE may be referred to as a metadata model or the Tableau infrastructure. I've elected not to use either term because each is frequently defined differently in different contexts, which can be quite confusing.

Tableau's DHE (that is, the engine for interfacing with and processing data) differs from other broadly considered solutions in the marketplace. Legacy business intelligence solutions often start with structuring the data for an entire enterprise. Data sources are identified, connections are established, metadata is defined, a model is created, and more. The upfront challenges this approach presents are obvious: highly skilled professionals, time-intensive rollout, and associated high startup costs. The payoff is a scalable, structured solution with detailed documentation and process control.

Many next-generation business intelligence platforms claim to minimize or completely do away with the need for structuring data. The upfront challenges are minimized: specialized skillsets are not required and the rollout time and associated startup costs are low. However, the initial honeymoon is short-lived, since the total cost of ownership advances significantly when difficulties are encountered trying to maintain and scale the solution.

Tableau's infrastructure represents a hybrid approach, which attempts to combine the advantages of legacy business intelligence solutions with those of next-generation platforms, while minimizing the shortcomings of both. The philosophical underpinnings of Tableau's hybrid approach include the following:

Infrastructure present in current systems should be utilized when advantageous
Data models should be accessible by Tableau but not required
DHE components as represented in Tableau should be easy to modify
DHE components should be adjustable by business users

The Tableau Data-Handling Engine

The preceding diagram shows that the DHE consists of a runtime module (VizQL) and two layers of abstraction (Metadata and Connection). Let's begin at the bottom of the graphic by considering the first layer of abstraction, Connection. The most fundamental aspect of the Connection is a path to the data source. The path should include attributes for the database, tables, and views as applicable. The Connection may also include joins, custom SQL, data-source filters, and more.
In keeping with Tableau's philosophy of easy to modify and adjustable by business users (see the previous section), each of these aspects of the Connection is easily modifiable. For example, an author may choose to add an additional table to a join or modify a data-source filter. Note that the Connection does not contain any of the actual data. Although an author may choose to create a data extract based on data accessed by the Connection, that extract is separate from the Connection.

The next layer of abstraction is the Metadata. The most fundamental aspect of the Metadata layer is the determination of each field as a measure or dimension. When connecting to relational data, Tableau makes the measure/dimension determination based on heuristics that consider the data itself as well as the data source's data types. Other aspects of the metadata include aliases, data types, defaults, roles, and more. Additionally, the Metadata layer encompasses author-generated fields such as calculations, sets, groups, hierarchies, bins, and so on. Because the Metadata layer is completely separate from the Connection layer, it can be used with other Connection layers; that is, the same metadata definitions can be used with different data sources.

VizQL is generated when a user places a field on a shelf. The VizQL is then translated into Structured Query Language (SQL), Multidimensional Expressions (MDX), or Tableau Query Language (TQL) and passed to the backend data source via a driver. The following two aspects of the VizQL module are of primary importance:

VizQL allows the author to change field attributions on the fly
VizQL enables table calculations

Let's consider each of these aspects of VizQL via examples.

Changing field attribution example

An analyst is considering infant mortality rates around the world. Using data from http://data.worldbank.org/, they create the following worksheet by placing AVG(Infant Mortality Rate) and Country on the Columns and Rows shelves, respectively. AVG(Infant Mortality Rate) is, of course, treated as a measure in this case.

Next, they create a second worksheet to analyze the relationship between Infant Mortality Rate and Health Exp/Capita (that is, health expenditure per capita). In order to accomplish this, they define Infant Mortality Rate as a dimension, as shown in the following screenshot.

Studying the SQL generated by VizQL to create the preceding visualization is particularly insightful:

SELECT ['World Indicators$'].[Infant Mortality Rate] AS [Infant Mortality Rate],
  AVG(['World Indicators$'].[Health Exp/Capita]) AS [avg:Health Exp/Capita:ok]
FROM [dbo].['World Indicators$'] ['World Indicators$']
GROUP BY ['World Indicators$'].[Infant Mortality Rate]

The GROUP BY clause clearly communicates that Infant Mortality Rate is treated as a dimension. The takeaway is to note that VizQL enabled the analyst to change the field usage from measure to dimension without adjusting the source metadata. This on-the-fly ability enables creative exploration of the data not possible with other tools and avoids lengthy exercises attempting to define all possible uses for each field.

If you liked our article, be sure to check out Mastering Tableau, which covers more useful data visualization and data analysis techniques.
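The measure-versus-dimension switch that VizQL performs can also be illustrated outside Tableau. The following is a small, hypothetical pandas sketch (not Tableau code): treating Infant Mortality Rate as a measure means aggregating it, while treating it as a dimension means grouping by it, which is exactly what the GROUP BY clause above does. The column names echo the example, but the data is invented for illustration:

# Illustration only: the same field used first as a measure, then as a dimension.
import pandas as pd

world = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Infant Mortality Rate": [0.02, 0.02, 0.05, 0.05],
    "Health Exp/Capita": [4000, 3500, 800, 650],
})

# As a measure: the field is aggregated (like AVG(Infant Mortality Rate)).
per_country = world.groupby("Country")["Infant Mortality Rate"].mean()

# As a dimension: the field is grouped by and something else is aggregated,
# mirroring the GROUP BY in the VizQL-generated SQL above.
per_rate = world.groupby("Infant Mortality Rate")["Health Exp/Capita"].mean()

print(per_country)
print(per_rate)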


Estimating population statistics with Point Estimation

Aaron Lazar
12 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an extract from the book Principles of Data Science, written by Sinan Ozdemir. The book is a great way to get into the field of data science. It takes a unique approach that bridges the gap between mathematics and computer science, taking you through the entire data science pipeline.[/box]

In this extract, we'll learn how to estimate population means, variances, and other statistics using the point estimation method. For the code samples, we've used Python 2.7.

A point estimate is an estimate of a population parameter based on sample data. To obtain these estimates, we simply apply the function that we wish to measure for our population to a sample of the data. For example, suppose there is a company of 9,000 employees and we are interested in ascertaining the average length of breaks taken by employees in a single day. As we probably cannot ask every single person, we will take a sample of the 9,000 people and take a mean of the sample. This sample mean will be our point estimate.

The following code is broken into three parts:

We will use a probability distribution, known as the Poisson distribution, to randomly generate 9,000 answers to the question: for how many minutes in a day do you usually take breaks? This will represent our "population".
We will take a sample of 100 employees (using the Python random sample method) and find a point estimate of the mean (called a sample mean).
We will compare our sample mean (the mean of the sample of 100 employees) to our population mean.

The snippets assume the standard imports behind these aliases: numpy as np, pandas as pd, scipy.stats as stats, and Python's built-in random module. Let's take a look at the following code:

np.random.seed(1234)

long_breaks = stats.poisson.rvs(loc=10, mu=60, size=3000)
# represents 3000 people who take about a 60 minute break

The long_breaks variable represents 3000 answers to the question: how many minutes on average do you take breaks for?, and these answers will be on the longer side. Let's see a visualization of this distribution, shown as follows:

pd.Series(long_breaks).hist()

We see that our average of 60 minutes is to the left of the distribution. Also, because we only simulated 3000 people, our bins are at their highest at around 700-800 people.

Now, let's model 6000 people who take, on average, about 15 minutes' worth of breaks. Let's again use the Poisson distribution to simulate 6000 people, as shown:

short_breaks = stats.poisson.rvs(loc=10, mu=15, size=6000)
# represents 6000 people who take about a 15 minute break

pd.Series(short_breaks).hist()

Okay, so we have a distribution for the people who take longer breaks and a distribution for the people who take shorter breaks. Again, note how our average break length of 15 minutes falls to the left-hand side of the distribution, and note that the tallest bar is about 1600 people.

breaks = np.concatenate((long_breaks, short_breaks))
# put the two arrays together to get our "population" of 9000 people

The breaks variable is the amalgamation of all 9000 employees, both long and short break takers. Let's see the entire distribution of people in a single visualization:

pd.Series(breaks).hist()

We see that we have two humps. On the left, we have our larger hump of people who take about a 15 minute break, and on the right, we have a smaller hump of people who take longer breaks. Later on, we will investigate this graph further.

We can find the total average break length by running the following code:

breaks.mean()
# 39.99 minutes is our parameter

Our average company break length is about 40 minutes.
Remember that our population is the entire company's employee base of 9,000 people, and our parameter is 40 minutes. In the real world, our goal would be to estimate the population parameter because, for many reasons, we would not have the resources to survey every single employee about their average break length. Instead, we will use a point estimate.

So, to make our point, we want to simulate a world where we ask 100 random people about the length of their breaks. To do this, let's take a random sample of 100 employees out of the 9,000 employees we simulated, as shown:

sample_breaks = np.random.choice(a=breaks, size=100)
# taking a sample of 100 employees

Now, let's take the mean of the sample and subtract it from the population mean to see how far off we were:

breaks.mean() - sample_breaks.mean()
# difference between means is 4.09 minutes, not bad!

This is extremely interesting, because with only about 1% of our population (100 out of 9,000), we were able to get within 4 minutes of our population parameter and obtain a very accurate estimate of our population mean. Not bad!

Here, we calculated a point estimate for the mean, but we can also do this for proportion parameters. By proportion, I am referring to a ratio of two quantitative values. Let's suppose that in a company of 10,000 people, our employees are 20% white, 10% black, 10% Hispanic, 30% Asian, and 30% identify as other. We will take a sample of 1,000 employees and see if their race proportions are similar.

employee_races = (["white"] * 2000 + ["black"] * 1000 +
                  ["hispanic"] * 1000 + ["asian"] * 3000 +
                  ["other"] * 3000)

employee_races represents our employee population. For example, in our company of 10,000 people, 2,000 people are white (20%) and 3,000 people are Asian (30%). Let's take a random sample of 1,000 people, as shown:

demo_sample = random.sample(employee_races, 1000)  # Sample 1000 values

for race in set(demo_sample):
    print( race + " proportion estimate:" )
    print( demo_sample.count(race)/1000. )

The output obtained would be as follows:

hispanic proportion estimate:
0.103
white proportion estimate:
0.192
other proportion estimate:
0.288
black proportion estimate:
0.1
asian proportion estimate:
0.317

We can see that the race proportion estimates are very close to the underlying population's proportions. For example, we got 10.3% for Hispanic in our sample, and the population proportion for Hispanic was 10%.

To summarize, you are now familiar with the point estimation method for estimating population means, variances, and other statistics, and with implementing it in Python. If you found our post useful, you can check out Principles of Data Science for more interesting data science tips and techniques.
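To see how the quality of a point estimate depends on sample size, here is a short, hypothetical extension of the example (not from the book) that rebuilds the same break population, repeats the sampling experiment for a few sample sizes, and reports the average absolute error of the sample mean:

# Hypothetical follow-up: how the point-estimate error shrinks as samples grow.
import numpy as np
from scipy import stats

np.random.seed(1234)
long_breaks = stats.poisson.rvs(loc=10, mu=60, size=3000)
short_breaks = stats.poisson.rvs(loc=10, mu=15, size=6000)
breaks = np.concatenate((long_breaks, short_breaks))

true_mean = breaks.mean()
for size in (10, 100, 1000):
    # Repeat the sampling experiment 200 times for each sample size.
    errors = [abs(np.random.choice(breaks, size=size).mean() - true_mean)
              for _ in range(200)]
    print("sample size %4d: average error %.2f minutes" % (size, np.mean(errors)))

Larger samples give point estimates that sit closer to the population parameter on average, which is why the 100-person sample above already lands within a few minutes of the true mean.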

Neural Network Architectures 101: Understanding Perceptrons

Kunal Chaudhari
12 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Neural Network Programming with Java Second Edition written by Fabio M. Soares and Alan M. F. Souza. This book is for Java developers who want to master developing smarter applications like weather forecasting, pattern recognition, etc. using neural networks.[/box]

In this article we will discuss perceptrons along with their features, applications, and limitations.

Perceptrons are a very popular neural network architecture that implements supervised learning. Proposed by Frank Rosenblatt in 1957, a perceptron has just one layer of neurons, receiving a set of inputs and producing another set of outputs. This was one of the first representations of neural networks to gain attention, especially because of its simplicity. In our Java implementation, this is illustrated with one neural layer (the output layer). The following code creates a perceptron with three inputs and two outputs, having the linear function at the output layer:

int numberOfInputs = 3;
int numberOfOutputs = 2;

Linear outputAcFnc = new Linear(1.0);
NeuralNet perceptron = new NeuralNet(numberOfInputs, numberOfOutputs,
    outputAcFnc);

Applications and limitations

However, scientists did not take long to conclude that a perceptron neural network could only be applied to simple tasks, in keeping with that simplicity. At that time, neural networks were being used for simple classification problems, but perceptrons usually failed when faced with more complex datasets. Let's illustrate this with a very basic example (an AND function) to understand this issue better.

Linear separation

The example consists of an AND function that takes two inputs, x1 and x2. That function can be plotted in a two-dimensional chart as follows.

Now let's examine how the neural network evolves during training using the perceptron rule, considering two weights, w1 and w2, both initially 0.5, and a bias valued 0.5 as well. Assume the learning rate η equals 0.2:

Epoch | x1 | x2 | w1    | w2    | b      | y      | t | E      | Δw1    | Δw2    | Δb
1     | 0  | 0  | 0.5   | 0.5   | 0.5    | 0.5    | 0 | -0.5   | 0      | 0      | -0.1
1     | 0  | 1  | 0.5   | 0.5   | 0.4    | 0.9    | 0 | -0.9   | 0      | -0.18  | -0.18
1     | 1  | 0  | 0.5   | 0.32  | 0.22   | 0.72   | 0 | -0.72  | -0.144 | 0      | -0.144
1     | 1  | 1  | 0.356 | 0.32  | 0.076  | 0.752  | 1 | 0.248  | 0.0496 | 0.0496 | 0.0496
2     | 0  | 0  | 0.406 | 0.370 | 0.126  | 0.126  | 0 | -0.126 | 0.000  | 0.000  | -0.025
2     | 0  | 1  | 0.406 | 0.370 | 0.100  | 0.470  | 0 | -0.470 | 0.000  | -0.094 | -0.094
2     | 1  | 0  | 0.406 | 0.276 | 0.006  | 0.412  | 0 | -0.412 | -0.082 | 0.000  | -0.082
2     | 1  | 1  | 0.323 | 0.276 | -0.076 | 0.523  | 1 | 0.477  | 0.095  | 0.095  | 0.095
…     |    |    |       |       |        |        |   |        |        |        |
89    | 0  | 0  | 0.625 | 0.562 | -0.312 | -0.312 | 0 | 0.312  | 0      | 0      | 0.062
89    | 0  | 1  | 0.625 | 0.562 | -0.25  | 0.313  | 0 | -0.313 | 0      | -0.063 | -0.063
89    | 1  | 0  | 0.625 | 0.500 | -0.312 | 0.313  | 0 | -0.313 | -0.063 | 0      | -0.063
89    | 1  | 1  | 0.562 | 0.500 | -0.375 | 0.687  | 1 | 0.313  | 0.063  | 0.063  | 0.063

After 89 epochs, we find the network producing values near to the desired output. Since in this example the outputs are binary (zero or one), we can assume that any value produced by the network that is below 0.5 is considered to be 0 and any value above 0.5 is considered to be 1. So, we can draw the boundary where the network output equals 0.5, that is, w1·x1 + w2·x2 + b = 0.5, with the final weights and bias found by the learning algorithm (w1 = 0.562, w2 = 0.5, and b = -0.375) defining the linear boundary in the chart.

This boundary is a definition of all classifications given by the network. You can see that the boundary is linear, given that the function is also linear. Thus, the perceptron network is really suitable for problems whose patterns are linearly separable.

The XOR case

Now let's analyze the XOR case. We see that in two dimensions, it is impossible to draw a line to separate the two patterns.
What would happen if we tried to train a single-layer perceptron to learn this function? Suppose we tried; let's see what happens in the following table:

Epoch | x1 | x2 | w1     | w2     | b     | y     | t | E      | Δw1    | Δw2    | Δb
1     | 0  | 0  | 0.5    | 0.5    | 0.5   | 0.5   | 0 | -0.5   | 0      | 0      | -0.1
1     | 0  | 1  | 0.5    | 0.5    | 0.4   | 0.9   | 1 | 0.1    | 0      | 0.02   | 0.02
1     | 1  | 0  | 0.5    | 0.52   | 0.42  | 0.92  | 1 | 0.08   | 0.016  | 0      | 0.016
1     | 1  | 1  | 0.516  | 0.52   | 0.436 | 1.472 | 0 | -1.472 | -0.294 | -0.294 | -0.294
2     | 0  | 0  | 0.222  | 0.226  | 0.142 | 0.142 | 0 | -0.142 | 0.000  | 0.000  | -0.028
2     | 0  | 1  | 0.222  | 0.226  | 0.113 | 0.339 | 1 | 0.661  | 0.000  | 0.132  | 0.132
2     | 1  | 0  | 0.222  | 0.358  | 0.246 | 0.467 | 1 | 0.533  | 0.107  | 0.000  | 0.107
2     | 1  | 1  | 0.328  | 0.358  | 0.352 | 1.038 | 0 | -1.038 | -0.208 | -0.208 | -0.208
…     |    |    |        |        |       |       |   |        |        |        |
127   | 0  | 0  | -0.250 | -0.125 | 0.625 | 0.625 | 0 | -0.625 | 0.000  | 0.000  | -0.125
127   | 0  | 1  | -0.250 | -0.125 | 0.500 | 0.375 | 1 | 0.625  | 0.000  | 0.125  | 0.125
127   | 1  | 0  | -0.250 | 0.000  | 0.625 | 0.375 | 1 | 0.625  | 0.125  | 0.000  | 0.125
127   | 1  | 1  | -0.125 | 0.000  | 0.750 | 0.625 | 0 | -0.625 | -0.125 | -0.125 | -0.125

The perceptron just could not find any pair of weights that would drive the error below 0.625. This can be explained mathematically: as we already perceived from the chart, this function cannot be linearly separated in two dimensions. So what if we add another dimension? Let's see the chart in three dimensions.

In three dimensions, it is possible to draw a plane that would separate the patterns, provided that this additional dimension could properly transform the input data. Okay, but now there is an additional problem: how could we derive this additional dimension, since we have only two input variables? One obvious, albeit workaround, answer would be adding a third variable as a derivation from the two original ones. With this third variable being a derivation, our neural network would probably take the following shape.

Okay, now the perceptron has three inputs, one of them being a composition of the others. This also leads to a new question: how should that composition be processed? We can see that this component could act as a neuron, thereby giving the neural network a nested architecture. If so, there would be another new question: how would the weights of this new neuron be trained, since the error is on the output neuron?

Multi-layer perceptrons

As we can see, one simple example in which the patterns are not linearly separable has led us to more and more issues with the perceptron architecture. That need led to the application of multi-layer perceptrons. The fact that natural neural networks are structured in layers as well, and that each layer captures pieces of information from a specific environment, is already established. In artificial neural networks, layers of neurons act in this way, by extracting and abstracting information from data, transforming it into another dimension or shape.

In the XOR example, we found the solution to be the addition of a third component that would make a linear separation possible. But there remained a few questions regarding how that third component would be computed. Now let's consider the same solution as a two-layer perceptron.

Now we have three neurons instead of just one, but in the output the information transferred by the previous layer is transformed into another dimension or shape, whereby it would be theoretically possible to establish a linear boundary on those data points. However, the question of finding the weights for the first layer remains unanswered; or can we apply the same training rule to neurons other than the output? We are going to deal with this issue in the Generalized delta rule section.
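The AND and XOR training tables above are easy to reproduce. Below is a compact, hypothetical NumPy sketch of the single-layer perceptron rule used in those tables (the book's implementation is in Java, so this is only an illustration); with the AND targets it settles on the epoch-89 weights shown in the first table, while with the XOR targets the error stays stuck at 0.625 for every pattern, which is the linear-separability limit just described:

# Single-layer perceptron rule from the tables above (linear output unit).
# Illustrative NumPy sketch only -- the book's implementation is in Java.
import numpy as np

def train_perceptron(targets, epochs, eta=0.2):
    x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    w = np.array([0.5, 0.5])    # w1, w2, initialized as in the tables
    b = 0.5                     # bias, also 0.5
    for _ in range(epochs):
        for xi, t in zip(x, targets):
            y = xi @ w + b              # linear output (y column)
            error = t - y               # E column
            w += eta * error * xi       # delta-w columns
            b += eta * error            # delta-b column
    return w, b

print(train_perceptron(targets=[0, 0, 0, 1], epochs=89))   # AND: matches the epoch-89 rows
print(train_perceptron(targets=[0, 1, 1, 0], epochs=127))  # XOR: never separates the classes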
MLP properties

Multi-layer perceptrons can have any number of layers and also any number of neurons in each layer. The activation functions may be different in each layer. An MLP network is usually composed of at least two layers, one for the output and one hidden layer. There are also some references that consider the input layer to be the nodes that collect input data; for those cases, the MLP is considered to have at least three layers. For the purpose of this article, let's consider the input layer as a special type of layer which has no weights, and as the effective layers, that is, those enabled to be trained, we'll consider the hidden and output layers.

A hidden layer is called that because it actually hides its outputs from the external world. Hidden layers can be connected in series in any number, thus forming a deep neural network. However, the more layers a neural network has, the slower both training and running will be, and according to mathematical foundations, a neural network with one or two hidden layers at most may learn as well as deep neural networks with dozens of hidden layers. But it depends on several factors.

MLP weights

In an MLP feedforward network, one particular neuron i receives data from a neuron j of the previous layer and forwards its output to a neuron k of the next layer. The mathematical description of a neural network is recursive, of the form:

y_o = f_o( Σ_{i=1..nh_l} w_i · f_i( Σ_j w_{ij} · f_j( … ) + b_i ) )

Here, y_o is the network output (should we have multiple outputs, we can replace y_o with Y, representing a vector); f_o is the activation function of the output; l is the number of hidden layers; nh_i is the number of neurons in hidden layer i; w_i is the weight connecting the i-th neuron of the last hidden layer to the output; f_i is the activation function of neuron i; and b_i is the bias of neuron i. It can be seen that this equation gets larger as the number of layers increases. In the last summing operation, there will be the inputs x_i.

Recurrent MLP

The neurons of an MLP may feed signals not only to neurons in the next layers (feedforward network), but also to neurons in the same or previous layers (feedback or recurrent). This behavior allows the neural network to maintain state on some data sequence, and this feature is especially exploited when dealing with time series or handwriting recognition. Recurrent networks are usually harder to train, and eventually the computer may run out of memory while executing them. In addition, there are recurrent network architectures better than recurrent MLPs, such as Elman, Hopfield, Echo State, and bidirectional RNNs (recurrent neural networks). But we are not going to dive deep into these architectures.

Coding an MLP

Bringing these concepts into the OOP point of view, we can review the classes already designed so far. One can see that the neural network structure is hierarchical: a neural network is composed of layers that are composed of neurons. In the MLP architecture, there are three types of layers: input, hidden, and output. So suppose that in Java, we would like to define a neural network consisting of three inputs, one output (linear activation function), and one hidden layer (sigmoid function) containing five neurons.
The resulting code would be as follows:

int numberOfInputs = 3;
int numberOfOutputs = 1;
int[] numberOfHiddenNeurons = {5};

Linear outputAcFnc = new Linear(1.0);
Sigmoid hiddenAcFnc = new Sigmoid(1.0);
NeuralNet neuralnet = new NeuralNet(numberOfInputs, numberOfOutputs,
    numberOfHiddenNeurons, hiddenAcFnc, outputAcFnc);

To summarize, we saw how perceptrons can be applied to solve linear separation problems, their limitations in classifying nonlinear data, and how to overcome those limitations with multi-layer perceptrons (MLPs). If you enjoyed this excerpt, check out the book Neural Network Programming with Java Second Edition for a better understanding of neural networks and how they fit into different real-world projects.


Let's build applications for Wear 2.0

Packt
09 Feb 2018
13 min read
In this article, Ashok Kumar S, the author of the book Android Wear Projects, will get you started on writing Android Wear applications. You probably already know from the title that we will be building wear applications, but you can also expect a little bit of story behind every project and a comprehensive explanation of the components and structure of each application. We will be covering most of the Wear 2.0 standards and development practices in all the projects we build.

Why build wear applications?

The culture of wearing a utility that helps us perform certain actions has always been part of modern civilization. Wrist watches have become an augmented helping tool for checking the time and date: wearing a watch lets you check the time with just a glance. Technology has taken this experience to the next level. The first modern wearable watch was a combination of calculator and watch, introduced to the world in the 1970s. Decades of advancement in microprocessors and wireless technology led to the introduction of a concept called "ubiquitous computing". During this time, most of the leading electronics companies and start-ups began working on their own ideas, which made wearable devices very popular.

Going forward, we will be building the five projects listed below:

Note taking application
Fitness application
Wear Maps application
Chat messenger
Watch face

This article will also introduce you to setting up your wear application development environment, best practices for wear application development, and new user interface components, and we will also be exploring Firebase technologies for chatting and notifications in one of the projects. Publishing a wear application to the Play Store follows much the same procedure as publishing a mobile application, with a few small changes. Moving forward, the article will help you set clear expectations for the wear applications introduced here. Whether you are writing a wear app for the first time, or you have a fair bit of wear development knowledge but are struggling to get started, this article will be a helpful resource.

Note taking application

There are numerous ways to take notes. You could carry a notebook and pen in your pocket, or scribble thoughts on a piece of paper. Or, better yet, you could use your Android Wear device to take notes, so you always have a way to store thoughts even if there's no pen and paper nearby. The note taking app provides convenient access to store and retrieve notes on an Android Wear device. There are many Android smartphone note-taking apps that are popular for their simplicity and elegant functionality. Within the scope of a wear device, it is necessary to keep the design simple and glanceable.

As software developers, we need to understand how important it is to target different device sizes and types of devices; to solve this, the Android wearable support library has a component called BoxInsetLayout. Animated feedback through DelayedConfirmationView is implemented to give the user feedback on task completion.

When thinking about good wear application design, Google recommends using dark colors in wear applications for the best battery efficiency; light color schemes used in typical Material Design mobile applications are not energy efficient on wear devices.
Light colors are less energy efficient on OLED displays. Light colors need to light up the pixels with greater intensity, and white needs to drive the RGB diodes in a pixel at 100 percent, so the more white and light color in an application, the less battery efficient the application will be.

Using custom fonts: in the world of digital design, making your application's visuals easy on the user's eyes is important. The Lora font from the Google font collection is a well-balanced contemporary serif with roots in calligraphy. It is a text typeface with moderate contrast, well suited for body text. A paragraph set in Lora makes a memorable appearance because of its brushed curves in contrast with driving serifs. The overall typographic voice of Lora perfectly conveys the mood of a modern-day story or an art essay, and technically Lora is optimized for screen appearance. We will also give the application list item animations so that users enjoy using it often.

Fitness application

We are living in the realm of technology, but we are also living with intricate lifestyles that push everyone's health towards some sort of illness. Our existence traces back to the ocean; as the author puts it, we are beings who evolved from the water, and our body composition is roughly sixty percent water, much of it held in our muscles. When we talk about taking care of our health, we miss the simple things: taking care of and nurturing ourselves should start with drinking sufficient water. Adequate, regular water consumption ensures good metabolism and healthy, functional organs. The new millennium's advances in technology are an expression of how we can use technology to do the right things. Android Wear integrates numerous sensors that can help Android Wear users measure their heart rate, step count, and more. With that in mind, how about writing an application that reminds us to drink water every thirty minutes, measures our heart rate and step count, and offers a few health tips?

Material design is driving mobile application development to new heights; in this article we will learn about the Wear navigation drawer and the other Material Design components that make the application stand out. The application will track step counts through the step counter sensor, check the heart rate with an animated heartbeat projection, and remind the user with hydration alarms, which in turn encourage the user to drink water often.

Wear Maps application

In this article we will build a map application with the ability to take quick notes on the layers of the map. We humans travel to different cities, domestic and international, so how about keeping track of the places we have visited? We all use maps for different reasons, but in most cases we use them to plan a particular activity, such as outdoor tours, cycling, and similar activities. Maps augment our ability to find the fastest route from a source location to a destination. Fetching the address from latitude and longitude using the Geocoder class is comprehensively explained. A map application needs certain visual attractions, and that is carried out in this project, with the story explained comprehensively.

Chatting application

We could say that social media has advanced and wiped out many difficulties of communication.
Just a couple of decades back, the communication medium was the letter; a couple of centuries back it was trained birds, and if we look still further back we will find a few more stories about how people used to communicate in those days. Now we are in the generation of IoT, wearable smart devices, and smartphones, where communication happens across the planet in a fraction of a second. We will build a mobile and Wear application that exhibits the power of the Google Wear messaging APIs to assist us in building a chat application, with a Wear companion application to administer and respond to the messages being received. To support the chatting functionality, the article introduces Firebase technologies. We will build the Wear and mobile apps together; messages typed on the Wear device are received on the mobile and written to Firebase. The article comprehensively explains the Firebase Realtime Database and, for notifications, introduces Firebase Functions as well. The application will have a user login page, a list of users to chat with, and a chatting screen. The project is designed so that readers can add these techniques to their skill set and use the same abilities in production applications.

The Data Layer establishes the communication channel between two Android nodes, and the article talks about the process in detail; in this project the reader will learn how to find out whether the device has Google Play services and, if not, how to install them before using the application. There is a brief explanation of the Capability API and some of its best use cases in the context of a chatting application. Notifications have always been an important component of a chatting application, letting users know who texted them; in this article the reader will learn how to use Firebase Functions to send push notifications. Firebase Functions offers triggers for all the Firebase technologies, and we will explore Realtime Database triggers from Firebase Functions. The reader will also learn how to work with the input method framework and voice input on the Wear device. By the end, the reader will understand the essentials of writing a chat application with a Wear app companion.

Watch Face

A watch face, also known as the dial, is the part of a clock that displays the time through fixed numbers and moving hands. This expression of checking the time can be designed with various artistic approaches and creativity. In this article the reader will start writing their own watch face. The article comprehensively explains CanvasWatchFaceService, with all the callbacks needed to construct a digital watch face, and also talks about registering the watch face in the manifest, similar to the WallpaperService class. The watch face is written keeping in mind that it will be used on the different form factors of Wear devices. Wear 2.0 offers the watch face picker feature for setting a watch face from the list of available faces, and the reader will come to understand the elements of both analog and digital watch faces. There are some common concerns when we talk about watch faces: how the watch face gets its data if it has any complications, battery efficiency (one of the major concerns when choosing to write a watch face), how network-related operations are performed in the watch face, and which sensors the watch face uses and how often it accesses them.
Custom assets in a watch face, such as complex SVG animations and other graphical animations, raise the question of how many CPU and GPU cycles are used. Users prefer visually attractive watch faces to plain analog or digital ones, and Android Wear 2.0 allows complications to be supplied straight from the Wear 2.0 SDK, so developers do not need to write the logic for fetching the data themselves. The article also talks about interactive watch faces. Trends change all the time, and in Wear 2.0 the new interactive watch faces, which can have their own unique interactions and style expressions, are a great update; every watch face developer for Wear should start thinking about interactive watch faces. The idea is to make users like and love the watch face by giving them delightful and useful information on a timely basis, changing the user experience of the watch face, and to build data-integrated watch faces. Making a watch face is an exercise in artistic engineering: deciding what data to express in the watch face and how the time and date are displayed.

More about Wear 2.0

In this article the reader will explore the features that Wear 2.0 offers. Wear 2.0 is a prominent update with plenty of new features bundled in, including Google Assistant, stand-alone applications, new watch faces, and support for third-party complications. Google is working with partner companies to build a powerful ecosystem for Wear. Stand-alone applications are a brilliant feature that will create a lot of buzz among Wear developers and users; after all, who always wants to carry a phone, or to need a paired device to do some simple task? Stand-alone applications are a powerful part of the Wear ecosystem: how cool is it to use Wear apps without your phone nearby? There used to be various scenarios in which a Wear device depended on the phone; for example, to receive a new email notification, the watch needed to be connected to the phone for internet access. Now the Wear device can connect to Wi-Fi independently and sync all its apps for new updates, so users can complete more tasks with Wear apps without a paired phone. The article explains how to identify whether an application is stand-alone or depends on a companion app, how to install stand-alone applications from the Google Play Store, and other new Wear 2.0 changes, such as watch face complications and the watch face picker, along with a brief look at the storage mechanism of stand-alone Wear applications.

A Wear device still needs to talk to the phone in many use cases, and the article talks about advertising the capabilities of a device and retrieving the nodes capable of handling a requested capability. If the Wear app has a companion app, we need to detect the companion app on the phone or on the Wear device, and if neither app is installed, we can guide the user to the Play Store to install it. Wear 2.0 supports cloud messaging and cloud-based push notifications, and the article has a comprehensive explanation of notifications. Android Wear is evolving in every way: in Wear 1.0, switching between screens used to be tedious and confusing for users, so Google has introduced Material Design and interactive drawers, including the single-page and multi-page navigation drawers, the action drawer, and more.
Typing on the tiny one-and-a-half-inch screen is a challenging task for Wear users, so Wear 2.0 introduces the input method framework, with quick replies and swipe typing, for entering input directly on the Wear device. This article is a resourceful journey for anyone planning to take up Wear development and the Wear 2.0 standards; every project reflects a typical task that most developers are trying to accomplish.

Resources for Article:

Further resources on this subject:

Getting started with Android Development [article]
Building your first Android Wear Application [article]
The Art of Android Development Using Android Studio [article]
Implementing a simple Time Series Data Analysis in R

Amarabha Banerjee
09 Feb 2018
4 min read
[box type="note" align="" class="" width=""]This article is extracted from the book Machine Learning with R written by Brett Lantz. This book will methodically take you through stages to apply machine learning for data analysis using R.[/box] In this article, we will explore the popular time series analysis method and its practical implementation using R. Introduction When we think about time, we think about years, days, months, hours, minutes, and seconds. Think of any datasets and you will find some attributes which will be in the form of time, especially data related to stock, sales, purchase, profit, and loss. All these have time associated with them. For example, the price of stock in the stock exchange at different points on a given day or month or year. Think of any industry domain, and sales are an important factor; you can see time series in sales, discounts, customers, and so on. Other domains include but are not limited to statistics, economics and budgets, processes and quality control, finance, weather forecasting, or any kind of forecasting, transport, logistics, astronomy, patient study, census analysis, and the list goes on. In simple words, it contains data or observations in time order, spaced at equal intervals. Time series analysis means finding the meaning in the time-related data to predict what will happen next or forecast trends on the basis of observed values. There are many methods to fit the time series, smooth the random variation, and get some insights from the dataset. When you look at time series data you can see the following: Trend: Long term increase or decrease in the observations or data. Pattern: Sudden spike in sales due to christmas or some other festivals, drug consumption increases due to some condition; this type of data has a fixed time duration and can be predicted for future time also. Cycle: Can be thought of as a pattern that is not fixed; it rises and falls without any pattern. Such time series involve a great fluctuation in data. How to do There are many datasets available with R that are of the time series types. Using the command class, one can know if the dataset is time series or not. We will look into the AirPassengers dataset that shows monthly air passengers in thousands from 1949 to 1960. We will also create new time series to represent the data. Perform the following commands in RStudio or R Console: > class(AirPassengers) Output: [1] "ts" > start(AirPassengers) Output: [1] 1949 1 > end(AirPassengers) Output: [1] 1960 12 > summary(AirPassengers) Output: Min. 1st Qu. Median Mean 3rd Qu. Max. 104.0 180.0 265.5 280.3 360.5 622.0 Analyzing Time Series Data [ 89 ] In the next recipe, we will create the time series and print it out. Let's think of the share price of some company in the range of 2,500 to 4,000 from 2011 to be recorded monthly. 
Perform the following coding in R:

> my_vector = sample(2500:4000, 72, replace=T)
> my_series = ts(my_vector, start=c(2011,1), end=c(2016,12), frequency = 12)
> my_series
Output:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 2888 3894 3675 3113 3421 3870 2644 2677 3392 2847 2543 3147
2012 2973 3538 3632 2695 3475 3971 2695 2963 3217 2836 3525 2895
2013 3984 3811 2902 3602 3812 3631 2625 3887 3601 2581 3645 3324
2014 3830 2821 3794 3942 3504 3526 3932 3246 3787 2894 2800 2732
2015 3326 3659 2993 2765 3881 3983 3813 3172 2667 3517 3445 2805
2016 3668 3948 2779 2881 3285 2733 3203 3329 3854 3285 3800 2563

How it works

In the first recipe, we examined the AirPassengers dataset using the class function and saw that it is ts (ts stands for time series). The start and end functions give the starting and ending year of the dataset along with the values. The frequency tells us the interval of the observations: 1 means annual, 4 means quarterly, 12 means monthly, and so on.

In the next recipe, we wanted to generate samples between 2,500 and 4,000 to represent the price of a share. Using the sample function, we can create a sample; it takes the range as the first argument and the number of samples required as the second argument. The last argument decides whether duplication is allowed in the sample or not. We stored the sample in my_vector. We then created a time series using the ts function. The ts function takes the vector as an argument, followed by start and end to set the period for which the time series is constructed, and frequency to specify the number of observations per unit of time, which is 12 here because the data is monthly.

To summarize, we talked about how R can be utilized to perform time series analysis in different ways. If you would like to learn more useful machine learning techniques in R, be sure to check out Machine Learning with R.
Explaining Data Exploration in under a minute

Amarabha Banerjee
08 Feb 2018
5 min read
[box type="note" align="" class="" width=""]Below given article is taken from the book Machine Learning with R written by Brett Lantz. This book will help you harness the power of R for statistical computing and data science.[/box] Today, we shall explore different data exploration techniques and a real world example of using these techniques. Introduction Data Exploration is a term used for finding insightful information from data. To find insights from data various steps such as data munging, data analysis, data modeling, and model evaluation are taken. In any real data exploration project, commonly six steps are involved in the exploration process. They are as follows: Asking the right questions: Asking the right questions will help in understanding the objective and target information sought from the data. Questions can be asked such as What are my expected findings after the exploration is finished?, or What kind of information can I extract through the exploration? Data collection: Once the right questions have been asked the target of exploration is cleared. Data collected from various sources is in unorganized and diverse format. Data may come from various sources such as files, databases, internet, and so on. Data collected in this way is raw data and needs to be processed to extract meaningful information. Most of the analysis and visualizing tools or applications expect data to be in a certain format to generate results and hence the raw data is of no use for them. Data munging: Raw data collected needs to be converted into the desired format of the tools to be used. In this phase, raw data is passed through various processes such as parsing the data, sorting, merging, filtering, dealing with missing values, and so on. The main aim is to transform raw data in the format that the analyzing and visualizing tools understand. Once the data is compatible with the tools, analysis and visualizing tools are used to generate the different results. Basic exploratory data analysis: Once the data munging is done and data is formating for the tools, it can be used to perform data exploration and analysis. Tools provide various methods and techniques to do the same. Most analyzing tools allow statistical functions to be performed on the data. Visualizing tools help in visualizing the data in different ways. Using basic statistical operations and visualizing the same data can be understood in better way. Advanced exploratory data analysis: Once the basic analysis is done it's time to look at an advanced stage of analysis. In this stage, various prediction models are formed on basis of requirement. Machine learning algorithms are utilized to train the model and generate the inferences. Various tuning on the model is also done to ensure correctness and effectiveness of the model. Model assessment: When the models are mare, they are evaluated to find the best model from the given different models. The major factor to decide the best model is to see how perfect or closely it can predict the values. Models are tuned here also for increasing the accuracy and effectiveness. Various plots and graphs are used to see the model’s prediction. Real world example - using Air Quality Dataset Air quality datasets come bundled with R. They contain data about the New York Air Quality Measurements of 1973 for five months from May to September recorded daily. To view all the available datasets use the data() function, it will display all the datasets available with R installation. 
How to do it

Perform the following steps to see all the datasets in R and to use airquality:

> data()
> str(airquality)
Output
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
> head(airquality)
Output
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6

How it works

The str command is used to display the structure of the dataset; as you can see, it contains observations of the Ozone, Solar.R, Wind, and Temp attributes recorded each day for five months. Using the head function, you can see the first few lines of the actual data. The dataset is very basic, but it is enough to start processing and analyzing data at an introductory level.

Another useful source of data is the Kaggle website, which hosts many diverse kinds of datasets. Apart from datasets, it also holds many data science competitions for solving real-world problems. You can find the competitions, datasets, kernels, and jobs at https://www.kaggle.com/. Many competitions are organized by large corporate bodies, government agencies, or academia, and many of them have prize money associated with them. You can simply create an account and start participating in competitions by submitting code and output, which will then be assessed; the assessment or evaluation criteria are available on the detail page of each competition. By participating on https://www.kaggle.com/, one gains experience in solving real-world problems and gets a taste of what data scientists do. On the jobs page, various jobs for data scientists and analysts are listed, and you can apply if a profile matches your interests.

If you liked our post, be sure to check out Machine Learning with R, which contains more useful machine learning techniques with R.
Building a Linear Regression Model in Python for developers

Pravin Dhandre
07 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rodolfo Bonnin titled Machine Learning for Developers. This book is a systematic guide for developers to implement various Machine Learning techniques and develop efficient and intelligent applications.[/box] Let’s start using one of the most well-known toy datasets, explore it, and select one of the dimensions to learn how to build a linear regression model for its values. Let's start by importing all the libraries (scikit-learn, seaborn, and matplotlib); one of the excellent features of Seaborn is its ability to define very professional-looking style settings. In this case, we will use the whitegrid style: import numpy as np from sklearn import datasets import seaborn.apionly as sns %matplotlib inline import matplotlib.pyplot as plt sns.set(style='whitegrid', context='notebook') The Iris Dataset It’s time to load the Iris dataset. This is one of the most well-known historical datasets. You will find it in many books and publications. Given the good properties of the data, it is useful for classification and regression examples. The Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) contains 50 records for each of the three types of iris, 150 lines in a total over five fields. Each line is a measurement of the following: Sepal length in cm Sepal width in cm Petal length in cm Petal width in cm The final field is the type of flower (setosa, versicolor, or virginica). Let’s use the load_dataset method to create a matrix of values from the dataset: iris2 = sns.load_dataset('iris') In order to understand the dependencies between variables, we will implement the covariance operation. It will receive two arrays as parameters and will return the covariance(x,y) value: def covariance (X, Y): xhat=np.mean(X) yhat=np.mean(Y) epsilon=0 for x,y in zip (X,Y): epsilon=epsilon+(x-xhat)*(y-yhat) return epsilon/(len(X)-1) Let's try the implemented function and compare it with the NumPy function. Note that we calculated cov (a,b), and NumPy generated a matrix of all the combinations cov(a,a), cov(a,b), so our result should be equal to the values (1,0) and (0,1) of that matrix: print (covariance ([1,3,4], [1,0,2])) print (np.cov([1,3,4], [1,0,2])) 0.5 [[ 2.33333333   0.5              ] [ 0.5                   1.                ]] Having done a minimal amount of testing of the correlation function as defined earlier, receive two arrays, such as covariance, and use them to get the final value: def correlation (X, Y): return (covariance(X,Y)/(np.std(X,    ddof=1)*np.std(Y,   ddof=1))) ##We have to indicate ddof=1 the unbiased std Let’s test this function with two sample arrays, and compare this with the (0,1) and (1,0) values of the correlation matrix from NumPy: print (correlation ([1,1,4,3], [1,0,2,2])) print (np.corrcoef ([1,1,4,3], [1,0,2,2])) 0.870388279778 [[ 1.                     0.87038828] [ 0.87038828   1.                ]] Getting an intuitive idea with Seaborn pairplot A very good idea when starting worke on a problem is to get a graphical representation of all the possible variable combinations. Seaborn’s pairplot function provides a complete graphical summary of all the variable pairs, represented as scatterplots, and a representation of the univariate distribution for the matrix diagonal. 
Let's look at how this plot type shows all the variable dependencies, and try to look for a linear relationship as a base to test our regression methods:

sns.pairplot(iris2, size=3.0)
<seaborn.axisgrid.PairGrid at 0x7f8a2a30e828>

Pairplot of all the variables in the dataset.

Let's select two variables that, from our initial analysis, have the property of being linearly dependent. They are petal_width and petal_length:

X=iris2['petal_width']
Y=iris2['petal_length']

Let's now take a look at this variable combination, which shows a clear linear tendency:

plt.scatter(X,Y)

This is the representation of the chosen variables in a scatter-type graph; it is the current distribution of data that we will try to model with our linear prediction function.

Creating the prediction function

First, let's define the function that will abstractly represent the modeled data, in the form of a linear function of the form y=beta*x+alpha:

def predict(alpha, beta, x_i):
    return beta * x_i + alpha

Defining the error function

It's now time to define the function that will show us the difference between predictions and the expected output during training. We have two main alternatives: measuring the absolute difference between the values (or L1), or measuring a variant of the square of the difference (or L2). Let's define both versions, including the first formulation inside the second:

def error(alpha, beta, x_i, y_i): #L1
    return y_i - predict(alpha, beta, x_i)

def sum_sq_e(alpha, beta, x, y): #L2
    return sum(error(alpha, beta, x_i, y_i) ** 2 for x_i, y_i in zip(x, y))

Correlation fit

Now, we will define a function implementing the correlation method to find the parameters for our regression:

def correlation_fit(x, y):
    beta = correlation(x, y) * np.std(y, ddof=1) / np.std(x, ddof=1)
    alpha = np.mean(y) - beta * np.mean(x)
    return alpha, beta

Let's then run the fitting function and print the guessed parameters:

alpha, beta = correlation_fit(X, Y)
print(alpha)
print(beta)
1.08355803285
2.22994049512

Let's now graph the regressed line with the data in order to intuitively show the appropriateness of the solution:

plt.scatter(X,Y)
xr=np.arange(0,3.5)
plt.plot(xr,(xr*beta)+alpha)

This is the final plot we will get with our recently calculated slope and intercept:

Final regressed line

Polynomial regression and an introduction to underfitting and overfitting

When looking for a model, one of the main characteristics we look for is the power of generalizing with a simple functional expression. When we increase the complexity of the model, it's possible that we are building a model that is good for the training data but too optimized for that particular subset of data. Underfitting, on the other hand, applies to situations where the model is too simple, such as this case, which can be represented fairly well with a simple linear model. In the following example, we will work on the same problem as before, using the scikit-learn library to search for higher-order polynomials to fit the incoming data with increasingly complex degrees.
Going beyond a quadratic function, we will see how the higher-degree polynomials try to fit every wrinkle in the data, but when we extrapolate, the values outside the observed range are clearly out of range:

from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

ix=iris2['petal_width']
iy=iris2['petal_length']

# generate points used to represent the fitted function
x_plot = np.linspace(0, 2.6, 100)

# create matrix versions of these arrays
X = ix[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

plt.scatter(ix, iy, s=30, marker='o', label="training points")

for count, degree in enumerate([3, 6, 20]):
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X, iy)
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, label="degree %d" % degree)

plt.legend(loc='upper left')
plt.show()

The combined graph shows how the different polynomials' coefficients describe the data population in different ways. The degree-20 polynomial clearly adjusts perfectly to the training dataset, but beyond the known values it diverges almost spectacularly, going against the goal of generalizing for future data.

Curve fitting of the initial dataset, with polynomials of increasing degrees

With this, we have successfully explored how to develop an efficient linear regression model in Python and how to make predictions using the designed model. We've reviewed ways to identify and optimize the correlation between the prediction and the expected output using simple and well-defined functions. If you enjoyed our post, do check out Machine Learning for Developers to get advanced tools for building machine learning applications at your fingertips.
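As a closing cross-check, and only as a sketch that assumes the iris2 DataFrame loaded earlier is still in scope, scikit-learn's own LinearRegression estimator should recover essentially the same intercept and slope as the correlation-based fit computed above (roughly 1.08 and 2.23):

# Sketch: verify the correlation-based fit against scikit-learn's LinearRegression.
# Assumes the iris2 DataFrame from sns.load_dataset('iris') is still in scope.
import numpy as np
from sklearn.linear_model import LinearRegression

X = iris2['petal_width'].values.reshape(-1, 1)  # a single feature must be passed as a 2-D array
y = iris2['petal_length'].values

model = LinearRegression()
model.fit(X, y)

# These values should closely match the alpha and beta printed earlier
print(model.intercept_, model.coef_[0])

For simple one-variable regression, the correlation-based formula and ordinary least squares are the same fit, so the two sets of parameters should agree to within rounding.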
How to Create a Neural Network in TensorFlow

Aaron Lazar
06 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article has been extracted from the book Principles of Data Science authored by Sinan Ozdemir. With a unique approach that bridges the gap between mathematics and computer science, the books takes you through the entire data science pipeline. Beginning with cleaning and preparing data, and effective data mining strategies and techniques to help you get to grips with machine learning.[/box] In this article, we’re going to learn how to create a neural network whose goal will be to classify images. Tensorflow is an open-source machine learning module that is used primarily for its simplified deep learning and neural network abilities. I would like to take some time to introduce the module and solve a few quick problems using tensorflow. Let’s begin with some imports: from sklearn import datasets, metrics import tensorflow as tf import numpy as np from sklearn.cross_validation import train_test_split %matplotlib inline Loading our iris dataset: # Our data set of iris flowers iris = datasets.load_iris() # Load datasets and split them for training and testing X_train, X_test, y_train, y_test = train_test_split(iris.data, iris. target) Creating the Neural Network: # Specify that all features have real-value datafeature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)] optimizer = tf.train.GradientDescentOptimizer(learning_rate=.1) # Build 3 layer DNN with 10, 20, 10 units respectively. classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns, hidden_units=[10, 20, 10], optimizer=optimizer, n_classes=3) # Fit model. classifier.fit(x=X_train, y=y_train, steps=2000) Notice that our code really hasn't changed from the last segment. We still have our feature_columns from before, but now we introduce, instead of a linear classifier, a DNNClassifier, which stands for Deep Neural Network Classifier. This is TensorFlow's syntax for implementing a neural network. Let's take a closer look: tf.contrib.learn.DNNClassifier(feature_columns=feature_columns, hidden_units=[10, 20, 10], optimizer=optimizer, n_classes=3) We see that we are inputting the same feature_columns, n_classes, and optimizer, but see how we have a new parameter called hidden_units? This list represents the number of nodes to have in each layer between the input and the output layer. All in all, this neural network will have five layers: The first layer will have four nodes, one for each of the iris feature variables. This layer is the input layer. A hidden layer of 10 nodes. A hidden layer of 20 nodes. A hidden layer of 10 nodes. The final layer will have three nodes, one for each possible outcome of the network. This is called our output layer. Now that we've trained our model, let's evaluate it on our test set: # Evaluate accuracy. accuracy_score = classifier.evaluate(x=X_test, y=y_test)["accuracy"] print('Accuracy: {0:f}'.format(accuracy_score)) Accuracy: 0.921053 Hmm, our neural network didn't do so well on this dataset, but perhaps it is because the network is a bit too complicated for such a simple dataset. Let's introduce a new dataset that has a bit more to it… The MNIST dataset consists of over 50,000 handwritten digits (0-9) and the goal is to recognize the handwritten digits and output which letter they are writing. Tensorflow has a built-in mechanism for downloading and loading these images. 
from tensorflow.examples.tutorials.mnist import input_data mnist = input_data.read_data_sets("MNIST_data/", one_hot=False) Extracting MNIST_data/train-images-idx3-ubyte.gz Extracting MNIST_data/train-labels-idx1-ubyte.gz Extracting MNIST_data/t10k-images-idx3-ubyte.gz Extracting MNIST_data/t10k-labels-idx1-ubyte.gz Notice that one of our inputs for downloading mnist is called one_hot. This parameter either brings in the dataset's target variable (which is the digit itself) as a single number or has a dummy variable. For example, if the first digit were a 7, the target would either be: 7: If one_hot was false 0 0 0 0 0 0 0 1 0 0: If one_hot was true (notice that starting from 0, the seventh index is a 1) We will encode our target the former way, as this is what our tensorflow neural network and our sklearn logistic regression will expect. The dataset is split up already into a training and test set, so let's create new variables to hold them: x_mnist = mnist.train.images y_mnist = mnist.train.labels.astype(int) For the y_mnist variable, I specifically cast every target as an integer (by default they come in as floats) because otherwise tensorflow would throw an error at us. Out of curiosity, let's take a look at a single image: import matplotlib.pyplot as plt plt.imshow(x_mnist[10].reshape(28, 28)) And hopefully our target variable matches at the 10th index as well: y_mnist[10] 0 Excellent! Let's now take a peek at how big our dataset is: x_mnist.shape (55000, 784) y_mnist.shape (55000,) Our training size then is 55000 images and target variables. Let's fit a deep neural network to our images and see if it will be able to pick up on the patterns in our inputs: # Specify that all features have real-value data feature_columns = [tf.contrib.layers.real_valued_column("", dimension=784)] optimizer = tf.train.GradientDescentOptimizer(learning_rate=.1) # Build 3 layer DNN with 10, 20, 10 units respectively. classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,     hidden_units=[10, 20, 10],   optimizer=optimizer, n_classes=10) # Fit model. classifier.fit(x=x_mnist,       y=y_mnist,       steps=1000) # Warning this is veryyyyyyyy slow This code is very similar to our previous segment using DNNClassifier; however, look how in our first line of code, I have changed the number of columns to be 784 while in the classifier itself, I changed the number of output classes to be 10. These are manual inputs that tensorflow must be given to work. The preceding code runs very slowly. It is little by little adjusting itself in order to get the best possible performance from our training set. Of course, we know that the ultimate test here is testing our network on an unknown test set, which is also given to us from tensorflow: x_mnist_test = mnist.test.images y_mnist_test = mnist.test.labels.astype(int) x_mnist_test.shape (10000, 784) y_mnist_test.shape (10000,) So we have 10,000 images to test on; let's see how our network was able to adapt to the dataset: # Evaluate accuracy. accuracy_score = classifier.evaluate(x=x_mnist_test, y=y_mnist_test)["accuracy"] print('Accuracy: {0:f}'.format(accuracy_score)) Accuracy: 0.920600 Not bad, 92% accuracy on our dataset. Let's take a second and compare this performance to a standard sklearn logistic regression now: logreg = LogisticRegression() logreg.fit(x_mnist, y_mnist) # Warning this is slow y_predicted = logreg.predict(x_mnist_test) from sklearn.metrics import accuracy_score # predict on our test set, to avoid overfitting! 
accuracy = accuracy_score(y_predicted, y_mnist_test) # get our accuracy score Accuracy 0.91969 Success! Our neural network performed better than the standard logistic regression. This is likely because the network is attempting to find relationships between the pixels themselves and using these relationships to map them to what digit we are writing down. In logistic regression, the model assumes that every single input is independent of one another, and therefore has a tough time finding relationships between them. There are ways of making our neural network learn differently: We could make our network wider, that is, increase the number of nodes in the hidden layers instead of having several layers of a smaller number of nodes: # A wider network feature_columns = [tf.contrib.layers.real_valued_column("", dimension=784)] optimizer = tf.train.GradientDescentOptimizer(learning_rate=.1) # Build 3 layer DNN with 10, 20, 10 units respectively. classifier = tf.contrib.learn.DNNClassifier(feature_ columns=feature_columns,      hidden_units=[1500],       optimizer=optimizer,    n_classes=10) # Fit model. classifier.fit(x=x_mnist,       y=y_mnist,       steps=100) # Warning this is veryyyyyyyy slow # Evaluate accuracy. accuracy_score = classifier.evaluate(x=x_mnist_test,    y=y_mnist_test)["accuracy"] print('Accuracy: {0:f}'.format(accuracy_score)) Accuracy: 0.898400 We could increase our learning rate, forcing the network to attempt to converge into an answer faster. As mentioned before, we run the risk of the model skipping the answer entirely if we go down this route. It is usually better to stick with a smaller learning rate. We can change the method of optimization. Gradient descent is very popular; however, there are other algorithms for doing so. One example is called the Adam Optimizer. The difference is in the way they traverse the error function, and therefore the way that they approach the optimization point. Different problems in different domains call for different optimizers. There is no replacement for a good old fashioned feature selection phase instead of attempting to let the network figure everything out for us. We can take the time to find relevant and meaningful features that actually will allow our network to find an answer quicker! There you go! You’ve now learned how to build a neural net in Tensorflow! If you liked this tutorial and would like to learn more, head over and grab the copy Principles of Data Science. If you want to take things a bit further and learn how to classify Irises using multi-layer perceptrons, head over here.    
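To make the optimizer suggestion above concrete, here is a minimal sketch, not part of the original extract, that swaps gradient descent for the Adam optimizer on the same MNIST network. It assumes the TF 1.x tf.contrib.learn API used throughout this extract and the x_mnist, y_mnist, x_mnist_test, and y_mnist_test variables defined earlier; the learning rate of 0.001 is just a typical starting value, not a tuned choice.

# Sketch only: the same MNIST DNNClassifier as above, trained with the Adam optimizer.
# Assumes x_mnist, y_mnist, x_mnist_test and y_mnist_test from the preceding code.
import tensorflow as tf

feature_columns = [tf.contrib.layers.real_valued_column("", dimension=784)]

# Adam adapts the step size per parameter, so a smaller base learning rate is typical
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)

classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            optimizer=optimizer,
                                            n_classes=10)

# Fit and evaluate exactly as before
classifier.fit(x=x_mnist, y=y_mnist, steps=1000)
accuracy_score = classifier.evaluate(x=x_mnist_test, y=y_mnist_test)["accuracy"]
print('Accuracy: {0:f}'.format(accuracy_score))

Whether this improves on plain gradient descent depends on the learning rate and the number of steps, which is exactly the kind of experiment the list above encourages.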
Comparing .NET products

Mark Price
06 Feb 2018
6 min read
This is an extract from the third edition of C# 7.1 and .NET Core 2.0 - Modern Cross-Platform Development by Mark Price.

Understanding the .NET Framework

Microsoft's .NET Framework is a development platform that includes a Common Language Runtime (CLR) that manages the execution of code, and provides a rich library of classes to build applications. Microsoft designed .NET Framework to have the possibility of being cross-platform, but Microsoft put their implementation effort into making it work best with Windows. Practically speaking, .NET Framework is Windows-only, and a legacy platform.

What are Mono and Xamarin?

Third parties developed a .NET implementation named the Mono project that you can read more about here. Mono is cross-platform, but it fell well behind the official implementation of .NET Framework. It has found a niche as the foundation of the Xamarin mobile platform. Microsoft purchased Xamarin in 2016 and now gives away what used to be an expensive Xamarin extension for free with Visual Studio 2017. Microsoft renamed the Xamarin Studio development tool to Visual Studio for Mac, and has given it the ability to create ASP.NET Core Web API services. Xamarin is targeted at mobile development and building cloud services to support mobile apps.

What is .NET Core?

Today, we live in a truly cross-platform world. Modern mobile and cloud development has made Windows a much less important operating system. So, Microsoft has been working on an effort to decouple .NET from its close ties with Windows. While rewriting .NET to be truly cross-platform, Microsoft has taken the opportunity to refactor .NET to remove major parts that are no longer considered core. This new product is branded as .NET Core, which includes a cross-platform implementation of the CLR known as CoreCLR, and a streamlined library of classes known as CoreFX. Scott Hunter, Microsoft Partner Director Program Manager for .NET, says, "Forty percent of our .NET Core customers are brand-new developers to the platform, which is what we want with .NET Core. We want to bring new people in."

The following table shows when important versions of .NET Core were released, and Microsoft's schedule for the next major release:

Version | Released
.NET Core RC1 | November 2015
.NET Core 1.0 | June 2016
.NET Core 1.1 | November 2016
.NET Core 1.0.4 and .NET Core 1.1.1 | March 2017
.NET Core 2.0 | August 2017
.NET Core for UWP in Windows 10 Fall Creators Update | October 2017
.NET Core 2.1 | Q1 2018

.NET Core is much smaller than the current version of .NET Framework because a lot has been removed. For example, Windows Forms and Windows Presentation Foundation (WPF) can be used to build graphical user interface (GUI) applications, but they are tightly bound to Windows, so they have been removed from .NET Core. The latest technology used to build Windows apps is the Universal Windows Platform (UWP), and UWP is built on a custom version of .NET Core. ASP.NET Web Forms and Windows Communication Foundation (WCF) are old web application and service technologies that fewer developers choose to use for new development projects today, so they have also been removed from .NET Core. Instead, developers prefer to use ASP.NET MVC and ASP.NET Web API. These two technologies have been refactored and combined into a new product that runs on .NET Core, named ASP.NET Core. The Entity Framework (EF) 6 is an object-relational mapping technology to work with data stored in relational databases such as Oracle and Microsoft SQL Server.
It has gained baggage over the years, so the cross-platform version has been slimmed down and named Entity Framework Core. In addition to removing large pieces from .NET Framework to make .NET Core, Microsoft has componentized .NET Core into NuGet packages: small chunks of functionality that can be deployed independently. Microsoft's primary goal is not to make .NET Core smaller than .NET Framework. The goal is to componentize .NET Core to support modern technologies and to have fewer dependencies, so that deployment requires only those packages that your application needs.

What is .NET Standard?

The situation with .NET today is that there are three forked .NET platforms, all controlled by Microsoft:

.NET Framework
.NET Core
Xamarin

Each has different strengths and weaknesses because they are designed for different scenarios. This has led to the problem that a developer must learn three platforms, each with annoying quirks and limitations. So, Microsoft defined .NET Standard 2.0: a specification for a set of APIs that all .NET platforms must implement. You cannot install .NET Standard 2.0, in the same way that you cannot install HTML5. To use HTML5, you must install a web browser that implements the HTML5 specification. To use .NET Standard 2.0, you must install a .NET platform that implements the .NET Standard 2.0 specification. .NET Standard 2.0 is implemented by the latest versions of .NET Framework, .NET Core, and Xamarin. .NET Standard 2.0 makes it much easier for developers to share code between any flavor of .NET. For .NET Core 2.0, this adds many of the missing APIs that developers need to port old code written for .NET Framework to the cross-platform .NET Core. However, some APIs are implemented but throw an exception to indicate to the developer that they should not actually be used! This is usually due to differences in the operating system on which you run .NET Core.

.NET Native

Another .NET initiative is .NET Native. This compiles C# code to native CPU instructions ahead-of-time (AoT), rather than using the CLR to compile intermediate language (IL) code just-in-time (JIT) to native code later. .NET Native improves execution speed and reduces the memory footprint of applications. It supports the following:

UWP apps for Windows 10, Windows 10 Mobile, Xbox One, HoloLens, and Internet of Things (IoT) devices such as Raspberry Pi
Server-side web development with ASP.NET Core
Console applications for use on the command line

Comparing different .NET tools

Technology | Feature set | Compiles to | Host OSes
.NET Framework | Both legacy and modern | IL code | Windows only
Xamarin | Mobile only | IL code | iOS, Android, Windows Mobile
.NET Core | Modern only | IL code | Windows, macOS, Linux
.NET Native | Modern only | Native code | Windows, macOS, Linux

Thanks for reading this extract from C# 7.1 and .NET Core 2.0 - Modern Cross Platform Development! If you want to learn more about .NET, dive into Mark Price's book, or explore more .NET resources here.
Implementing fault-tolerance in Spark Streaming data processing applications with Apache Kafka

Pravin Dhandre
01 Feb 2018
16 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Rajanarayanan Thottuvaikkatumana titled Apache Spark 2 for Beginners. This book is a developer’s guide for developing large-scale and distributed data processing applications in their business environment. [/box] Data processing is generally carried in two ways, either in batch or stream processing. This article will help you learn how to start processing your data uninterruptedly and build fault-tolerance as and when the data gets generated in real-time Message queueing systems with publish-subscribe capability are generally used for processing messages. The traditional message queueing systems failed to perform because of the huge volume of messages to be processed per second for the needs of large-scale data processing applications. Kafka is a publish-subscribe messaging system used by many IoT applications to process a huge number of messages. The following capabilities of Kafka made it one of the most widely used messaging systems: Extremely fast: Kafka can process huge amounts of data by handling reading and writing in short intervals of time from many application clients Highly scalable: Kafka is designed to scale up and scale out to form a cluster using commodity hardware Persists a huge number of messages: Messages reaching Kafka topics are persisted into the secondary storage, while at the same time it is handling huge number of messages flowing through The following are some of the important elements of Kafka, and are terms to be understood before proceeding further: Producer: The real source of the messages, such as weather sensors or mobile phone network Broker: The Kafka cluster, which receives and persists the messages published to its topics by various producers Consumer: The data processing applications subscribed to the Kafka topics that consume the messages published to the topics The same log event processing application use case discussed in the preceding section is used again here to elucidate the usage of Kafka with Spark Streaming. Instead of collecting the log event messages from the TCP socket, here the Spark Streaming data processing application will act as a consumer of a Kafka topic and the messages published to the topic will be consumed. The Spark Streaming data processing application uses the version 0.8.2.2 of Kafka as the message broker, and the assumption is that the reader has already installed Kafka, at least in a standalone mode. The following activities are to be performed to make sure that Kafka is ready to process the messages produced by the producers and that the Spark Streaming data processing application can consume those messages: Start the Zookeeper that comes with Kafka installation. Start the Kafka server. Create a topic for the producers to send the messages to. Pick up one Kafka producer and start publishing log event messages to the newly created topic. Use the Spark Streaming data processing application to process the log eventspublished to the newly created topic. 
Starting Zookeeper and Kafka

The following scripts are run from separate terminal windows in order to start Zookeeper and the Kafka broker, and to create the required Kafka topics:

$ cd $KAFKA_HOME
$ $KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties
[2016-07-24 09:01:30,196] INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory)
$ $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties
[2016-07-24 09:05:06,381] INFO 0 successfully elected as leader (kafka.server.ZookeeperLeaderElector)
[2016-07-24 09:05:06,455] INFO [Kafka Server 0], started (kafka.server.KafkaServer)
$ $KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic sfb
Created topic "sfb".
$ $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic sfb

The Kafka message producer can be any application capable of publishing messages to Kafka topics. Here, the kafka-console-producer that comes with Kafka is used as the producer of choice. Once the producer starts running, whatever is typed into its console window will be treated as a message that is published to the chosen Kafka topic. The Kafka topic is given as a command line argument when starting the kafka-console-producer.

The submission of the Spark Streaming data processing application that consumes the log event messages produced by the Kafka producer is slightly different from the application covered in the preceding section. Here, several Kafka jar files are required for the data processing. Since they are not part of the Spark infrastructure, they have to be submitted to the Spark cluster. The following jar files are required for the successful running of this application:

$KAFKA_HOME/libs/kafka-clients-0.8.2.2.jar
$KAFKA_HOME/libs/kafka_2.11-0.8.2.2.jar
$KAFKA_HOME/libs/metrics-core-2.2.0.jar
$KAFKA_HOME/libs/zkclient-0.3.jar
Code/Scala/lib/spark-streaming-kafka-0-8_2.11-2.0.0-preview.jar
Code/Python/lib/spark-streaming-kafka-0-8_2.11-2.0.0-preview.jar

In the preceding list of jar files, the Maven repository coordinate for spark-streaming-kafka-0-8_2.11-2.0.0-preview.jar is "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0-preview". This particular jar file has to be downloaded and placed in the lib folder of the directory structure given in Figure 4. It is used in the submit.sh and submitPy.sh scripts, which submit the application to the Spark cluster. The download URL for this jar file is given in the reference section of this chapter. In the submit.sh and submitPy.sh files, the last few lines contain a conditional statement looking for a second parameter value of 1 to identify this application and ship the required jar files to the Spark cluster.

Implementing the application in Scala

The following code snippet is the Scala code for the log event processing application that processes the messages produced by the Kafka producer. The use case of this application is the same as the one discussed in the preceding section concerning windowing operations:

/**
The following program can be compiled and run using SBT
Wrapper scripts have been provided with this
The following script can be run to compile the code
./compile.sh
The following script can be used to run this application in Spark. The second command line argument of value 1 is very important.
This is to flag the shipping of the kafka jar files to the Spark cluster ./submit.sh com.packtpub.sfb.KafkaStreamingApps 1 **/ package com.packtpub.sfb import java.util.HashMap import org.apache.spark.streaming._ import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.streaming.kafka._ import org.apache.kafka.clients.producer.{ProducerConfig, KafkaProducer, ProducerRecord} object KafkaStreamingApps { def main(args: Array[String]) { // Log level settings LogSettings.setLogLevels() // Variables used for creating the Kafka stream //The quorum of Zookeeper hosts val zooKeeperQuorum = "localhost" // Message group name val messageGroup = "sfb-consumer-group" //Kafka topics list separated by coma if there are multiple topics to be listened on val topics = "sfb" //Number of threads per topic val numThreads = 1 // Create the Spark Session and the spark context val spark = SparkSession .builder .appName(getClass.getSimpleName) .getOrCreate() // Get the Spark context from the Spark session for creating the streaming context val sc = spark.sparkContext // Create the streaming context val ssc = new StreamingContext(sc, Seconds(10)) // Set the check point directory for saving the data to recover when there is a crash ssc.checkpoint("/tmp") // Create the map of topic names val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap // Create the Kafka stream val appLogLines = KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup, topicMap).map(_._2) // Count each log messge line containing the word ERROR val errorLines = appLogLines.filter(line => line.contains("ERROR")) // Print the line containing the error errorLines.print() // Count the number of messages by the windows and print them errorLines.countByWindow(Seconds(30), Seconds(10)).print() // Start the streaming ssc.start() // Wait till the application is terminated ssc.awaitTermination() } } Compared to the Scala code in the preceding section, the major difference is in the way the stream is created. Implementing the application in Python The following code snippet is the Python code for the log event processing application that processes the message produced by the Kafka producer. 
The use case of this application is also the same as the one discussed in the preceding section concerning windowing operations: # The following script can be used to run this application in Spark # ./submitPy.sh KafkaStreamingApps.py 1 from __future__ import print_function import sys from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils if __name__ == "__main__": # Create the Spark context sc = SparkContext(appName="PythonStreamingApp") # Necessary log4j logging level settings are done log4j = sc._jvm.org.apache.log4j log4j.LogManager.getRootLogger().setLevel(log4j.Level.WARN) # Create the Spark Streaming Context with 10 seconds batch interval ssc = StreamingContext(sc, 10) # Set the check point directory for saving the data to recover when there is a crash ssc.checkpoint("tmp") # The quorum of Zookeeper hosts zooKeeperQuorum="localhost" # Message group name messageGroup="sfb-consumer-group" # Kafka topics list separated by coma if there are multiple topics to be listened on topics = "sfb" # Number of threads per topic numThreads = 1 # Create a Kafka DStream kafkaStream = KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup, {topics: numThreads}) # Create the Kafka stream appLogLines = kafkaStream.map(lambda x: x[1]) # Count each log messge line containing the word ERROR errorLines = appLogLines.filter(lambda appLogLine: "ERROR" in appLogLine) # Print the first ten elements of each RDD generated in this DStream to the console errorLines.pprint() errorLines.countByWindow(30,10).pprint() # Start the streaming ssc.start() # Wait till the application is terminated ssc.awaitTermination() The following commands are run on the terminal window to run the Scala application: $ cd Scala $ ./submit.sh com.packtpub.sfb.KafkaStreamingApps 1 The following commands are run on the terminal window to run the Python application: $ cd Python $ ./submitPy.sh KafkaStreamingApps.py 1 When both of the preceding programs are running, whatever log event messages are typed into the console window of the Kafka console producer, and invoked using the following command and inputs, will be processed by the application. The outputs of this program will be very similar to the ones that are given in the preceding section: $ $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 -- topic sfb [Fri Dec 20 01:46:23 2015] [ERROR] [client 1.2.3.4.5.6] Directory index forbidden by rule: /home/raj/ [Fri Dec 20 01:46:23 2015] [WARN] [client 1.2.3.4.5.6] Directory index forbidden by rule: /home/raj/ [Fri Dec 20 01:54:34 2015] [ERROR] [client 1.2.3.4.5.6] Directory index forbidden by rule: /apache/web/test Spark provides two approaches to process Kafka streams. The first one is the receiver-based approach that was discussed previously and the second one is the direct approach. This direct approach to processing Kafka messages is a simplified method in which Spark Streaming is using all the possible capabilities of Kafka just like any of the Kafka topic consumers, and polls for the messages in the specific topic, and the partition by the offset number of the messages. Depending on the batch interval of the Spark Streaming data processing application, it picks up a certain number of offsets from the Kafka cluster, and this range of offsets is processed as a batch. This is highly efficient and ideal for processing messages with a requirement to have exactly-once processing. 
The direct method also reduces the need for the Spark Streaming library to do additional work to implement exactly-once semantics for message processing, and delegates that responsibility to Kafka. The programming constructs of this approach are slightly different in the APIs used for the data processing. Consult the appropriate reference material for the details.

The preceding sections introduced the Spark Streaming library and discussed some real-world use cases. From a deployment perspective, there is a big difference between Spark data processing applications developed to process static batch data and those developed to process dynamic stream data. The availability of an application that processes a stream of data must be constant. In other words, such applications should not have components that are single points of failure. The following section discusses this topic.

Spark Streaming jobs in production

When a Spark Streaming application is processing the incoming data, it is very important to have uninterrupted data processing capability so that all the data that is being ingested is processed. In business-critical streaming applications, missing even one piece of data can have a huge business impact. To deal with such situations, it is important to avoid single points of failure in the application infrastructure.

From a Spark Streaming application perspective, it is good to understand how the underlying components in the ecosystem are laid out so that the appropriate measures can be taken to avoid single points of failure.

A Spark Streaming application deployed in a cluster such as Hadoop YARN, Mesos, or Spark Standalone mode has two main components, very similar to any other type of Spark application:

Spark driver: This contains the application code written by the user
Executors: The executors that execute the jobs submitted by the Spark driver

The executors, however, have an additional component called a receiver that receives the data being ingested as a stream and saves it as blocks of data in memory. When one receiver is receiving the data and forming the data blocks, the blocks are replicated to another executor for fault tolerance. In other words, in-memory replication of the data blocks is done onto a different executor. At the end of every batch interval, these data blocks are combined to form a DStream and sent out for further processing downstream.

Figure 1 depicts the components working together in a Spark Streaming application infrastructure deployed in a cluster. In Figure 1, there are two executors. The receiver component is deliberately not displayed in the second executor to show that it is not using the receiver and instead just collects the replicated data blocks from the other executor. But when needed, such as on the failure of the first executor, the receiver in the second executor can start functioning.

Implementing fault-tolerance in Spark Streaming data processing applications

The infrastructure of a Spark Streaming data processing application has many moving parts. Failures can happen to any one of them, resulting in the interruption of data processing. Typically, failures affect the Spark driver or the executors.

When an executor fails, since the replication of data happens on a regular basis, the task of receiving the data stream is taken over by the executor on which the data was being replicated. There is, however, a situation in which, when an executor fails, all the data that is still unprocessed on it will be lost.
To circumvent this problem, there is a way to persist the data blocks into HDFS or Amazon S3 in the form of write-ahead logs.

When the Spark driver fails, the driver program is stopped, all the executors lose their connection, and they stop functioning. This is the most dangerous situation. To deal with it, some configuration and code changes are necessary. The Spark driver has to be configured for automatic driver restart, which is supported by the cluster managers. This includes a change in the Spark job submission method to use the cluster deploy mode in whichever cluster manager is used. When a restart of the driver happens, to start from the place where it crashed, a checkpointing mechanism has to be implemented in the driver program. This has already been done in the code samples used here. The following lines of code do that job:

ssc = StreamingContext(sc, 10)
ssc.checkpoint("tmp")

From an application coding perspective, the way the StreamingContext is created is slightly different. Instead of creating a new StreamingContext every time, the factory method getOrCreate of the StreamingContext is to be used with a function, as shown in the following code segment. If that is done, when the driver is restarted, the factory method checks the checkpoint directory to see whether an earlier StreamingContext was in use and, if it is found in the checkpoint data, it is re-created. Otherwise, a new StreamingContext is created. The following code snippet gives the definition of a function that can be used with the getOrCreate factory method of the StreamingContext. As mentioned earlier, a detailed treatment of these aspects is beyond the scope of this book:

/**
 * The following function has to be used when the code is being restructured to have checkpointing and driver recovery
 * The way it should be used is to use the StreamingContext.getOrCreate with this function and do a start of that
 */
def sscCreateFn(): StreamingContext = {
  // Variables used for creating the Kafka stream
  // The quorum of Zookeeper hosts
  val zooKeeperQuorum = "localhost"
  // Message group name
  val messageGroup = "sfb-consumer-group"
  // Kafka topics list separated by commas if there are multiple topics to be listened on
  val topics = "sfb"
  // Number of threads per topic
  val numThreads = 1
  // Create the Spark Session and the Spark context
  val spark = SparkSession
    .builder
    .appName(getClass.getSimpleName)
    .getOrCreate()
  // Get the Spark context from the Spark session for creating the streaming context
  val sc = spark.sparkContext
  // Create the streaming context
  val ssc = new StreamingContext(sc, Seconds(10))
  // Create the map of topic names
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  // Create the Kafka stream and extract the message payload
  val appLogLines = KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup, topicMap).map(_._2)
  // Filter the log message lines containing the word ERROR
  val errorLines = appLogLines.filter(line => line.contains("ERROR"))
  // Print the lines containing the error
  errorLines.print()
  // Count the number of messages by window and print them
  errorLines.countByWindow(Seconds(30), Seconds(10)).print()
  // Set the checkpoint directory for saving the data to recover when there is a crash
  ssc.checkpoint("/tmp")
  // Return the streaming context
  ssc
}
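To show how the preceding function would be wired in, here is a minimal, illustrative sketch of using the getOrCreate factory method; the checkpoint directory /tmp matches the one set inside sscCreateFn, and everything else is as defined above:

// Re-create the StreamingContext from the checkpoint data if it exists,
// otherwise build a fresh one using the function defined above
val ssc = StreamingContext.getOrCreate("/tmp", sscCreateFn _)
// Start the streaming and wait for termination as usual
ssc.start()
ssc.awaitTermination()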
At a data source level, it is a good idea to build parallelism for faster data processing and, depending on the source of data, this can be accomplished in different ways. Kafka inherently supports partitioning at the topic level, and that kind of scaling-out mechanism supports a good amount of parallelism. As a consumer of Kafka topics, the Spark Streaming data processing application can have multiple receivers by creating multiple streams, and the data generated by those streams can be combined by the union operation on the Kafka streams (a minimal sketch of this pattern is given at the end of this section).

The production deployment of Spark Streaming data processing applications should be done purely based on the type of application being built. The guidelines given previously are introductory and conceptual in nature. There is no silver-bullet approach to solving production deployment problems; they have to evolve along with the application development.

To summarize, we looked at the production deployment of Spark Streaming data processing applications and the possible ways of implementing fault tolerance in Spark Streaming data processing applications that use Kafka.

To explore more critical and equally important Spark tools such as Spark GraphX, Spark MLlib, and DataFrames, do check out Apache Spark 2 for Beginners to develop efficient large-scale applications with Apache Spark.
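As referenced above, the following is a minimal, illustrative sketch of the multiple-receiver pattern; it reuses the zooKeeperQuorum, messageGroup, and topicMap values from the earlier Scala examples, and the number of parallel receivers (three) is an arbitrary choice for illustration:

// Create several receiver-based Kafka streams so the topic is read in parallel
val numReceivers = 3
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup, topicMap).map(_._2)
}
// Combine them into a single DStream for the downstream processing
val unifiedLogLines = ssc.union(kafkaStreams)
// The ERROR filtering and windowed counts can then be applied to unifiedLogLines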

article-image-how-to-run-spark-in-mesos
Sunith Shetty
31 Jan 2018
6 min read
Save for later

How to run Spark in Mesos

This article is an excerpt from a book written by Muhammad Asif Abbasi titled Learning Apache Spark 2. In this book, you will learn how to perform big data analytics using Spark streaming, machine learning techniques and more. From the article given below, you will learn how to operate Spark with the Mesos cluster manager.

What is Mesos?

Mesos is an open source cluster manager that started as a UC Berkeley research project in 2008 and is quite widely used by a number of organizations. Spark supports Mesos, and Matei Zaharia gave a keynote at MesosCon in June 2016. Here is a link to the YouTube video of the keynote.

Before you start

If you haven't installed Mesos previously, the getting started page on the Apache website gives a good walkthrough of installing Mesos on Windows, MacOS, and Linux. Follow the URL https://mesos.apache.org/getting-started/. Once installed, you need to start up Mesos on your cluster.

Start the Mesos master:
./bin/mesos-master.sh --ip=[MasterIP] --work_dir=/var/lib/mesos

Start Mesos agents on all your worker nodes:
./bin/mesos-agent.sh --master=[MasterIP]:5050 --work_dir=/var/lib/mesos

Make sure Mesos is up and running with all your relevant worker nodes configured:
http://[MasterIP]:5050

Make sure that the Spark binary packages are available and accessible by Mesos. They can be placed on a Hadoop-accessible URI, for example:
HTTP via http://
S3 via s3n://
HDFS via hdfs://

You can also install Spark in the same location on all the Mesos slaves and configure spark.mesos.executor.home to point to that location.

Running in Mesos

Mesos can have single or multiple masters, which means the master URL differs when submitting an application from Spark via Mesos:

Single master: mesos://sparkmaster:5050
Multiple masters (using Zookeeper): mesos://zk://master1:2181,master2:2181/mesos

Modes of operation in Mesos

Mesos supports both the client and cluster modes of operation.

Client mode

Before running in client mode, you need to perform a couple of configurations in spark-env.sh:

export MESOS_NATIVE_JAVA_LIBRARY=<Path to libmesos.so [Linux]> or <Path to libmesos.dylib [MacOS]>
export SPARK_EXECUTOR_URI=<URI of the zipped Spark distribution uploaded to an accessible location, e.g. HTTP, HDFS, S3>

Also set spark.executor.uri to the URI of the zipped Spark distribution uploaded to an accessible location, e.g. HTTP, HDFS, S3.

Batch applications

For batch applications, in your application program you need to pass the Mesos URL as the master when creating your Spark context. As an example:

val sparkConf = new SparkConf()
               .setMaster("mesos://mesosmaster:5050")
               .setAppName("Batch Application")
               .set("spark.executor.uri", "Location of the Spark binaries (HTTP, S3, or HDFS)")
val sc = new SparkContext(sparkConf)

If you are using spark-submit, you can configure the URI in the conf/spark-defaults.conf file using spark.executor.uri.

Interactive applications

When you are running one of the provided Spark shells for interactive querying, you can pass the master argument, for example:

./bin/spark-shell --master mesos://mesosmaster:5050

Cluster mode

Just as in YARN, you can run Spark on Mesos in cluster mode, which means the driver is launched inside the cluster and the client can disconnect after submitting the application and get results from the Mesos WebUI.

Steps to use the cluster mode

Start the MesosClusterDispatcher in your cluster:
./sbin/start-mesos-dispatcher.sh --master mesos://mesosmaster:5050
This will generally start the dispatcher at port 7077.
From the client, submit a job to the Mesos cluster with spark-submit, specifying the dispatcher URL. Example:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://dispatcher:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 2G \
  --total-executor-cores 10 \
  s3n://path/to/examples.jar

Similar to Spark, Mesos has lots of properties that can be set to optimize the processing. You should refer to the Spark Configuration page (http://spark.apache.org/docs/latest/configuration.html) for more information.

Mesos run modes

Spark can run on Mesos in two modes:

Coarse-grained (default mode): Spark will acquire a long-running Mesos task on each machine. This offers a much lower startup cost, but the resources will continue to be allocated to Spark for the complete duration of the application.
Fine-grained (deprecated): The fine-grained mode is deprecated; in this mode, a Mesos task is created per Spark task. The benefit of this is that each application receives cores as per its requirements, but the initial bootstrapping might act as a deterrent for interactive applications.

Key Spark on Mesos configuration properties

While Spark has a number of properties that can be configured to optimize Spark processing, some of these properties are specific to Mesos. We'll look at a few of those key properties here (a short example of setting some of them programmatically is given at the end of this article):

spark.mesos.coarse: Setting it to true (the default value) will run Mesos in coarse-grained mode. Setting it to false will run it in fine-grained mode.
spark.mesos.extra.cores: This is more of an advertisement than an allocation, in order to improve parallelism. An executor will pretend that it has extra cores, resulting in the driver sending it more work. Default = 0.
spark.mesos.mesosExecutor.cores: Only works in fine-grained mode. This specifies how many cores should be given to each Mesos executor.
spark.mesos.executor.home: Identifies the directory of the Spark installation for the executors in Mesos. As discussed, you can specify this using spark.executor.uri as well; however, if you have not specified it, you can specify it using this property.
spark.mesos.executor.memoryOverhead: The amount of memory (in MB) to be allocated per executor.
spark.mesos.uris: A comma-separated list of URIs to be downloaded when the driver or executor is launched by Mesos.
spark.mesos.principal: The name of the principal used by Spark to authenticate itself with Mesos.

You can find other configuration properties at the Spark documentation page (http://spark.apache.org/docs/latest/running-on-mesos.html#spark-properties).

To summarize, we covered the objective to get you started with running Spark on Mesos. To know more about Spark SQL, Spark Streaming, and Machine Learning with Spark, you can refer to the book Learning Apache Spark 2.
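As referenced above, here is a minimal, illustrative sketch of setting a few of these Mesos-specific properties programmatically on a SparkConf; the master URL, application name, and Spark binary location are placeholders rather than values taken from the article:

import org.apache.spark.{SparkConf, SparkContext}

// Coarse-grained mode with an explicit per-executor memory overhead
val mesosConf = new SparkConf()
  .setMaster("mesos://mesosmaster:5050")             // placeholder Mesos master URL
  .setAppName("Mesos Tuning Example")                // hypothetical application name
  .set("spark.mesos.coarse", "true")                 // coarse-grained mode (the default)
  .set("spark.executor.uri", "hdfs:///spark/spark-2.0.0-bin-hadoop2.7.tgz") // placeholder location of the Spark binaries
  .set("spark.mesos.executor.memoryOverhead", "512") // 512 MB of overhead per executor
val sc = new SparkContext(mesosConf)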