How-To Tutorials

article-image-introduction-raspberry-pi-zero-w-wireless

03 Mar 2018

14 min read

Introduction to Raspberry Pi Zero W Wireless

03 Mar 2018

0
0
7713

article-image-internationalization-and-localization

Packt

03 Mar 2018

16 min read

Internationalization and localization

Packt

03 Mar 2018

16 min read

0
0
2658

article-image-compute-discrete-fourier-transform-dft-using-scipy

Pravin Dhandre

02 Mar 2018

5 min read

How to compute Discrete Fourier Transform (DFT) using SciPy

Pravin Dhandre

02 Mar 2018

5 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides numerous recipes to tackle day-to-day challenges associated with scientific computing and data manipulation using SciPy stack.[/box] Today, we will compute Discrete Fourier Transform (DFT) and inverse DFT using SciPy stack. In this article, we will focus majorly on the syntax and the application of DFT in SciPy assuming you are well versed with the mathematics of this concept. Discrete Fourier Transforms A discrete Fourier transform transforms any signal from its time/space domain into a related signal in frequency domain. This allows us to not only analyze the different frequencies of the data, but also enables faster filtering operations, when used properly. It is possible to turn a signal in a frequency domain back to its time/spatial domain, thanks to inverse Fourier transform (IFT). How to do it… To follow with the example, we need to continue with the following steps: The basic routines in the scipy.fftpack module compute the DFT and its inverse, for discrete signals in any dimension—fft, ifft (one dimension), fft2, ifft2 (two dimensions), and fftn, ifftn (any number of dimensions). Verify all these routines assume that the data is complex valued. If we know beforehand that a particular dataset is actually real-valued, and should offer realvalued frequencies, we use rfft and irfft instead, for a faster algorithm. In order to complete with this, these routines are designed so that composition with their inverses always yields the identity. The syntax is the same in all cases, as follows: fft(x[, n, axis, overwrite_x]) The first parameter, x, is always the signal in any array-like form. Note that fft performs one-dimensional transforms. This means that if x happens to be two-dimensional, for example, fft will output another two-dimensional array, where each row is the transform of each row of the original. We can use columns instead, with the optional axis parameter. The rest of the parameters are also optional; n indicates the length of the transform and overwrite_x gets rid of the original data to save memory and resources. We usually play with the n integer when we need to pad the signal with zeros or truncate it. For a higher dimension, n is substituted by shape (a tuple) and axis by axes (another tuple). To better understand the output, it is often useful to shift the zero frequencies to the center of the output arrays with ifftshift. The inverse of this operation, ifftshift, is also included in the module. How it works… The following code shows some of these routines in action when applied to a checkerboard: import numpy from scipy.fftpack import fft,fft2, fftshift import matplotlib.pyplot as plt B=numpy.ones((4,4)); W=numpy.zeros((4,4)) signal = numpy.bmat("B,W;W,B") onedimfft = fft(signal,n=16) twodimfft = fft2(signal,shape=(16,16)) plt.figure() plt.gray() plt.subplot(121,aspect='equal') plt.pcolormesh(onedimfft.real) plt.colorbar(orientation='horizontal') plt.subplot(122,aspect='equal') plt.pcolormesh(fftshift(twodimfft.real)) plt.colorbar(orientation='horizontal') plt.show() Note how the first four rows of the one-dimensional transform are equal (and so are the last four), while the two-dimensional transform (once shifted) presents a peak at the origin and nice symmetries in the frequency domain. In the following screenshot, which has been obtained from the previous code, the image on the left is the fft and the one on the right is the fft2 of a 2 x 2 checkerboard signal: Computing the discrete Fourier transform (DFT) of a data series using the FFT Algorithm In this section, we will see how to compute the discrete Fourier transform and some of its Applications. How to do it… In the following table, we will see the parameters to create a data series using the FFT algorithm: How it works… This code represents computing an FFT discrete Fourier in the main part: np.fft.fft(np.exp(2j * np.pi * np.arange(8) / 8)) array([ -3.44505240e-16 +1.14383329e-17j, 8.00000000e+00 -5.71092652e-15j, 2.33482938e-16 +1.22460635e-16j, 1.64863782e-15 +1.77635684e-15j, 9.95839695e-17 +2.33482938e-16j, 0.00000000e+00 +1.66837030e-15j, 1.14383329e-17 +1.22460635e-16j, -1.64863782e-15 +1.77635684e-15j]) In this example, real input has an FFT that is Hermitian, that is, symmetric in the real part and anti-symmetric in the imaginary part, as described in the numpy.fft documentation. import matplotlib.pyplot as plt t = np.arange(256) sp = np.fft.fft(np.sin(t)) freq = np.fft.fftfreq(t.shape[-1]) plt.plot(freq, sp.real, freq, sp.imag) [<matplotlib.lines.Line2D object at 0x...>, <matplotlib.lines.Line2D object at 0x...>] plt.show() The following screenshot shows how we represent the results: Computing the inverse DFT of a data series In this section, we will learn how to compute the inverse DFT of a data series. How to do it… In this section we will see how to compute the inverse Fourier transform. The returned complex array contains y(0), y(1),..., y(n-1) where: How it works… In this part, we represent the calculous of the DFT: np.fft.ifft([0, 4, 0, 0]) array([ 1.+0.j, 0.+1.j, -1.+0.j, 0.-1.j]) Create and plot a band-limited signal with random phases: import matplotlib.pyplot as plt t = np.arange(400) n = np.zeros((400,), dtype=complex) n[40:60] = np.exp(1j*np.random.uniform(0, 2*np.pi, (20,))) s = np.fft.ifft(n) plt.plot(t, s.real, 'b-', t, s.imag, 'r--') plt.legend(('real', 'imaginary')) plt.show() Then we represent it, as shown in the following screenshot: We successfully explored how to transform signals from time or space domain into frequency domain and vice-versa, allowing you to analyze frequencies in detail. If you found this tutorial useful, do check out the book SciPy Recipes to get hands-on recipes to perform various data science tasks with ease.

0
1
15521

article-image-how-to-use-mapreduce-with-mongo-shell

Amey Varangaonkar

02 Mar 2018

8 min read

How to use MapReduce with Mongo shell

Amey Varangaonkar

02 Mar 2018

8 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering MongoDB 3.x authored by Alex Giamas. This book demonstrates the power of MongoDB to build high performance database solutions with ease.[/box] MongoDB is one of the most popular NoSQL databases in the world and can be combined with various Big Data tools for efficient data processing. In this article we explore interesting features of MongoDB, which has been underappreciated and not widely supported throughout the industry as yet - the ability to write MapReduce natively using shell. MapReduce is a data processing method for getting aggregate results from a large set of data. The main advantage is that it is inherently parallelizable as evidenced by frameworks such as Hadoop. A simple example of MapReduce would be as follows, given that our input books collection is as follows: > db.books.find() { "_id" : ObjectId("592149c4aabac953a3a1e31e"), "isbn" : "101", "name" : "Mastering MongoDB", "price" : 30 } { "_id" : ObjectId("59214bc1aabac954263b24e0"), "isbn" : "102", "name" : "MongoDB in 7 years", "price" : 50 } { "_id" : ObjectId("59214bc1aabac954263b24e1"), "isbn" : "103", "name" : "MongoDB for experts", "price" : 40 } And our map and reduce functions are defined as follows: > var mapper = function() { emit(this.id, 1); }; In this mapper, we simply output a key of the id of each document with a value of 1: > var reducer = function(id, count) { return Array.sum(count); }; In the reducer, we sum across all values (where each one has a value of 1): > db.books.mapReduce(mapper, reducer, { out:"books_count" }); { "result" : "books_count", "timeMillis" : 16613, "counts" : { "input" : 3, "emit" : 3, "reduce" : 1, "output" : 1 }, "ok" : 1 } > db.books_count.find() { "_id" : null, "value" : 3 } > Our final output is a document with no ID, since we didn't output any value for id, and a value of 6, since there are six documents in the input dataset. Using MapReduce, MongoDB will apply map to each input document, emitting key-value pairs at the end of the map phase. Then each reducer will get key-value pairs with the same key as input, processing all multiple values. The reducer's output will be a single key-value pair for each key. Optionally, we can use a finalize function to further process the results of the mapper and reducer. MapReduce functions use JavaScript and run within the mongod process. MapReduce can output inline as a single document, subject to the 16 MB document size limit, or as multiple documents in an output collection. Input and output collections can be sharded. MapReduce concurrency MapReduce operations will place several short-lived locks that should not affect operations. However, at the end of the reduce phase, if we are outputting data to an existing collection, then output actions such as merge, reduce, and replace will take an exclusive global write lock for the whole server, blocking all other writes in the db instance. If we want to avoid that we should invoke MapReduce in the following way: > db.collection.mapReduce( Mapper, Reducer, { out: { merge/reduce: bookOrders, nonAtomic: true } }) We can apply nonAtomic only to merge or reduce actions. replace will just replace the contents of documents in bookOrders, which would not take much time anyway. With the merge action, the new result is merged with the existing result if the output collection already exists. If an existing document has the same key as the new result, then it will overwrite that existing document. With the reduce action, the new result is processed together with the existing result if the output collection already exists. If an existing document has the same key as the new result, it will apply the reduce function to both the new and the existing documents and overwrite the existing document with the result. Although MapReduce has been present since the early versions of MongoDB, it hasn't evolved as much as the rest of the database, resulting in its usage being less than that of specialized MapReduce frameworks such as Hadoop. Incremental MapReduce Incremental MapReduce is a pattern where we use MapReduce to aggregate to previously calculated values. An example would be counting non-distinct users in a collection for different reporting periods (that is, hour, day, month) without the need to recalculate the result every hour. To set up our data for incremental MapReduce we need to do the following: Output our reduce data to a different collection At the end of every hour, query only for the data that got into the collection in the last hour With the output of our reduce data, merge our results with the calculated results from the previous hour Following up on the previous example, let's assume that we have a published field in each of the documents, with our input dataset being: > db.books.find() { "_id" : ObjectId("592149c4aabac953a3a1e31e"), "isbn" : "101", "name" : "Mastering MongoDB", "price" : 30, "published" : ISODate("2017-06-25T00:00:00Z") } { "_id" : ObjectId("59214bc1aabac954263b24e0"), "isbn" : "102", "name" : "MongoDB in 7 years", "price" : 50, "published" : ISODate("2017-06-26T00:00:00Z") } Using our previous example of counting books we would get the following: var mapper = function() { emit(this.id, 1); }; var reducer = function(id, count) { return Array.sum(count); }; > db.books.mapReduce(mapper, reducer, { out: "books_count" }) { "result" : "books_count", "timeMillis" : 16700, "counts" : { "input" : 2, "emit" : 2, "reduce" : 1, "output" : 1 }, "ok" : 1 } > db.books_count.find() { "_id" : null, "value" : 2 } Now we get a third book in our mongo_books collection with a document: { "_id" : ObjectId("59214bc1aabac954263b24e1"), "isbn" : "103", "name" : "MongoDB for experts", "price" : 40, "published" : ISODate("2017-07-01T00:00:00Z") } > db.books.mapReduce( mapper, reducer, { query: { published: { $gte: ISODate('2017-07-01 00:00:00') } }, out: { reduce: "books_count" } } ) > db.books_count.find() { "_id" : null, "value" : 3 } What happened here, is that by querying for documents in July 2017 we only got the new document out of the query and then used its value to reduce the value with the already calculated value of 2 in our books_count document, adding 1 to the final sum of three documents. This example, as contrived as it is, shows a powerful attribute of MapReduce: the ability to re-reduce results to incrementally calculate aggregations over time. Troubleshooting MapReduce Throughout the years, one of the major shortcomings of MapReduce frameworks has been the inherent difficulty in troubleshooting as opposed to simpler non-distributed patterns. Most of the time, the most effective tool is debugging using log statements to verify that output values match our expected values. In the mongo shell, this being a JavaScript shell, this is as simple as outputting using the console.log()function. Diving deeper into MapReduce in MongoDB we can debug both in the map and the reduce phase by overloading the output values. Debugging the mapper phase, we can overload the emit() function to test what the output key values are: > var emit = function(key, value) { print("debugging mapper's emit"); print("key: " + key + " value: " + tojson(value)); } We can then call it manually on a single document to verify that we get back the key-value pair that we would expect: > var myDoc = db.orders.findOne( { _id: ObjectId("50a8240b927d5d8b5891743c") } ); > mapper.apply(myDoc); The reducer function is somewhat more complicated. A MapReduce reducer function must meet the following criteria: It must be idempotent The order of values coming from the mapper function should not matter for the reducer's result The reduce function must return the same type of result as the mapper function We will dissect these following requirements to understand what they really mean: It must be idempotent: MapReduce by design may call the reducer multiple times for the same key with multiple values from the mapper phase. It also doesn't need to reduce single instances of a key as it's just added to the set. The final value should be the same no matter the order of execution. This can be verified by writing our own "verifier" function forcing the reducer to re-reduce or by executing the reducer many, many times: reduce( key, [ reduce(key, valuesArray) ] ) == reduce( key, valuesArray ) It must be commutative: Again, because multiple invocations of the reducer may happen for the same key, if it has multiple values, the following should hold: reduce(key, [ C, reduce(key, [ A, B ]) ] ) == reduce( key, [C, A, B ] ) The order of values coming from the mapper function should not matter for the reducer's result: We can test that the order of values from the mapper doesn't change the output for the reducer by passing in documents to the mapper in a different order and verifying that we get the same results out: reduce( key, [ A, B ] ) == reduce( key, [ B, A ] ) The reduce function must return the same type of result as the mapper function: Hand-in-hand with the first requirement, the type of object that the reduce function returns should be the same as the output of the mapper function. We saw how MapReduce is useful when implemented on a data pipeline. Multiple MapReduce commands can be chained to produce different results. An example would be aggregating data by different reporting periods (hour, day, week, month, year) where we use the output of each more granular reporting period to produce a less granular report. If you found this article useful, make sure to check our book Mastering MongoDB 3.x to get more insights and information about MongoDB’s vast data storage, management and administration capabilities.

0
0
6205

article-image-preparing-spring-web-development-environment

Packt

02 Mar 2018

28 min read

Preparing the Spring Web Development Environment

Packt

02 Mar 2018

28 min read

In this article by Ajitesh Kumar, the author of the book Building Web Apps with Spring 5 and Angular, we will see the key aspects of web request-response handling in relation with Spring Web MVC framework. In this article, we will go into the details of setting up development environment for working with Spring web applications. Following are going to be the key areas we are going to look into: Installing Java SDK Installing/configuring Maven Installing Eclipse IDE Installing/configuring Apache Tomcat Server Installing/configuring MySQL Database Introducing Docker containers Setting up development environment using Docker-compose (For more resources related to this topic, see here.) Installing Java SDK First and foremost, we will install Java SDK. We will work with Java 8 throughout this book. Go ahead and access this page (http://www.oracle.com/technetwork/java/javase /downloads/jdk8-downloads-2133151.html). Download the appropriate JDK kit. For Windows OS, there are two different versions, one for x86 and another for x64. One should select appropriate version and download “exe” file. Once downloaded, double-click on the executable file. This would start the installer. Once installed, following needs to be done: Set the JAVA_HOME as the path where JDK is installed. Include the path in %JAVA_HOME%/bin in the environment variable. One could do that by adding the %JAVA_HOME%/bin directory to his/her user PATH environment variable by opening up the system properties (WinKey + Pause), selecting the “Advanced” tab, and the “Environment Variables” button, then adding or selecting the PATH variable in the user variables with the value. Once done with the preceding steps, open a shell and type the command, "java - version". It should print the version of Java you installed just now. Next, let us try and understand how to install and configure Maven, a tool for building and managing Java projects. Installing/Configuring Maven Maven is a tool which can be used for building and managing Java-based project. Following are some of the key benefits of using Maven as a build tool: It provides a simple project setup that follows best practices - Get a new project or module started in seconds. It allows a project to build using its Project Object Model (POM) and a set of plugins that are shared by all projects using Maven, providing a uniform build system. It allows usage of large and growing repository of libraries and metadata to use out of the box. Based on model based builds, it provides ability to work with multiple projects at the same time. Any number of projects can be built into predefined output types such as a JAR, WAR, or distribution based on metadata about the project, without the need to do any scripting in most cases. One can download Maven from https://maven.apache.org/download.cgi. Before installing Maven, make sure Java is installed and configured (JAVA_HOME) appropriately as mentioned in the previous section. On Windows, you could check the same by typing the command, “echo %JAVA_HOME%”: Extract distribution archive in any directory. If it works on Windows, install unzip tool such as WinRAR. Right-click on the ZIP file and unzip it. A directory (with name as “apache-maven-3.3.9”, the version of maven at the time of writing) holding the files such as bin, conf and so on will be created. Add the bin directory of the created directory, “apache-maven-3.3.9” to the PATH environment variable. One could do that by adding the bin directory to his/her user PATH environment variable by opening up the system properties (WinKey + Pause), selecting the “Advanced” tab, and the “Environment Variables” button, then adding or selecting the PATH variable in the user variables with the value Open a new shell and type, “mvn -v”. The result should print the Maven version along with details including Java version, Java home, OS name, and so on. Now, let’s look at how can we create a Java project using Maven from command prompt before we get on to creating a Maven project in Eclipse IDE. Use following mvn command to create a Java project: mvn archetype:generate -DgroupId=com.healthapp -DartifactId=HealthApp - DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false With archetype:generate and -DarchetypeArtifactId=maven-archetypequickstart template, following project directory structure is created: In the preceding diagram, healthapp folders within src/main and src/test folder consist of a hello world program named as "App.java" and a corresponding test program such as "AppTest.java". Also, the at the top most folder, a pom.xml file is created. In the next section, we will install Eclipse IDE and create a maven project using the functionality provided by the IDE. Installing Eclipse IDE In this section, we will get ourselves setup with Eclipse IDE, a tool used by Java developers to create Java EE and web applications. Go to Eclipse website, http://www.eclipse.org and download the latest version of Eclipse and install thereafter. As we shall be working with web applications, select the option such as "Eclipse IDE for Java EE Developers" while downloading the IDE. As you launch the IDE, it will ask to select a folder for workspace. Select appropriate path and start the IDE. Following are some of the different types of projects developers could work using IDE: A new Java EE Web project A new JavaScript project. This option will be very useful when you are working with standalone JavaScript project and planning to integrate with server components using APIs. Checkout existing Eclipse projects from Git and work on them Import one or more existing Eclipse projects from filesystem or archive Import existing Maven Project in Eclipse In previous section, we have created a maven project namely HealthApp. We will now see how we can import this project into Eclipse IDE: Click File > import. Type Maven in the search box under Select an import source. Select Existing Maven Projects. Click Next. Click Browse and select the HealthApp folder which is the root of the Maven project. Note that it contains the pom.xml file. Click Finish. The project will be imported in Eclipse. Make sure this is how it looks like: Figure 2: Maven project imported into Eclipse Let's also see how one can create a new Maven project with Eclipse IDE. Create new Maven Project in Eclipse Follow the instructions given to create new Java Maven project with Eclipse IDE: Click File > New > Project. Type Maven in the search box under Wizards. Select Maven project. A dialog box with title as "New Maven Project", having option "use default Workspace location" as checked, appears. Make sure that Group Id is selected as org.apache.maven.archetypes with Artifact Id selected as maven-archetype-quickstart. Give a name to Group Id, say, "com.orgname". Give a name to Artifact Id, say, "healthapp2". Click Finish. As a result of preceding steps, a new Maven project will be created in Eclipse. Make sure this is how it looks like: Figure 3: Maven project created within Eclipse In next section, we will see how to install and configure Tomcat Server. Installing/Configuring Apache Tomcat Server In this section, we will learn about some of the following: How to install and configure Apache Tomcat server Common deployment approaches with Tomcat server How to add Tomcat server in Eclipse The Apache Tomcat software is an open source implementation of the Java Servlet, JavaServer Pages (JSPs), Java Expression Language and Java WebSocket technologies. We will work with Apache Tomcat 8.x version in this book. We will look at both Windows and Unix version of Java. One can go to http://tomcat.apache.org/ and download the appropriate version from this page. At the time of installation, it requires you to choose the path to one of the JREs installed on your computer. Once installation is complete, Apache Tomcat server is started as a Windows service. With default installation options, one can then access the Tomcat server by accessing URL such as http://127.0.0.1:8080/. A page such as following will be displayed: Figure 4: Apache Tomcat Server Homepage Following is how Tomcat's folder structure looks like: Figure 5: Apache Tomcat Folder Structure In the preceding diagram, note the "webapps" folder which will contain our web apps. The following description uses the variable name such as following: $CATALINA_HOME, the directory into which Tomcat is installed. $CATALINA_BASE, the base directory against which most relative paths are resolved. If you have not configured Tomcat for multiple instances by setting a CATALINA_BASE directory, then $CATALINA_BASE will be set to the value of $CATALINA_HOME. Following are most commonly used approaches to deploy web apps in Tomcat: Copy unpacked directory hierarchy into a subdirectory in directory $CATALINA_BASE/webapps/. Tomcat will assign a context path to your application based on the subdirectory name you choose. Copy the web application archive (WAR) file into directory $CATALINA_BASE/webapps/. When Tomcat is started, it will automatically expand the web application archive file into its unpacked form, and execute the application that way. Let us learn how to configure Apache Tomcat from within Eclipse. This would be very useful as one could start and stop Tomcat from Eclipse while working with his/her web applications. Adding/Configuring Apache Tomcat in Eclipse In this section, we will learn how to add and configure Apache Tomcat in Eclipse. It would help to start and stop the server from within Eclipse IDE. Following steps need to be taken to achieve this objective: Make sure you are in Java EE perspective. Click on "Servers" tab in lower panel. You will find a link saying "No servers are available. Click this link to create a new server...". Click on this link. Type "Tomcat" under Select the server type. It would show a list of Tomcat server with different versions. Select "Tomcat v8.5 Server" and click Next. Select the Tomcat installation directory. Click on "Installed JREs..." button and make sure that appropriate JRE is checked. Click Next. Click Finish. This would create an entry for Tomcat server in "Servers" tab. Double-click on Tomcat server. This would open up a configuration window where multiple options such as Server Locations, Server Options, Ports can be configured. Under Server Locations, click on "Browse Path" button to select the path to "webapps" folder within your local Tomcat installation folder. Once done, save it using Ctrl-S. Right click on "Tomcat Server" link listed under "Servers" panel and click "Start". This should start the server. You should be able to access the Tomcat page on the URL, http://localhost:8080/. Installing/Configuring MySQL Database In this section, we will learn on how to install MySQL database. Go to MySQL Downloads site (https://www.mysql.com/downloads/) and click on "Community (GPL) Downloads" under MySQL community edition. On the next page, you will see listing of several MySQL software packages. Download following: MySQL Community Server MySQL Connector for Java development (Connector/J) Installing/Configuring MySQL Server In this section, we will see how to download, install and configure the MySQL database and related utility such as MySQL Workbench. Note that MySQL Workbench is a unified visual tool which can be used by database architects, developers and DBA for activities such as data modeling, SQL development, and comprehensive administration tools for server configuration, user administration etc. Follow the instructions given for installation & configuration of MySQL server and workbench: Click on "Download" link under "MySQL Community Server (GPL)" found as first entry on "MySQL Community Downloads" page. We shall be working with Windows version of MySQL in the following instructions. Click the "Download" button against the entry "Windows (x86, 32-bit), MySQL Installer MSI". This would download the an exe file such as mysql-installercommunity-5.7.16.0.exe. Double-click on the installer to start the installation. As you progress ahead after accepting the license terms and condition, you would find the interactive UI such as following. Choose the appropriate version of MySQL server and also, MySQL Workbench and click on Next. Figure 6: Selecting and installing MySQL Server and MySQL Workbench Clicking on Execute would install the MySQL server and MySQL workbench as shown in the following diagram: Figure 7: MySQL Server and Workbench installation in progress Once installation is complete, next few steps would require you to configure the MySQL database including setting root password, adding one or more users, opting to start MySQL server as a Windows service and so on. The quickest way will be to use default instructions as much as possible and finish the installation. Once all is done, you would see UI such as following: Figure 8: Completion of MySQL Server and Workbench installation Clicking on "Finish" button will take on the next window where you could choose to start MySQL workbench. Following is how the MySQL Workbench would look like after you click on MySQL server instance on the Workbench homepage, enter the root password and execute "Show databases" command: Figure 9: MySQL Workbench Using MySQL Connector Before testing MySQL database connection from Java program, one would need to add the MySQL JDBC connector library to the classpath. In this section, we will learn how to configure/add MySQL JDBC connector library to classpath while working with Eclipse IDE or command console. The MySQL connector (Connector/J) comes in ZIP file (*.tar.gz). The MySQL connector is a concrete implementation of JDBC API. Once extracted, one can see a JAR file with name such as mysql-connector-java-xxx.jar. Following are different ways in which this JAR file is dealt with while working with or without IDEs such as Eclipse: While working with Eclipse IDE, one can add the JAR file to the classpath by adding it as Library to the Build Path in project's properties. While working with command console, one needs to specify the path to the JAR file in the -cp or -classpath argument when executing the Java application. Following is the sample command representing the preceding: java -cp .;/path/to/mysql-connector-java-xxx.jar com.healthapp.JavaClassName Note the "." in classpath (-cp) option. This is there to add the current directory to the classpath as well such that com.healthapp.JavaClassName can be located. Connecting to MySQL Database from a Java Class In this section, we will learn how to test the MySQL database connection from a Java program. Before executing the code shown as follows in your Eclipse IDE, make sure to do the following: Add the MySQL connector jar file by right-clicking on top-level project folder, clicking on "Properties", clicking on "Java Build Path" and, then, adding mysqlconnector-java-xxx.jar file by clicking on "Add External JARs...": Figure 10: Adding MySQL Java Connector to Java Build Path in Eclipse IDE Create a MySQL database namely "healthapp". You could do that by accessing MySQL Workbench and executing the MySQL command such as "create database healthapp". Following diagram represents the same: Figure 11: Creating new MySQL Database using MySQL Workbench Once done with the preceding steps, use the following code to test the connection to MySQL database from your Java class. On successful connection, you should be able to see "Database connected!" getting printed. import java.sql.Connection; import java.sql.DriverManager; import java.sql.SQLException; /** * Sample program to test MySQL database connection */ public class App { public static void main( String[] args ) { String url = "jdbc:mysql://localhost:3306/healthapp"; String username = "root"; String password = "r00t"; //Root password set during MySQL installation procedure as described above. System.out.println("Connecting database..."); try { Connection connection = DriverManager.getConnection(url, username, password); System.out.println("Database connected!"); } catch (SQLException e) { throw new IllegalStateException("Cannot connect the database!", e); } } } Introduction to Dockers Docker is a virtualization technology which helps IT organizations achieve some of the following: Enable Dev/QA team develop and test applications in a quick and easy manner in any environment. Break the barriers between Dev/QA and Operations teams during software development life cycle (SDLC) processes. Optimize infrastructure usage in the most appropriate manner. In this section, we will emphasize on first point which would help us setup Spring web application development in quick and easy manner. So far, we have seen traditional manners in which we could set up the Java web application development environment by installing different tools in independent manner and later configuring them appropriately. In a traditional setup, one would be required to setup and configure Java, Maven, Tomcat, MySQL server and so on, one tool at a time, by following manual steps. On the same lines, you could see that all of the steps described in preceding sections have to be performed one-by-one in manual fashion. Following are some of the disadvantages of setting up development/test environments in this manner: Conflicting Runtimes: If a need arises to use software packages (say, different versions of Java and Tomcat) of different versions to run and test the same web application, it can become very cumbersome to manually set up the environment having different versions of software. Environments getting corrupted: If more than one developers are working in a particular development environment, there are chances that the environment could get corrupted due to changes made by one developer while others are not aware about. And, that generally leads to developers'/team's productivity loss due to time spent in fixing the configuration issue or re-installing the development environment from scratch. "Works for me" syndrome: Have you come across another member of your team saying that the application works in their environment although the application seems to have broken? New Developers/Testers' On-boarding: If there is a need to quickly on-board the new developers, manually setting up development environment takes some significant amount of time depending upon the applications' complexity. All of the praceding disadvantages could be taken care by making use of Dockers technology. In this section, we will learn briefly about some of the following: What are Docker Containers? What are key building blocks of Docker containers? Installing Dockers Useful commands to work with Docker containers What are Docker Containers? In this section, we will try and understand what are Docker containers while comparing them with real-world containers. Simply speaking, Docker is an open platform for developing, shipping and running applications. It provides the ability to package and run an application in a loosely isolated environment called a container. Before going into details of Docker containers, let us try and understand the problems that are solved by real-world containers. What are real-world containers good for? Following picture represents real world containers which are used to package annything and everything and, then, transport the goods from one place to other in an easy and safe manner:Figure 12: Real-world containers The following diagram represents different form of goods which needs to be transported using different from of transport mechanisms from one place to another: Figure 13: Different forms of goods vis-a-vis different form of transport mechanisms The following diagram displays the matrix representing need to transport each of the goods via different transport mechanism. The challenge is to make sure that these goods get transported in easy and safe manner: Figure 14: Complexity associated with transporting goods of different types using different transport mechanisms In order to solve preceding problem of transporting the goods in safe and easy manner irrespective of transport medium, the containers are used. Look at the following diagram: Figure 15: Goods can be packed within containers, and containers can be transported. How does Docker containers relate to the real-world containers? Now imagine the act of moving a software application from one environment to another environment starting from development right up to production. Following diagram represents complexity associated with making different application components work in different environments: Figure 16: Complexity associated with making different application components work in different environments As per the preceding diagram, to make different application components work in different environments (different hardware platforms), one would require to make sure environment compatible software versions and related configurations are set appropriately. Doing this using manual steps can be real cumbersome and error prone task. This is where docker containers fit in. Following diagram represents containerizing different application components using Docker containers. As like real-world containers, it would become very easy to move the containerized application components from one environment to another with very less or no issues: Figure 17: Docker containers to move application components across different environments Docker containers In simple terms, Docker containers provide an isolated and secured environment for the application components to run. The isolation and security allows one or many containers to run simultaneously on a given host. Often, for simplicity sake, Docker containers are loosely termed as lightweight-VMs (Virtual Machine). However, they are very much different from the traditional VMs. Docker containers do not need hypervisors to run as like virtual machines and, thus, multiple containers can be run on a given hardware combination. Virtual machines include the application, the necessary binaries and libraries, and an entire guest operating system; all of which can amount to tens of GBs. On the other hand, Docker Containers include the application and all of its dependencies; but share the kernel with other containers, running as isolated processes in user space on the host operating system. Docker containers are not tied to any specific infrastructure: they run on any computer, on any infrastructure, and in any cloud. This very aspect make them look like a real-world container. Following diagram sums it all: Figure 18: Difference between traditional VMs and Docker containers Following are some of the key building blocks of Docker technology: Docker Containers: Isolated and secured environment for applications to run. Docker engine: A client-server application having following components: Daemon process used to create and manage Docker objects, such as images, containers, networks, and data volumes. A REST API interface and A command line interface (CLI) client Docker client: Client program that invokes Docker engine using APIs. Docker host: Underlying operating system sharing the kernel space with Docker containers. Until recently, Windows OS needed Linux virtualization to host Docker containers. Docker hub: Public repository used to manage Docker images posted by various users. Images made public are available for all to download in order to create containers using those images. What are key building blocks of Dockers containers? For setting up our development environment, we will rely on Docker containers and assemble them together using the tool called as Docker compose which we shall learn about little later. Let us understand some of the following which can also be termed as key building blocks of Docker containers: Docker image: In simple terms, Docker image can be thought of as a "class" in Java. Docker containers can be thought of as running instances of the image as like having one or more "instances" of a Java class. Technically speaking, Docker images consist of a list of layers that are stacked on top of each other to form a base for containers' root file system. Following diagram represents command which can be used to create a Docker container using an image named helloworld: Figure 19: Docker command representing creation of Docker container using a Docker image. In order to set up our development environment, we will require images of following to create the respective Docker containers: - Tomcat - MySQL. Dockerfile: Dockerfile is a text document that contains all the commands which could be called on the command line to assemble or build an image. docker build command is used to build an image from a Dockerfile and a context. In order to create custom images for Tomcat and MySQL, it may be required to create a Dockerfile and, then, build the image. Following is a sample command for building an image using a Dockerfile: docker build -f tomcat.df -t tomcat_debug The preceding command would look for the Dockerfile "tomcat.df" in the current directory specified by "." and build the image with tag, "tomcat_debug". Installing Dockers Now that we have got an understanding on What are Dockers, lets install Dockers. We shall look into steps that are required to install Dockers on Windows OS: Download the Windows version of Docker Toolbox from the webpage, https https://www.docker.com/products/docker-toolbox. Docker toolbox comes as an installer which can be double-clicked for quick setup and launch of the docker environment. Following comes with Docker toolbox installation: Docker Machine for running docker-machine commands. Docker Engine for running the docker commands. Docker Compose for running the docker-compose commands. This is what we are looking for. Kitematic, the Docker GUI. A shell preconfigured for a Docker command-line environment. Oracle VirtualBox. Setting up Development Environment using Docker Compose In this section, we will learn how to setup on-demand, self-service development environment using Docker compose. Following are some of the points covered in this section: What is Docker compose? Docker compose script for setting up the development environment What is Docker Compose? Docker compose is a tool for defining and running multi-container Docker applications. One will require to create a Compose file to configure the application's services. Following steps are required to be taken in order to work with Docker compose: Define the application’s environment with a Dockerfile so it can be reproduced anywhere. Define the services that make up the application in docker-compose.yml so they can be run together in an isolated environment. Lastly, run docker-compose up and Compose will start and run the entire application. As we are going to setup a multi-container applications using Tomcat and MySQL as different containers, we will use Docker compose to configure both of them and, then, assemble the application. Docker Compose script for setting up the development environment In order to come up with a Docker compose script which can set up our Spring Web App development environment with one script execution, we will first set up images for following by creating independent Dockerfiles. Tomcat 8.x with Java and Maven installed as one container MySQL as another container Setting up Tomcat 8.x as a Container Service Following steps can be used to setup Tomcat 8.x along with Java 8 and Maven 3.x as one container: Create a folder and put following files within the folder. The source code for the files will be given as follows: tomcat.df create_tomcat_admin_user.sh run.sh Copy following source code for tomcat.df: FROM phusion/baseimage:0.9.17 RUN echo "deb http://archive.ubuntu.com/ubuntu trusty main universe" > /etc/apt/sources.list RUN apt-get -y update RUN DEBIAN_FRONTEND=noninteractive apt-get install -y -q pythonsoftware-properties software-properties-common ENV JAVA_VER 8 ENV JAVA_HOME /usr/lib/jvm/java-8-oracle RUN echo 'deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' >> /etc/apt/sources.list && echo 'deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' >> /etc/apt/sources.list && apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C2518248EEA14886 && apt-get update && echo oracle-java${JAVA_VER}-installer shared/accepted-oraclelicense-v1-1 select true | sudo /usr/bin/debconf-set-selections && apt-get install -y --force-yes --no-install-recommends oraclejava${JAVA_VER}-installer oracle-java${JAVA_VER}-set-default && apt-get clean && rm -rf /var/cache/oracle-jdk${JAVA_VER}-installer RUN update-java-alternatives -s java-8-oracle RUN echo "export JAVA_HOME=/usr/lib/jvm/java-8-oracle" >> ~/.bashrc RUN apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* ENV MAVEN_VERSION 3.3.9 RUN mkdir -p /usr/share/maven && curl -fsSL http://apache.osuosl.org/maven/maven-3/$MAVEN_VERSION/binaries/apac he-maven-$MAVEN_VERSION-bin.tar.gz | tar -xzC /usr/share/maven --strip-components=1 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn ENV MAVEN_HOME /usr/share/maven VOLUME /root/.m2 RUN apt-get update && apt-get install -yq --no-install-recommends wget pwgen cacertificates && apt-get clean && rm -rf /var/lib/apt/lists/* ENV TOMCAT_MAJOR_VERSION 8 ENV TOMCAT_MINOR_VERSION 8.5.8 ENV CATALINA_HOME /tomcat RUN wget -q https://archive.apache.org/dist/tomcat/tomcat-${TOMCAT_MAJOR_VERSIO N}/v${TOMCAT_MINOR_VERSION}/bin/apache-tomcat${TOMCAT_MINOR_VERSION}.tar.gz && wget -qOhttps://archive.apache.org/dist/tomcat/tomcat-${TOMCAT_MAJOR_VERSIO N}/v${TOMCAT_MINOR_VERSION}/bin/apache-tomcat${TOMCAT_MINOR_VERSION}.tar.gz.md5 | md5sum -c - && tar zxf apache-tomcat-*.tar.gz && rm apache-tomcat-*.tar.gz && mv apache-tomcat* tomcat ADD create_tomcat_admin_user.sh /create_tomcat_admin_user.sh RUN mkdir /etc/service/tomcat ADD run.sh /etc/service/tomcat/run RUN chmod +x /*.sh RUN chmod +x /etc/service/tomcat/run EXPOSE 8080 CMD ["/sbin/my_init"] Copy following code in a file named as create_tomcat_admin_user.sh. This file should be created in the same folder as preceding file, tomcat.df. While copying into notepad and later using with docker terminal, you may find Ctrl-M character inserted at the end of the line. Make sure that those lines are appropriately handled and removed: #!/bin/bash if [ -f /.tomcat_admin_created ]; then echo "Tomcat 'admin' user already created" exit 0 fi PASS=${TOMCAT_PASS:-$(pwgen -s 12 1)} _word=$( [ ${TOMCAT_PASS} ] && echo "preset" || echo "random" ) echo "=> Creating an admin user with a ${_word} password in Tomcat" sed -i -r 's/</tomcat-users>//' ${CATALINA_HOME}/conf/tomcatusers.xml echo '<role rolename="manager-gui"/>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo '<role rolename="manager-script"/>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo '<role rolename="manager-jmx"/>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo '<role rolename="admin-gui"/>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo '<role rolename="admin-script"/>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo "<user username="admin" password="${PASS}" roles="managergui,manager-script,manager-jmx,admin-gui, admin-script"/>" >> ${CATALINA_HOME}/conf/tomcat-users.xml echo '</tomcat-users>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo "=> Done!" touch /.tomcat_admin_created echo "================================================================== ======" echo "You can now configure to this Tomcat server using:" echo "" echo " admin:${PASS}" echo "" echo "================================================================== ======" Copy following code in a file named as run.sh in the same folder as preceding two files: #!/bin/bash if [ ! -f /.tomcat_admin_created ]; then /create_tomcat_admin_user.sh fi exec ${CATALINA_HOME}/bin/catalina.sh run Open up a Docker terminal and go to folder where these files are located. Execute following command to create the Tomcat image. In few minutes, the tomcat image will be created: docker build -f tomcat.df -t demo/tomcat:8 . Execute the command such as following and make sure that an image with name as demo/tomcat is found: docker images Next, run a container with name such as "tomcatdev" using following command: docker run -ti -d -p 8080:8080 --name tomcatdev -v "$PWD":/mnt/ demo/tomcat:8 Open a browser and type the URL as http://192.168.99.100:8080/. You should be able to see following page getting loaded. Note the URL and the Tomcat version, 8.5.8. This is the same version we earlier installed (check figure 1.4): Figure 20: Tomcat 8.5.8 installed as a Docker container You could access the container through the terminal using command with following command. Make sure to check the Tomcat installation inside folder "/tomcat". Also, execute command such as "java -version" and "mvn -v" to check the version of Java and Maven respectively: docker exec -ti tomcatdev /bin/bash In this section, we learnt to setup Tomcat 8.5.8 along with Java 8 and Maven 3.x as one container. Setting up MySQL as a Container Service In this section, we will learn how to setup MySQL as a container service. In the docker terminal, execute the following command: docker run -ti -d -p 3326:3306 --name mysqldev -e MYSQL_ROOT_PASSWORD=r00t -v "$PWD":/mnt/ mysql:5.7 The preceding command setup MySQL 5.7 version within the container and starts the mysqld service. Open MySQL Workbench and create a new connection by entering the details such as following and click "Test Connection". You should be able to establish the connection successfully: Figure 21: MySQL server running in the container and accessible from host machine at 3326 port using MySQL Workbench Docker Compose script to setup the Dev Environment Now, that we have setup both Tomcat and MySQL as individual containers, let us learn to create a Docker compose script using which both the containers can be started simultaneously thereby starting the Dev environment. Save following source code as docker-compose.yml in the same folder as preceding mentioned files: version: '2' services: web: build: context: . dockerfile: tomcat.df ports: - "8080:8080" volumes: - .:/mnt/ links: - db db: image: mysql:5.7 ports: - "3326:3306" environment: - MYSQL_ROOT_PASSWORD=r00t Execute following command to start and stop the services: // For starting the services in the foreground docker-compose up // For starting the services in the background (detached mode) docker-compose up -d // For stopping the services docker-compose stop Test whether both the default Tomcat web app and MySQL server can be accessed. Access the URL, 192.168.99.100:8080 and make sure that the web page as shown in figure 1.20 is displayed. Also, open MySQL Workbench and access the MySQL server at IP, 192.168.99.100 and port 3326 (as specified in the preceding docker-compose.yml file). Summary In this article, we learnt how we could start and stop the Web app Dev environment on- demand. Note that with these scripts including Dockerfiles, shell scripts and Dockercompose file, you could setup the Dev environment on any machine where Docker Toolbox could be installed. Resources for Article: Further resources on this subject: Building Web Apps with Spring 5 and Angular 4 Spring 5 Design Patterns

0
0
3937

How-To Tutorials

article-image-implementing-apache-spark-k-means-clustering-method-on-digital-breath-test-data-for-road-safety

Savia Lobo

01 Mar 2018

7 min read

Implementing Apache Spark K-Means Clustering method on digital breath test data for road safety

Savia Lobo

01 Mar 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Mastering Apache Spark 2.x - Second Edition written by Romeo Kienzler. In this book, you will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.[/box] In today’s tutorial, we have used the Road Safety test data from our previous article, to show how one can attempt to find clusters in data using K-Means algorithm with Apache Spark MLlib. Theory on Clustering The K-Means algorithm iteratively attempts to determine clusters within the test data by minimizing the distance between the mean value of cluster center vectors, and the new candidate cluster member vectors. The following equation assumes dataset members that range from X1 to Xn; it also assumes K cluster sets that range from S1 to Sk, where K <= n. K-Means in practice The K-Means MLlib functionality uses the LabeledPoint structure to process its data and so it needs numeric input data. As the same data from the last section is being reused, we will not explain the data conversion again. The only change that has been made in data terms in this section, is that processing in HDFS will now take place under the /data/spark/kmeans/ directory. Additionally, the conversion Scala script for the K-Means example produces a record that is all comma-separated. The development and processing for the K-Means example has taken place under the /home/hadoop/spark/kmeans directory to separate the work from other development. The sbt configuration file is now called kmeans.sbt and is identical to the last example, except for the project name: name := "K-Means" The code for this section can be found in the software package under chapter7K-Means. So, looking at the code for kmeans1.scala, which is stored under kmeans/src/main/scala, some similar actions occur. The import statements refer to the Spark context and configuration. This time, however, the K-Means functionality is being imported from MLlib. Additionally, the application class name has been changed for this example to kmeans1: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.clustering.{KMeans,KMeansModel} object kmeans1 extends App { The same actions are being taken as in the last example to define the data file--to define the Spark configuration and create a Spark context: val hdfsServer = "hdfs://localhost:8020" val hdfsPath = "/data/spark/kmeans/" val dataFile = hdfsServer + hdfsPath + "DigitalBreathTestData2013- MALE2a.csv" val sparkMaster = "spark://localhost:7077" val appName = "K-Means 1" val conf = new SparkConf() conf.setMaster(sparkMaster) conf.setAppName(appName) val sparkCxt = new SparkContext(conf) Next, the CSV data is loaded from the data file and split by comma characters into the VectorData variable: val csvData = sparkCxt.textFile(dataFile) val VectorData = csvData.map { csvLine => Vectors.dense( csvLine.split(',').map(_.toDouble)) } A KMeans object is initialized, and the parameters are set to define the number of clusters and the maximum number of iterations to determine them: val kMeans = new KMeans val numClusters = 3 val maxIterations = 50 Some default values are defined for the initialization mode, number of runs, and Epsilon, which we needed for the K-Means call but did not vary for the processing. Finally, these parameters were set against the KMeans object: val initializationMode = KMeans.K_MEANS_PARALLEL val numRuns = 1 val numEpsilon = 1e-4 kMeans.setK( numClusters ) kMeans.setMaxIterations( maxIterations ) kMeans.setInitializationMode( initializationMode ) kMeans.setRuns( numRuns ) kMeans.setEpsilon( numEpsilon ) We cached the training vector data to improve the performance and trained the KMeans object using the vector data to create a trained K-Means model: VectorData.cache val kMeansModel = kMeans.run( VectorData ) We have computed the K-Means cost and number of input data rows, and have output the results via println statements. The cost value indicates how tightly the clusters are packed and how separate the clusters are: val kMeansCost = kMeansModel.computeCost( VectorData ) println( "Input data rows : " + VectorData.count() ) println( "K-Means Cost : " + kMeansCost ) Next, we have used the K-Means Model to print the cluster centers as vectors for each of the three clusters that were computed: kMeansModel.clusterCenters.foreach{ println } Finally, we use the K-Means model predict function to create a list of cluster membership predictions. We then count these predictions by value to give a count of the data points in each cluster. This shows which clusters are bigger and whether there really are three clusters: val clusterRddInt = kMeansModel.predict( VectorData ) val clusterCount = clusterRddInt.countByValue clusterCount.toList.foreach{ println } } // end object kmeans1 So, in order to run this application, it must be compiled and packaged from the kmeans subdirectory as the Linux pwd command shows here: [hadoop@hc2nn kmeans]$ pwd /home/hadoop/spark/kmeans [hadoop@hc2nn kmeans]$ sbt package Loading /usr/share/sbt/bin/sbt-launch-lib.bash [info] Set current project to K-Means (in build file:/home/hadoop/spark/kmeans/) [info] Compiling 2 Scala sources to /home/hadoop/spark/kmeans/target/scala-2.10/classes... [info] Packaging /home/hadoop/spark/kmeans/target/scala-2.10/k- means_2.10-1.0.jar ... [info] Done packaging. [success] Total time: 20 s, completed Feb 19, 2015 5:02:07 PM Once this packaging is successful, we check HDFS to ensure that the test data is ready. As in the last example, we convert our data to numeric form using the convert.scala file, provided in the software package. We will process the DigitalBreathTestData2013- MALE2a.csv data file in the HDFS directory, /data/spark/kmeans, as follows: [hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/kmeans Found 3 items -rw-r--r-- 3 hadoop supergroup 24645166 2015-02-05 21:11 /data/spark/kmeans/DigitalBreathTestData2013-MALE2.csv -rw-r--r-- 3 hadoop supergroup 5694226 2015-02-05 21:48 /data/spark/kmeans/DigitalBreathTestData2013-MALE2a.csv drwxr-xr-x - hadoop supergroup 0 2015-02-05 21:46 /data/spark/kmeans/result The spark-submit tool is used to run the K-Means application. The only change in this command is that the class is now kmeans1: spark-submit --class kmeans1 --master spark://localhost:7077 --executor-memory 700M --total-executor-cores 100 /home/hadoop/spark/kmeans/target/scala-2.10/k-means_2.10-1.0.jar The output from the Spark cluster run is shown to be as follows: Input data rows : 467054 K-Means Cost : 5.40312223450789E7 The previous output shows the input data volume, which looks correct; it also shows the K- Means cost value. The cost is based on the Within Set Sum of Squared Errors (WSSSE) which basically gives a measure how well the found cluster centroids are matching the distribution of the data points. The better they are matching, the lower the cost. The following link https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/ explains WSSSE and how to find a good value for k in more detail. Next come the three vectors, which describe the data cluster centers with the correct number of dimensions. Remember that these cluster centroid vectors will have the same number of columns as the original vector data: [0.24698249738061878,1.3015883142472253,0.005830116872250263,2.917374778855 5207,1.156645130895448,3.4400290524342454] [0.3321793984152627,1.784137241326256,0.007615970459266097,2.58319870759289 17,119.58366028156011,3.8379106085083468] [0.25247226760684494,1.702510963969387,0.006384899819416975,2.2314042480006 88,52.202897927594805,3.551509158139135] Finally, cluster membership is given for clusters 1 to 3 with cluster 1 (index 0) having the largest membership at 407539 member vectors: (0,407539) (1,12999) (2,46516) To summarize, we saw a practical example that shows how K-means algorithm is used to cluster data with the help of Apache Spark. If you found this post useful, do check out this book Mastering Apache Spark 2.x - Second Edition to learn about the latest enhancements in Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.

0
1
6050

article-image-introduction-raspberry-pi-zero-w-wirelessv

Packt

01 Mar 2018

13 min read

Introduction to Raspberry Pi Zero W Wirelessv

Packt

01 Mar 2018

13 min read

0
0
3097

How-To Tutorials

article-image-4-must-know-levels-in-mongodb-security

Amey Varangaonkar

01 Mar 2018

8 min read

4 must-know levels in MongoDB security

Amey Varangaonkar

01 Mar 2018

8 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering MongoDB 3.x written by Alex Giamas. It presents the techniques and essential concepts needed to tackle even the trickiest problems when it comes to working and administering your MongoDB instance.[/box] Security is a multifaceted goal in a MongoDB cluster. In this article, we will examine different attack vectors and how we can protect MongoDB against them. 1. Authentication in MongoDB Authentication refers to verifying the identity of a client. This prevents impersonating someone else in order to gain access to our data. The simplest way to authenticate is using a username/password pair. This can be done via the shell in two ways: > db.auth( <username>, <password> ) Passing in a comma separated username and password will assume default values for the rest of the fields: > db.auth( { user: <username>, pwd: <password>, mechanism: <authentication mechanism>, digestPassword: <boolean> } ) If we pass a document object we can define more parameters than username/password. The (authentication) mechanism parameter can take several different values with the default being SCRAM-SHA-1. The parameter value MONGODB-CR is used for backwards compatibility with versions earlier than 3.0 MONGODB-X509 is used for TLS/SSL authentication. Users and internal replica set servers can be authenticated using SSL certificates, which are self-generated and signed, or come from a trusted third-party authority. This for the configuration file: security.clusterAuthMode / net.ssl.clusterFile Or like this on the command line: --clusterAuthMode and --sslClusterFile > mongod --replSet <name> --sslMode requireSSL --clusterAuthMode x509 --sslClusterFile <path to membership certificate and key PEM file> --sslPEMKeyFile <path to SSL certificate and key PEM file> --sslCAFile <path to root CA PEM file> MongoDB Enterprise Edition, the paid offering from MongoDB Inc., adds two more options for authentication. The first added option is GSSAPI (Kerberos). Kerberos is a mature and robust authentication system that can be used, among others, for Windows based Active Directory Deployments. The second added option is PLAIN (LDAP SASL). LDAP is just like Kerberos; a mature and robust authentication mechanism. The main consideration when using PLAIN authentication mechanism is that credentials are transmitted in plaintext over the wire. This means that we should secure the path between client and server via VPN or a TSL/SSL connection to avoid a man in the middle stealing our credentials. 2. Authorization in MongoDB After we have configured authentication to verify that users are who they claim they are when connecting to our MongoDB server, we need to configure the rights that each one of them will have in our database. This is the authorization aspect of permissions. MongoDB uses role-based access control to control permissions for different user classes. Every role has permissions to perform some actions on a resource. A resource can be a collection or a database or any collections or any databases. The command's format is: { db: <database>, collection: <collection> } If we specify "" (empty string) for either db or collection it means any db or collection. For example: { db: "mongo_books", collection: "" } This would apply our action in every collection in database mongo_books. Similar to the preceding, we can define: { db: "", collection: "" } We define this to apply our rule to all collections across all databases, except system collections of course. We can also apply rules across an entire cluster as follows: { resource: { cluster : true }, actions: [ "addShard" ] } The preceding example grants privileges for the addShard action (adding a new shard to our system) across the entire cluster. The cluster resource can only be used for actions that affect the entire cluster rather than a collection or database, as for example shutdown, replSetReconfig, appendOplogNote, resync, closeAllDatabases, and addShard. What follows is an extensive list of cluster specific actions and some of the most widely used actions. The list of most widely used actions are: find insert remove update bypassDocumentValidation viewRole / viewUser createRole / dropRole createUser / dropUser inprog killop replSetGetConfig / replSetConfigure / replSetStateChange / resync getShardMap / getShardVersion / listShards / moveChunk / removeShard / addShard dropDatabase / dropIndex / fsync / repairDatabase / shutDown serverStatus / top / validate Cluster-specific actions are: unlock authSchemaUpgrade cleanupOrphaned cpuProfiler inprog invalidateUserCache killop appendOplogNote replSetConfigure replSetGetConfig replSetGetStatus replSetHeartbeat replSetStateChange resync addShard flushRouterConfig getShardMap listShards removeShard shardingState applicationMessage closeAllDatabases connPoolSync fsync getParameter hostInfo logRotate setParameter shutdown touch connPoolStats cursorInfo diagLogging getCmdLineOpts getLog listDatabases netstat serverStatus top If this sounds too complicated that is because it is. The flexibility that MongoDB allows in configuring different actions on resources means that we need to study and understand the extensive lists as described previously. Thankfully, some of the most common actions and resources are bundled in built-in roles. We can use the built-in roles to establish the baseline of permissions that we will give to our users and then fine grain these based on the extensive list. User roles in MongoDB There are two different generic user roles that we can specify: read: A read-only role across non-system collections and the following system collections: system.indexes, system.js, and system.namespaces collections readWrite: A read and modify role across non-system collections and the system.js collection Database administration roles in MongoDB There are three database specific administration roles shown as follows: dbAdmin: The basic admin user role which can perform schema-related tasks, indexing, gathering statistics. A dbAdmin cannot perform user and role management. userAdmin: Create and modify roles and users. This is complementary to the dbAdmin role. dbOwner: Combining readWrite, dbAdmin, and userAdmin roles, this is the most powerful admin user role. Cluster administration roles in MongoDB These are the cluster wide administration roles available: hostManager: Monitor and manage servers in a cluster. clusterManager: Provides management and monitoring actions on the cluster. A user with this role can access the config and local databases, which are used in sharding and replication, respectively. clusterMonitor: Read-only access for monitoring tools provided by MongoDB such as MongoDB Cloud Manager and Ops Manager agent. clusterAdmin: Provides the greatest cluster-management access. This role combines the privileges granted by the clusterManager, clusterMonitor, and hostManager roles. Additionally, the role provides the dropDatabase action. Backup restore roles Role-based authorization roles can be defined in the backup restore granularity level as Well: backup: Provides privileges needed to back-up data. This role provides sufficient privileges to use the MongoDB Cloud Manager backup agent, Ops Manager backup agent, or to use mongodump. restore: Provides privileges needed to restore data with mongorestore without the --oplogReplay option or without system.profile collection data. Roles across all databases Similarly, here are the set of available roles across all databases: readAnyDatabase: Provides the same read-only permissions as read, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. readWriteAnyDatabase: Provides the same read and write permissions as readWrite, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. userAdminAnyDatabase: Provides the same access to user administration operations as userAdmin, except it applies to all but the local and config databases in the cluster. Since the userAdminAnyDatabase role allows users to grant any privilege to any user, including themselves, the role also indirectly provides superuser access. dbAdminAnyDatabase: Provides the same access to database administration operations as dbAdmin, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. Superuser Finally, these are the superuser roles available: root: Provides access to the operations and all the resources of the readWriteAnyDatabase, dbAdminAnyDatabase, userAdminAnyDatabase, clusterAdmin, restore, and backup combined. __internal: Similar to root user, any __internal user can perform any action against any object across the server. 3. Network level security Apart from MongoDB specific security measures, there are best practices established for network level security: Only allow communication between servers and only open the ports that are used for communicating between them. Always use TLS/SSL for communication between servers. This prevents man-inthe- middle attacks impersonating a client. Always use different sets of development, staging, and production environments and security credentials. Ideally, create different accounts for each environment and enable two-factor authentication in both staging and production environments. 4. Auditing security No matter how much we plan our security measures, a second or third pair of eyes from someone outside our organization can give a different view of our security measures and uncover problems that we may not have thought of or underestimated. Don't hesitate to involve security experts / white hat hackers to do penetration testing in your servers. Special cases Medical or financial applications require added levels of security for data privacy reasons. If we are building an application in the healthcare space, accessing users' personal identifiable information, we may need to get HIPAA certified. If we are building an application interacting with payments and managing cardholder information, we may need to become PCI/DSS compliant. The specifics of each certification are outside the scope of this book but it is important to know that MongoDB has use cases in these fields that fulfill the requirements and as such it can be the right tool with proper design beforehand. To sum up, in addition to the best practices listed above, developers and administrators must always use common sense so that security interferes only as much as needed with operational goals. If you found our article useful, make sure to check out this book Mastering MongoDB 3.x to master other MongoDB administration-related techniques and become a true MongoDB expert.

0
0
3202

article-image-getting-started-apache-mesos

Packt

28 Feb 2018

7 min read

Getting Started with Apache Mesos

Packt

28 Feb 2018

7 min read

1 Getting Started with Apache Mesos In this article by David Blomquist author of the book Apache Mesos Cookbook, we will In this chapter, we will provide an overview of the Mesos architecture and recipes for installing Mesos on Linux and Mac. The following are the recipes coverbeed covering in this chapter following topics: Installing Mesos on Ubuntu 16.04 from packages Installing Mesos on Ubuntu 14.04 from packages Installing Mesos on CentOS 7 and RHEL 7 from packages Introduction Apache Mesos is a cluster management software that can distribute the combined resources of many individual servers to applications through frameworks. Mesos is an open source software that is free to download and use in accordance with the Apache License 2.0. This book article will provide the reader with recipes for deploying and developing Apache Mesos and the Mesos frameworks. Mesos can run on Linux, Mac, and Windows. However, we recommend running Mesos on Linux for production deployments and on Mac and Windows for development purposes only. Mesos can be installed from TAR files, Git, source code, or from packages downloaded from repositories. We have chosen to cover only a few installation methods on select operating system versions in this bookarticle. The reasons for covering these specific operating systems and installation methods are as follows: The operating systems natively include a kernel that supports full resource isolation The operating systems are current as of this writing with long term support The operating systems and installation methods do not require workarounds or an excessive number of external repositories The installation methods use the latest stable version of Mesos, whether it is from packages or source code Mesosphere, founded by one of the original developers of Mesos, is a company that provides free open source packages as well as commercial support for Mesos. Mesosphere packages are well maintained and provide an easy way to install and run Mesos. You can run a production Mesos cluster using these packages if you do not require any customization of the build or install process. However, installing from source will allow you to customize the build and install process and enable and disable features. If you want a completely open source and customizable production cluster, we recommend you to install Mesos from source on Ubuntu 16.04 or Ubuntu 14.04. If you want a development environment on a Mac, building from source on OS X is the only way to go. The installation methods that we cover will all provide you with a good base for building out a Mesos development or production environment. We will guide you through the installations in the following sections, but first, you will need to plan for your Mesos deployment. For a Mesos development environment, you only need one host or node. The node can be a physical computer, a virtual machine, or a cloud instance. For a production cluster, we recommend at least three master nodes and as many slave nodes as you will need to support your application frameworks. You can think of the slave nodes as a pool of CPU, RAM, and storage that can be increased by simply adding more slave nodes. Mesos makes it very easy to add slave nodes to an existing cluster as your application requirements increase. At this point, you should know whether you will be building a Mesos development environment or a production cluster and you should have an idea of how many master and slave nodes you will need. The next sections will provide recipes for installing Mesos in the environment of your choice Installing Mesos on Ubuntu 16.04 from Packages In this recipe, we will be installing Mesos .deb packages from the Mesosphere repositories using apt. Getting ready You must be running a 64-bit version of the Ubuntu 16.04 operating system and it should be patched to the most current patch level using apt-get prior to installing the Mesos packages. How to do it… First, download and install the OpenPGP key for the Mesosphere packages: $ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF Now install the Mesosphere repository: $ DISTRO=$(lsb_release -is | tr '‘[:upper:]'’ '‘[:lower:]'’) $ CODENAME=$(lsb_release -cs) $ echo "“deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main"” | sudo tee /etc/apt/sources.list.d/mesosphere.list Update the apt-get package indexes: $ sudo apt-get update And finally, install Mesos and the included ZooKeeper binaries: $ sudo apt-get -y install mesos At this point, you can start Mesos to do some basic testing. To start the Mesos master and agent (slave) daemons, execute the following: $ sudo service mesos-master start $ sudo service mesos-slave start To validate the Mesos installation, open a browser and point it to http://:5050. Replace with the actual address of the host with the new Mesos installation How it works… The Mesosphere packages provide the software required to run Mesos.. Next, you will configure ZooKeeper, which is covered in Chapter 2. See also If you prefer to build and install Mesos on Ubuntu from source code, we will cover that in an upcoming section in this chapter. Installing Mesos on Ubuntu 14.04 from Packages In this recipe, we will be installing Mesos .deb packages from the Mesosphere repositories using apt. Getting ready You must be running a 64-bit version of the Ubuntu 14.04 operating system and it should be patched to the most current patch level using apt-get prior to installing the Mesos packages. How to do it…… First, download and install the OpenPGP key for the Mesosphere packages: $ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF Now install the Mesosphere repository: $ DISTRO=$(lsb_release -is | tr '‘[:upper:]'’ '‘[:lower:]'’) $ CODENAME=$(lsb_release -cs) $ echo "“deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main"” | sudo tee /etc/apt/sources.list.d/mesosphere.list Update the apt-get package indexes: $ sudo apt-get update And finally, install Mesos and the included ZooKeeper binaries: $ sudo apt-get -y install mesos At this point, you can either start Mesos to do some basic testing. To start the Mesos master and agent (slave) daemons, execute the following command: $ sudo service mesos-master start $ sudo service mesos-slave start To validate the Mesos installation, open a browser and point it to http://:5050. Replace with the actual address of the host with the new Mesos installation How it works… The Mesosphere packages provide the software required to run Mesos. Next, you will configure ZooKeeper, which is covered in Chapter 2., See also If you prefer to build and install Mesos on Ubuntu from source code, we will cover that in an upcoming section in this chapter. Installing Mesos on CentOS 7 and RHEL 7 from Packages In this recipe, we will be installing Mesos .rpm packages from the Mesosphere repositories using yum. Getting ready Your CentOS 7 or RHEL 7 operating system should be patched to the most current patch level using yum prior to installing the Mesosphere packages. How to do it… First, add the Mesosphere repository: $ sudo rpm -Uvh http://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-1.noarch.rpm And now, install Mesos and ZooKeeper: $ sudo yum -y install mesos mesosphere-zookeeper At this point, you can start Mesos to do some basic testing. To start the Mesos master and agent (slave) daemons, execute the following: $ sudo service mesos-master start $ sudo service mesos-slave start To validate the Mesos installation, open a browser and point it to http://:5050. Replace with the actual address of the host with the new Mesos installation How it works… The Mesosphere packages provide the software required to run Mesos. Next, you will configure ZooKeeper, which is covered in Chapter 2. See also If you prefer to build and install Mesos from source code on RHEL 7 or CentOS 7, you can find installation instructions for CentOS 7 on the mesos.apache.org website. We do not cover installing Mesos source code on RHEL7 or CentOS 7 in this book article due to dependencies that require the installation of packages from multiple third-party repositories. Summary In this article we have learned how to install Mesos on Ubuntu 16.04 from packages, how to install Mesos on Ubuntu 14.04 from packages and how to installing Mesos on CentOS 7 and RHEL 7 from packages. Resources for Article: Further resources on this subject: Apache Mesos Cookbook Mastering Mesos

0
0
2221

How-To Tutorials

Packt

28 Feb 2018

15 min read

Defining the business context

Packt

28 Feb 2018

15 min read

Introduction In this article by Ezra Schwartz the author of the book Experience Design for Beginners, will cover topics on how can designers help organizations form an experience strategy. It is a common mistake to believe that companies are formed to create products or services. The primary objective of any company is to be as profitable as possible. Profits are any funds left after all obligations of the company are paid off. Profits help attract new investors and fuel investments in new products--new products that will help make more profits --and the sequence repeats. Without profits, companies just shut down. The question how will our product/s make us profitable? is the generic survival challenge shared by all companies, regardless of its size, industry, or product. The business context of each company, however, may be very unique, based on its circumstances. That business context will influence its product experience strategy, and here is why: Suppose that you are the CEO of a small unknown company, which is similar to Pure Digital, maker of the Flip and that you want to take advantage of cheap memory cards to create a tapeless camcorder. Here are a couple of options for a product approach: Create a camcorder that has all the features of a tape-based camcorder, and a similar look and feel, except that no cassette-tape is need. The product will have a competitive edge--fewer moving parts will make it lighter, cheaper, and more durable, and transfer to computer will also be much simplified. However, the product will have to compete with the established giants in the camcorder market, such as Sony, Panasonic, or Canon. If your product begins to show signs of success, these other companies will release products similar to yours, well before your company could recoup its initial investments Create a product that is a complete departure from typical products. Your product will address all the frustrations that people have with current products, and be a compelling, highly competitive alternative. If it becomes successful, its distinctive look and feel will be associated with your company. By the time the competition release their own products, your product will dominate the market for this segment. However, the product will require coming up with a new design, which will unify the various experience features into a product solution that is attractive and profitable Which option will you choose? Both have their risks and opportunities, but the context may call for option two, which depends on a successful experience design, and indeed, thanks to its experience design of its Flip line of products, Pure Digital was able to popularize and dominate the pocket camcorders market and win over many customers from the tape camcorder market. Another question all companies face is "What's next?". The question may emerge as a result of changes in leadership, opportunities to implement new technologies, new ways to implement the existing technologies, decreased sales of its current product/s, customer dissatisfaction with existing products, or numerous other reasons. Again, it is a matter of each company's business context. Whatever the motivation, although the organization and its products are currently very successful, company leadership may feel that it is necessary for them to invest profits in either developing the next generation of their product/s or creating new ones. Another common aspect of most companies is limited funds and resources for future products. Short and long term priorities must be aligned: Pressures to increase spending: Investing in the future requires spending in the present. Long-term vision requires companies to make significant investments in product research and development, with no guarantee that hopes for future profits in the form of a competitive edge, larger market share will materialize Pressures to reduce spending: Investing in future reduces the funds necessary to compete in the present. Immediate budgetary constraints and competitive pressures require companies to focus on maintaining quarterly profitability, by investing in advertising and other methods, which can improve the performance of the existing profitable products. How well future and present priorities are aligned is a measure of multiple factors. Some would argue that the most important factor is a mutual trust between employees and management. Trust is established with transparency, good communication, fairness in compensation and treatment, and a belief in a shared vision for the present and the future. Mutual trust plays a critical role in flattening the hierarchical structures inherent to companies. Whether any experience design strategy, even a good one, can be a success, also depends on internal company dynamics of trust, because designers, whether employees or consultants, operate within the hierarchical organizational structure. <div class="packt_figure CDPAlignCenter CDPAlign"><img class=" image-border" src="> Companies are hierarchical. The preceding diagram reflects a couple of common characteristics: This is a hierarchical, top-down structure. The larger the company, the more layers separate senior decision-makers from the product. Some executives may not be familiar with critical issues with the product or may disagree about priorities, which sometimes leads to internal fragmentation. There is an unavoidable compartmentalization and specialization, as each group within the company must focus on their responsibilities for a specific aspect of the organization. Regardless of the company's size, this natural division of roles and responsibilities is often the culprit of dispute over issues of product vision and priorities. Experience design is often hierarchically nested under marketing, product management, or engineering department. In highly compartmentalized organizations, trust is sometimes an issue, and designers, who need the cooperation of all units, have a hard time aligning competing visions for the product. Some departments are highly influential, whereas the participation of other departments can be minimal. In such circumstances, it is very difficult to emerge with an experience strategy that satisfies everyone. Consequently, the end result is an unsatisfying shadow of the original vision. Moreover, it is not uncommon for projects to be canceled midstream, due to internal infighting. The following diagram shows the flatting effect a culture of mutual trust has on successful design. Experience strategists, with a mandate from leadership, can reach out to each of the groups within a company, synthesize the various inputs, identify internal gaps of vision and priority, and help all stakeholders embrace a unified vision forward. This unified vision is referred to as the voice of the business <p><img class="image-border" src="" /></p> Over the past couple of decades, a growing number of organizations recognized the value of integrated design by forming in-house experience design departments led by a senior designer who reports directly to the CEO. Celebrated examples in the automotive tech and manufacturing industries include BMW, Apple, and Herman-Miller. The results of an integrated design are expressed in the quality of their products, improved sales performance, and increased customer satisfaction. Although the trend points toward fully integrated design capabilities, there are still many organizations--that for reasons, such as size, budget, or lack of skilled resources--prefer to partner with design consultants, to guide their product experience strategy. Design consultants can be effective when given autonomy and active support from leadership. Business needs - research activities Experience strategists conduct a number of research activities during the first phase of their experience design project. The purpose of the research is two fold: Help designers understand the company's vision and objectives for the product, that is, what is at stake. Based on this context, they work with stakeholders to align product objectives and reach a shared understanding on the goals of design. Once an organizational alignment is achieved, use research insights to develop a product experience strategy, which is aligned with agreed company objectives. The included research activities are as follows: Stakeholder and subject-matter expert (SME) interview Documents review Competitive research Expert product reviews <div class="packt_figure"></div> Stakeholder and subject-matter expert interviewsStakeholders are typically senior executives who have a direct responsibility for, or influence on, the product. Stakeholders include product managers, who manage the planning and day-to-day activities associated with their product, and have a direct decision-making authority over its development. In projects that are important to the company, it is not uncommon for the executive leadership from the chief executive and down to be among the stakeholders due to their influence and authority to the direct overall product strategy. The purpose of stakeholder interviews is to gather and understand the perspective of each individual stakeholder and align the perspectives of all stakeholders around a unified vision around the scope, purpose, outcomes, opportunities and obstacles involved in undertaking a new product development project. Gaps among stakeholders on fundamental project objectives and priorities, will lead to serious trouble down the road. It is best to surfaces such deviations as early as possible, and help stakeholders reach a productive alignment. The purpose of subject-matter experts (SMEs) interviews is to balance the strategic high-level thinking provided by stakeholders, with detailed insights of experienced employees who are recognized for their deep domain expertise. Sales, customer service, and technical support employees have a wealth of operational knowledge of products and customers, which makes them invaluable when analyzing current processes and challenges. Prior to the interviews, the experience strategist prepares an interview guide. The purpose of the guide is to ensure the following<ol< All stakeholders can respond to the same questions All research topics are covered if interviews are conducted by different interviewers Interviews make the best use of stakeholders' valuable time Some of the questions in the guide are general and directed at all participants, others are more specific and focus on the stakeholders specific areas of responsibility. Similar guides are developed for SME interviews. In-person interviews are the best, because they take place at the onset of the project and provide a good opportunity to build rapport and trust between the designer and interviewee. After a formal introduction regarding the purpose of the interview and general questions regarding the person's role and professional experience, the person is asked for their personal assessment and opinions on various topics. Here is a sample of different topics: Objectives and obstacles Prioritized goals for the project What does success look like What kind of obstacles the project is facing, and suggestions to overcome them Competition Who are your top competitors Strength and weaknesses relative to the competition Product features and functionality Which features are missing Differentiating features Features to avoid The interviews are designed to last no more than an hour and are documented with notes and audio recordings, if possible. The answers are compiled and analyzed and the result is presented in a report. The report suggests a unified list of prioritized objectives, and highlights gaps and other risks that have been reported. The report is one of the inputs into the development of the overall product experience strategy. Product expert reviewsProduct expert reviews, sometimes referred to as heuristic evaluations, are professional assessments of a current product, which are performed by design experts for the purpose of identifying usability and user experience issues. The thinking behind the expert review technique is very practical. Experience designers have the expertise to assess the experience quality of a product in a systematic way, using a set of accepted <span class="KeyPACKT">heuristics</span>. A heuristic is a rule of thumb for assessing products. For example, the error prevention heuristic deals with how well the evaluated product prevents the user from making errors. The word heuristic often raises questions about its meaning, and the method has been criticized for its inherent weaknesses due to the following: Subjectivity of the evaluator Expertise and domain knowledge of the evaluator Cultural and demographic background of the evaluator These weaknesses increase the probability that the outcome of an expert evaluation will reflect the biases and preferences of the evaluator, resulting in potentially different conclusions about the same product. Still, expert evaluations, especially if conducted by two evaluators, and their aligned findings, have proven to be an effective tool for experience practitioners who need a fast and cost-effective assessment of a product, particularly digital interfaces. Jacob Nielsen developed the method in the early 1990s. Although there are other sets of heuristics, Nielsen's are probably the most known and commonly used. His initial set of heuristics was first published in his book Usability Engineering and is brought here verbatim, as there is no need for modification: Visibility of system status:The system should always keep users informed about what is going on, through appropriate feedback within reasonable time. Match between system and the real world:The system should speak the user's language, with words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order. User control and freedom:Users often choose system functions by mistake and will need a clearly marked "emergency exit" to leave the unwanted state without having to go through an extended dialogue. Support undo and redo. Consistency and standards: Users should not have to wonder whether different words, situations, or actions mean the same thing. Follow platform conventions. Error prevention: Even better than good error messages is a careful design which prevents a problem from occurring in the first place. Either eliminate error-prone conditions or check for them and present users with a confirmation option before they commit to the action. ecognition rather than recall: Minimize the user's memory load by making objects, actions, and options visible. The user should not have to remember information from one part of the dialogue to another. Instructions for use of the system should be visible or easily retrievable whenever appropriate. Flexibility and efficiency of use: Accelerators--unseen by the novice user--may often speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users. Allow users to tailor frequent actions. Aesthetic and minimalist design: Dialogues should not contain information which is irrelevant or rarely needed. Every extra unit of information in a dialogue competes with the relevant units of information and diminishes their relative visibility. Help users recognize, diagnose, and recover from errors: Error messages should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution. Help and documentatio: Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, focused on the user's task, list concrete steps to be carried out, and not be too large. Competitive research and analysis: Most companies operate in a competitive marketplace, and having a deep understanding of the competition is critical to the success and survival. Here are few of the questions that a competitive research helps addresses: How does a product or service compare to the competition? What are the strength and weaknesses of competing offerings? What alternatives and choices does the target audience have? Experience strategists use several methods to collect and analyze competitive information. From interviews with stakeholder and SMEs, they know who the direct competition is. In some product categories, such as automobiles and consumer products, companies can reverse-engineer competitive products and try to match or surpass their capabilities. Additionally, designers can develop extensive experience analysis of such competitive products, because they can have a first-hand experience with it. With some hi-tech products, however, some capabilities are cocooned within proprietary software or secret production processes. In these cases, designers can glean the capabilities from an indirect evidence of use. The Internet is a main source of competitive information, from the ability to have a direct access to a product online, to reading help manuals, user guides, bulletin boards, reviews, and analysis in trade publications. Occasionally, unauthorized photos or documents are leaked to the public domain, and they provide clues, sometimes real and sometimes bogus, about a secret upcoming product. Social media too is an important source of competitive data in the form of customers reviews on Yelp, Amazon, or Facebook. With the wealth of this information, a practical strategy to surpass the competition and delivering a better experience can be developed. For example, Uber has been a favorite car hailing service for a while. This service has also generated public controversy and had dissatisfied riders and drivers who are not happy with its policies, including its resistance for tips. By design, a tipping function is not available in the app, which is the primary transaction method between the rider, company and, driver. Research indicates, however, that tipping for the service is a common social norm and that most people tip because it makes them feel better. Not being able to tip places riders in an uncomfortable social setting and stirs negative emotions against Uber. The evidence of dissatisfaction can be easily collected from numerous web sources and from interviewing actual riders and drivers. For Uber competitors, such as Lyft and Curb, by making tipping an integrated part of their apps, provides an immediate competitive edge that improves the experience of both riders, who have an option to reward the driver for their good service, and drivers, who benefit from an increased income. This, and additional improvements over the inferior Uber experience, become a part of an overall experience strategy that is focused on improving the likelihood that riders and drivers will dump Uber in their favor, Summary All the activities described so far were meant to enable designers to fuse business and audience perspectives, and emerge with a product strategy that is focused on user experience as means to product success. Resources for Article: Further resources on this subject: Exploring Experience Design

0
0
4316

How-To Tutorials

article-image-6-index-types-in-postgresql-10-you-should-know

Sugandha Lahoti

28 Feb 2018

13 min read

6 index types in PostgreSQL 10 you should know

Sugandha Lahoti

28 Feb 2018

13 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering PostgreSQL 10 written by Hans-Jürgen Schönig. This book will help you master the capabilities of PostgreSQL 10 to efficiently manage and maintain your database.[/box] In today’s post, we will learn about the different index types available for sorting in PostgreSQL and also understand how they function. What are index types and why you need them Data types can be sorted in a useful way. Just imagine a polygon. How would you sort these objects in a useful way? Sure, you can sort by the area covered, its length or so, but doing this won't allow you to actually find them using a geometric search. The solution to the problem is to provide more than just one index type. Each index will serve a special purpose and do exactly what is needed. The following six index types are available (as of PostgreSQL 10.0): test=# SELECT * FROM pg_am; amname | amhandler | amtype ---------+-------------+-------- btree | bthandler | i hash | hashhandler | i GiST | GiSThandler | i Gin | ginhandler | i spGiST | spghandler | i brin | brinhandler | i (6 rows) A closer look at the 6 index types in PostgreSQL 10 The following sections will outline the purpose of each index type available in PostgreSQL. Note that there are some extensions that can be used on top of what you can see here. Additional index types available on the web are rum, vodka, and in the future, cognac. Hash indexes Hash indexes have been around for many years. The idea is to hash the input value and store it for later lookups. Having hash indexes actually makes sense. However, before PostgreSQL 10.0, it was not advised to use hash indexes because PostgreSQL had no WAL support for them. In PostgreSQL 10.0, this has changed. Hash indexes are now fully logged and are therefore ready for replication and are considered to be a 100% crash safe. Hash indexes are generally a bit larger than b-tree indexes. Suppose you want to index 4 million integer values. A btree will need around 90 MB of storage to do this. A hash index will need around 125 MB on disk. The assumption made by many people that a hash is super small on the disk is therefore, in many cases, just wrong. GiST indexes Generalized Search Tree (GiST) indexes are highly important index types because they are used for a variety of different things. GiST indexes can be used to implement R-tree behavior and it is even possible to act as b-tree. However, abusing GiST for b-tree indexes is not recommended. Typical use cases for GiST are as follows: Range types Geometric indexes (for example, used by the highly popular PostGIS extension) Fuzzy searching Understanding how GiST works To many people, GiST is still a black box. We will now discuss how GiST works internally. Consider the following diagram: Source: http://leopard.in.ua/assets/images/postgresql/pg_indexes/pg_indexes2.jpg Take a look at the tree. You will see that R1 and R2 are on top. R1 and R2 are the bounding boxes containing everything else. R3, R4, and R5 are contained by R1. R8, R9, and R10 are contained by R3, and so on. A GiST index is therefore hierarchically organized. What you can see in the diagram is that some operations, which are not available in b-trees are supported. Some of those operations are overlaps, left of, right of, and so on. The layout of a GiST tree is ideal for geometric indexing. Extending GiST Of course, it is also possible to come up with your own operator classes. The following strategies are supported: Operation Strategy number Strictly left of 1 Does not extend to right of 2 Overlaps 3 Does not extend to left of 4 Strictly right of 5 Same 6 Contains 7 Contained by 8 Does not extend above 9 Strictly below 10 Strictly above 11 Does not extend below 12 If you want to write operator classes for GiST, a couple of support functions have to be provided. In the case of a b-tree, there is only the same function - GiST indexes provide a lot more: Function Description Support function number consistent The functions determine whether a key satisfies the query qualifier. Internally, strategies are looked up and checked. 1 union Calculate the union of a set of keys. In case of numeric values, simply the upper and lower values or a range are computed. It is especially important to geometries. 2 compress Compute a compressed representation of a key or value. 3 decompress This is the counterpart of the compress function. 4 penalty During insertion, the cost of inserting into the tree will be calculated. The cost determines where the new entry will go inside the tree. Therefore, a good penalty function is key to the good overall performance of the index. 5 picksplit Determines where to move entries in case of a page split. Some entries have to stay on the old page while others will go to the new page being created. Having a good picksplit function is essential to a good index performance. 6 equal The equal function is similar to the same function you have already seen in b-trees. 7 distance Calculates the distance (a number) between a key and the query value. The distance function is optional and is needed in case KNN search is supported. 8 fetch Determine the original representation of a compressed key. This function is needed to handle index only scans as supported by the recent version of PostgreSQL. 9 Implementing operator classes for GiST indexes is usually done in C. If you are interested in a good example, I advise you to check out the btree_GiST module in the contrib directory. It shows how to index standard data types using GiST and is a good source of information as well as inspiration. GIN indexes Generalized inverted (GIN) indexes are a good way to index text. Suppose you want to index a million text documents. A certain word may occur millions of times. In a normal b- tree, this would mean that the key is stored millions of times. Not so in a GIN. Each key (or word) is stored once and assigned to a document list. Keys are organized in a standard b- tree. Each entry will have a document list pointing to all entries in the table having the same key. A GIN index is very small and compact. However, it lacks an important feature found in the b-trees-sorted data. In a GIN, the list of item pointers associated with a certain key is sorted by the position of the row in the table and not by some arbitrary criteria. Extending GIN Just like any other index, GIN can be extended. The following strategies are available: Operation Strategy number Overlap 1 Contains 2 Is contained by 3 Equal 4 On top of this, the following support functions are available: Function Description Support function number compare The compare function is similar to the same function you have seen in b-trees. If two keys are compared, it returns -1 (lower), 0 (equal), or 1 (higher). 1 extractValue Extract keys from a value to be indexed. A value can have many keys. For example, a text value might consist of more than one word. 2 extractQuery Extract keys from a query condition. 3 consistent Check whether a value matches a query condition. 4 comparePartial Compare a partial key from a query and a key from the index. Returns -1, 0, or 1 (similar to the same function supported by b-trees). 5 triConsistent Determine whether a value matches a query condition (ternary variant). It is optional if the consistent function is present. 6 If you are looking for a good example of how to extend GIN, consider looking at the btree_gin module in the PostgreSQL contrib directory. It is a valuable source of information and a good way to start your own implementation. SP-GiST indexes Space partitioned GiST (SP-GiST) has mainly been designed for in-memory use. The reason for this is an SP-GiST stored on disk needs a fairly high number of disk hits to function. Disk hits are way more expensive than just following a couple of pointers in RAM. The beauty is that SP-GiST can be used to implement various types of trees such as quad- trees, k-d trees, and radix trees (tries). The following strategies are provided: Operation Strategy number Strictly left of 1 Strictly right of 5 Same 6 Contained by 8 Strictly below 10 Strictly above 11 To write your own operator classes for SP-GiST, a couple of functions have to be provided: Function Description Support function number config Provides information about the operator class in use 1 choose Figures out how to insert a new value into an inner tuple 2 picksplit Figures out how to partition/split a set of values 3 inner_consistent Determine which subpartitions need to be searched for a query 4 leaf_consistent Determine whether key satisfies the query qualifier 5 BRIN indexes Block range indexes (BRIN) are of great practical use. All indexes discussed until now need quite a lot of disk space. Although a lot of work has gone into shrinking GIN indexes and the like, they still need quite a lot because an index pointer is needed for each entry. So, if there are 10 million entries, there will be 10 million index pointers. Space is the main concern addressed by the BRIN indexes. A BRIN index does not keep an index entry for each tuple but will store the minimum and the maximum value of 128 (default) blocks of data (1 MB). The index is therefore very small but lossy. Scanning the index will return more data than we asked for. PostgreSQL has to filter out these additional rows in a later step. The following example demonstrates how small a BRIN index really is: test=# CREATE INDEX idx_brin ON t_test USING brin(id); CREATE INDEX test=# di+ idx_brin List of relations Schema | Name | Type | Owner | Table | Size --------+----------+-------+-------+--------+-------+------------- public | idx_brin | index | hs | t_test | 48 KB (1 row) In my example, the BRIN index is 2,000 times smaller than a standard b-tree. The question naturally arising now is, why don't we always use BRIN indexes? To answer this kind of question, it is important to reflect on the layout of BRIN; the minimum and maximum value for 1 MB are stored. If the data is sorted (high correlation), BRIN is pretty efficient because we can fetch 1 MB of data, scan it, and we are done. However, what if the data is shuffled? In this case, BRIN won't be able to exclude chunks of data anymore because it is very likely that something close to the overall high and the overall low is within 1 MB of data. Therefore, BRIN is mostly made for highly correlated data. In reality, correlated data is quite likely in data warehousing applications. Often, data is loaded every day and therefore dates can be highly correlated. Extending BRIN indexes BRIN supports the same strategies as a b-tree and therefore needs the same set of operators. The code can be reused nicely: Operation Strategy number Less than 1 Less than or equal 2 Equal 3 Greater than or equal 4 Greater than 5 The support functions needed by BRIN are as follows: Function Description Support function number opcInfo Provide internal information about the indexed columns 1 add_value Add an entry to an existing summary tuple 2 consistent Check whether a value matches a condition 3 union Calculate the union of two summary entries (minimum/maximum values) 4 Adding additional indexes Since PostgreSQL 9.6, there has been an easy way to deploy entirely new index types as extensions. This is pretty cool because if those index types provided by PostgreSQL are not enough, it is possible to add additional ones serving precisely your purpose. The instruction to do this is CREATE ACCESS METHOD: test=# h CREATE ACCESS METHOD Command: CREATE ACCESS METHOD Description: define a new access method Syntax: CREATE ACCESS METHOD name TYPE access_method_type HANDLER handler_function Don't worry too much about this command—just in case you ever deploy your own index type, it will come as a ready-to-use extension. One of these extensions implements bloom filters. Bloom filters are probabilistic data structures. They sometimes return too many rows but never too few. Therefore, a bloom filter is a good method to pre-filter data. How does it work? A bloom filter is defined on a couple of columns. A bitmask is calculated based on the input values, which is then compared to your query. The upside of a bloom filter is that you can index as many columns as you want. The downside is that the entire bloom filter has to be read. Of course, the bloom filter is smaller than the underlying data and so it is, in many cases, very beneficial. To use bloom filters, just activate the extension, which is a part of the PostgreSQL contrib package: test=# CREATE EXTENSION bloom; CREATE EXTENSION As stated previously, the idea behind a bloom filter is that it allows you to index as many columns as you want. In many real-world applications, the challenge is to index many columns without knowing which combinations the user will actually need at runtime. In the case of a large table, it is totally impossible to create standard b-tree indexes on, say, 80 fields or more. A bloom filter might be an alternative in this case: test=# CREATE TABLE t_bloom (x1 int, x2 int, x3 int, x4 int, x5 int, x6 int, x7 int); CREATE TABLE Creating the index is easy: test=# CREATE INDEX idx_bloom ON t_bloom USING bloom(x1, x2, x3, x4, x5, x6, x7); CREATE INDEX If sequential scans are turned off, the index can be seen in action: test=# SET enable_seqscan TO off; SET test=# explain SELECT * FROM t_bloom WHERE x5 = 9 AND x3 = 7; QUERY PLAN ------------------------------------------------------------------------- Bitmap Heap Scan on t_bloom (cost=18.50..22.52 rows=1 width=28) Recheck Cond: ((x3 = 7) AND (x5 = 9)) -> Bitmap Index Scan on idx_bloom (cost=0.00..18.50 rows=1 width=0) Index Cond: ((x3 = 7) AND (x5 = 9)) Note that I have queried a combination of random columns; they are not related to the actual order in the index. The bloom filter will still be beneficial. If you are interested in bloom filters, consider checking out the website: https://en.wikipedia.org/wiki/Bloom_filter. We learnt how to use the indexing features in PostgreSQL and fine-tune the performance of our queries. If you liked our article, check out the book Mastering PostgreSQL 10 to implement advanced administrative tasks such as server maintenance and monitoring, replication, recovery, high availability, etc in PostgreSQL 10.

0
0
10552

article-image-perform-regression-analysis-using-sas

Gebin George

27 Feb 2018

7 min read

How to perform regression analysis using SAS

Gebin George

27 Feb 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from the book, Big Data Analysis with SAS written by David Pope. This book will help you leverage the power of SAS for data management, analysis and reporting. It contains practical use-cases and real-world examples on predictive modelling, forecasting, optimizing, and reporting your Big Data analysis using SAS.[/box] Today, we will perform regression analysis using SAS in a step-by-step manner with a practical use-case. Regression analysis is one of the earliest predictive techniques most people learn because it can be applied across a wide variety of problems dealing with data that is related in linear and non-linear ways. Linear data is one of the easier use cases, and as such PROC REG is a well-known and often-used procedure to help predict likely outcomes before they happen. The REG procedure provides extensive capabilities for fitting linear regression models that involve individual numeric independent variables. Many other procedures can also fit regression models, but they focus on more specialized forms of regression, such as robust regression, generalized linear regression, nonlinear regression, nonparametric regression, quantile regression, regression modeling of survey data, regression modeling of survival data, and regression modeling of transformed variables. The SAS/STAT procedures that can fit regression models include the ADAPTIVEREG, CATMOD, GAM, GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOESS, LOGISTIC, MIXED, NLIN, NLMIXED, ORTHOREG, PHREG, PLS, PROBIT, QUANTREG, QUANTSELECT, REG, ROBUSTREG, RSREG, SURVEYLOGISTIC, SURVEYPHREG, SURVEYREG, TPSPLINE, and TRANSREG procedures. Several procedures in SAS/ETS software also fit regression models. SAS/STAT14.2 / SAS/STAT User's Guide - Introduction to Regression Procedures - Overview: Regression Procedures (http://documentation.sas.com/?cdcId=statcdccdcVersion=14.2 docsetId=statugdocsetTarget=statug_introreg_sect001.htmlocale=enshowBanner=yes). Regression analysis attempts to model the relationship between a response or output variable and a set of input variables. The response is considered the target variable or the variable that one is trying to predict, while the rest of the input variables make up parameters used as input into the algorithm. They are used to derive the predicted value for the response variable. PROC REG One of the easiest ways to determine if regression analysis is applicable to helping you answer a question is if the type of question being asked has only two answers. For example, should a bank lend an applicant money? Yes or no? This is known as a binary response, and as such, regression analysis can be applied to help determine the answer. In the following example, the reader will use the SASHELP.BASEBALL dataset to create a regression model to predict the value of a baseball player's salary. The SASHELP.BASEBALL dataset contains salary and performance information for Major League. Baseball players who played at least one game in both the 1986 and 1987 seasons, excluding pitchers. The salaries (Sports Illustrated, April 20, 1987) are for the 1987 season and the performance measures are from 1986 (Collier Books, The 1987 Baseball Encyclopedia Update). SAS/STAT® 14.2 / SAS/STAT User's Guide - Example 99: Modeling Salaries of Major League Baseball Players (http://documentation.sas.com/ ?cdcId= statcdc cdcVersion= 14.2 docsetId=statugdocsetTarget= statug_ reg_ examples01.htmlocale= en showBanner= yes). Let's first use PROC UNIVARIATE to learn something about this baseball data by submitting the following code: proc univariate data=sashelp.baseball; quit; While reviewing the results of the output, the reader will notice that the variance associated with logSalary, 0.79066, is much less than the variance associated with the actual target variable Salary, 203508. In this case, it makes better sense to attempt to predict the logSalary value of a player instead of Salary. Write the following code in a SAS Studio program section and submit it: proc reg data=sashelp.baseball; id name team league; model logSalary = nAtBat nHits nHome nRuns nRBI YrMajor CrAtBat CrHits CrHome CrRuns CrRbi; Quit; Notice that there are 59 observations as specified in the first output table with at least one of the input variables with missing values; as such those are not used in the development of the regression model. The Root Mean Squared Error (RMSE) and R-square are statistics that typically inform the analyst how good the model is in predicting the target. These range from 0 to 1.0 with higher values typically indicating a better model. The higher the Rsquared values typically indicate a better performing model but sometimes conditions or the data used to train the model over-fit and don't represent the true value of the prediction power of that particular model. Over-fitting can happen when an analyst doesn't have enough real-life data and chooses data or a sample of data that over-presents the target event, and therefore it will produce a poor performing model when using real-world data as input. Since several of the input values appear to have little predictive power on the target, an analyst may decide to drop these variables, thereby reducing the need for that information to make a decent prediction. In this case, it appears we only need to use four input variables. YrMajor, nHits, nRuns, and nAtBat. Modify the code as follows and submit it again: proc reg data=sashelp.baseball; id name team league; model logSalary = YrMajor nHits nRuns nAtBat; Quit; The p-value associated with each of the input variables provides the analyst with an insight into which variables have the biggest impact on helping to predict the target variable. In this case, the smaller the value, the higher the predictive value of the input variable. Both the RMSE and R-square values for this second model are slightly lower than the original. However, the adjusted R-square value is slightly higher. In this case, an analyst may chose to use the second model since it requires much less data and provides basically the same predictive power. Prior to accepting any model, an analyst should determine whether there are a few observations that may be over-influencing the results by investigating the influence and fit diagnostics. The default output from PROC REG provides this type of visual insight: The top-right corner plot, showing the externally studentized residuals (RStudent) by leverage values, shows that there are a few observations with high leverage that may be overly influencing the fit produced. In order to investigate this further, we will add a plots statement to our PROC REG to produce a labeled version of this plot. Type the following code in a SAS Studio program section and submit: proc reg data=sashelp.baseball plots(only label)=(RStudentByLeverage); id name team league; model logSalary = YrMajor nHits nRuns nAtBat; Quit; Sure enough, there are three to five individuals whose input variables may have excessive influence on fitting this model. Let's remove those points and see if the model improves. Type this code in a SAS Studio program section and submit it: proc reg data=sashelp.baseball plots=(residuals(smooth)); where name NOT IN ("Mattingly, Don", "Henderson, Rickey", "Boggs, Wade", "Davis, Eric", "Rose, Pete"); id name team league; model logSalary = YrMajor nHits nRuns nAtBat; Quit; This change, in itself, has not improved the model but actually made the model worse as can be seen by the R-square, 0.5592. However, the plots residuals(smooth) option gives some insights as it pertains to YrMajor; players at the beginning and the end of their careers tend to be paid less compared to others, as can be seen in Figure 4.12: In order to address this lack of fit, an analyst can use polynomials of degree two for this variable, YrMajor. Type the following code in a SAS Studio program section and submit it: data work.baseball; set sashelp.baseball; where name NOT IN ("Mattingly, Don", "Henderson, Rickey", "Boggs, Wade", "Davis, Eric", "Rose, Pete"); YrMajor2 = YrMajor*YrMajor; run; proc reg data=work.baseball; id name team league; model logSalary = YrMajor YrMajor2 nHits nRuns nAtBat; Quit; After removing some outliers and adjusting for the YrMajor variable, the model's predictive power has improved significantly as can be seen in the much improved R-square value of 0.7149. We saw an effective way of performing regression analysis using SAS platform. If you found our post useful, do check out this book Big Data Analysis with SAS to understand other data analysis models and perform them practically using SAS.

0
0
6482

article-image-getting-started-with-the-confluent-platform-apache-kafka-for-enterprise

Amarabha Banerjee

27 Feb 2018

9 min read

Getting started with the Confluent Platform: Apache Kafka for enterprise

Amarabha Banerjee

27 Feb 2018

9 min read

0
0
8049

article-image-implementing-apache-spark-mllib-naive-bayes-to-classify-digital-breath-test-data-for-drunk-driving

Savia Lobo

27 Feb 2018

13 min read

Implementing Apache Spark MLlib Naive Bayes to classify digital breath test data for drunk driving

Savia Lobo

27 Feb 2018

13 min read

0
0
3504

article-image-how-sql-server-handles-data-under-the-hood

Sunith Shetty

27 Feb 2018

11 min read

How SQL Server handles data under the hood

Sunith Shetty

27 Feb 2018

11 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Marek Chmel and Vladimír Mužný titled SQL Server 2017 Administrator's Guide. In this book, you will learn the required skills needed to successfully create, design, and deploy database using SQL Server 2017.[/box] Today, we will explore how SQL Server handles data as it is of utmost importance to get an understanding of what, when, and why data should be backed. Data structures and transaction logging We can think about a database as of physical database structure consisting of tables and indexes. However, this is just a human point of view. From the SQL Server's perspective, a database is a set of precisely structured files described in a form of metadata also saved in database structures. A conceptual imagination of how every database works is very helpful when the database has to be backed up correctly. How data is stored Every database on SQL Server must have at least two files: The primary data file with the usual suffix, mdf The transaction log file with the usual suffix, ldf For lots of databases, this minimal set of files is not enough. When the database contains big amounts of data such as historical tables, or the database has big data contention such as production tracking systems, it's good practise to design more data files. Another situation when a basic set of files is not sufficient can arise when documents or pictures would be saved along with relational data. However, SQL Server still is able to store all of our data in the basic file set, but it can lead to a performance bottlenecks and management issues. That's why we need to know all possible storage types useful for different scenarios of deployment. A complete structure of files is depicted in the following image: Database A relational database is defined as a complex data type consisting of tables with a given amount of columns, and each column has its domain that is actually a data type (such as an integer or a date) optionally complemented by some constraints. From SQL Server's perspective, the database is a record written in metadata and containing the name of the database, properties of the database, and names and locations of all files or folders representing storage for the database. This is the same for user databases as well as for system databases. System databases are created automatically during SQL Server installation and are crucial for correct running of SQL Server. We know five system databases. Database master Database master is crucial for the correct running of SQL Server service. In this database is stored data about logins, all databases and their files, instance configurations, linked servers, and so on. SQL Server finds this database at startup via two startup parameters, -d and -l, followed by paths to mdf and ldf files. These parameters are very important in situations when the administrator wants to move the master's files to a different location. Changing their values is possible in the SQL Server Configuration Manager in the SQL Server service Properties dialog on the tab called startup parameters. Database msdb The database msdb serves as the SQL Server Agent service, Database Mail, and Service Broker. In this database are stored job definitions, operators, and other objects needed for administration automation. This database also stores some logs such as backup and restore events of each database. If this database is corrupted or missing, SQL Server Agent cannot start. Database model Database model can be understood as a template for every new database while it is created. During a database creation (see the CREATE DATABASE statement on MSDN), files are created on defined paths and all objects, data and properties of database model are created, copied, and set into the new database during its creation. This database must always exist on the instance, because when it's corrupted, database tempdb can be created at instance start up! Database tempdb Even if database tempdb seems to be a regular database like many others, it plays a very special role in every SQL Server instance. This database is used by SQL Server itself as well as by developers to save temporary data such as table variables or static cursors. As this database is intended for a short lifespan (temporary data only, which can be stored during execution of stored procedure or until session is disconnected), SQL Server clears this database by truncating all data from it or by dropping and recreating this database every time when it's started. As the tempdb database will never contain durable data, it has some special internal behavior and it's the reason why accessing data in this database is several times faster than accessing durable data in other databases. If this database is corrupted, restart SQL Server. Database resourcedb The resourcedb is fifth in our enumeration and consists of definitions for all system objects of SQL Server, for example, sys.objects. This database is hidden and we don't need to care about it that much. It is not configurable and we don't use regular backup strategies for it. It is always placed in the installation path of SQL Server (to the binn directory) and it's backed up within the filesystem backup. In case of an accident, it is recovered as a part of the filesystem as well. Filegroup Filegroup is an organizational metadata object containing one or more data files. Filegroup does not have its own representation in the filesystem--it's just a group of files. When any database is created, a filegroup called primary is always created. This primary filegroup always contains the primary data file. Filegroups can be divided into the following: Row storage filegroups: These filegroup can contain data files (mdf or ndf). Filestream filegroups: This kind of filegroups can contain not files but folders to store binary data. In-memory filegroup: Only one instance of this kind of filegroup can be created in a database. Internally, it is a special case of filestream filegroup and it's used by SQL Server to persist data from in-memory tables. Every filegroup has three simple properties: Name: This is a descriptive name of the filegroup. The name must fulfill the naming convention criteria. Default: In a set of filegroups of the same type, one of these filegroups has this option set to on. This means that when a new table or index is created without explicitly specified to which filegroup it has to store data in, the default filegroup is used. By default, the primary filegroup is the default one. Read-only: Every filegroup, except the primary filegroup, could be set to read- only. Let's say that a filegroup is created for last year's history. When data is moved from the current period to tables created in this historical filegroup, the filegroup could be set as read-only, and later the filegroup cannot be backed up again and again. It is a very good approach to divide the database into smaller parts-- filegroups with more files. It helps in distributing data across more physical storage and also makes the database more manageable; backups can be done part by part in shorter times, which better fit into a service window. Data files Every database must have at least one data file called primary data file. This file is always bound to the primary filegroup. In this file is all the metadata of the database, such as structure descriptions (could be seen through views such as sys.objects, sys.columns, and others), users, and so on. If the database does not have other data files (in the same or other filegroups), all user data is also stored in this file, but this approach is good enough just for smaller databases. Considering how the volume of data in the database grows over time, it is a good practice to add more data files. These files are called secondary data files. Secondary data files are optional and contain user data only. Both types of data files have the same internal structure. Every file is divided into 8 KB small parts called data pages. SQL Server maintains several types of data pages such as data, data pages, index pages, index allocation maps (IAM) pages to locate data pages of tables or indexes, global allocation map (GAM) and shared global allocation maps (SGAM) pages to address objects in the database, and so on. Regardless of the type of a certain data page, SQL Server uses a data page as the smallest unit of I/O operations between hard disk and memory. Let's describe some common properties: A data page never contains data of several objects Data pages don't know each other (and that's why SQL Server uses IAMs to allocate all pages of an object) Data pages don't have any special physical ordering A data row must always fit in size to a data page These properties could seem to be useless but we have to keep in mind that when we know these properties, we can better optimize and manage our databases. Did you know that a data page is the smallest storage unit that can be restored from backup? As a data page is quite a small storage unit, SQL Server groups data pages into bigger logical units called extents. An extent is a logical allocation unit containing eight coherent data pages. When SQL Server requests data from disk, extents are read into memory. This is the reason why 64 KB NTFS clusters are recommended to format disk volumes for data files. Extents could be uniform or mixed. Uniform extent is a kind of extent containing data pages belonging to one object only; on the other hand, a mixed extent contains data pages of several objects. Transaction log When SQL Server processes any transaction, it works in a way called two-phase commit. When a client starts a transaction by sending a single DML request or by calling the BEGIN TRAN command, SQL Server requests data pages from disk to memory called buffer cache and makes the requested changes in these data pages in memory. When the DML request is fulfilled or the COMMIT command comes from the client, the first phase of the commit is finished, but data pages in memory differ from their original versions in a data file on disk. The data page in memory is in a state called dirty. When a transaction runs, a transaction log file is used by SQL Server for a very detailed chronological description of every single action done during the transaction. This description is called write-ahead-logging, shortly WAL, and is one of the oldest processes known on SQL Server. The second phase of the commit usually does not depend on the client's request and is an internal process called checkpoint. Checkpoint is a periodical action that: searches for dirty pages in buffer cache, saves dirty pages to their original data file location, marks these data pages as clean (or drops them out of memory to free memory space), marks the transaction as checkpoint or inactive in the transaction log. Write-ahead-logging is needed for SQL Server during recovery process. Recovery process is started on every database every time SQL Server service starts. When SQL Server service stops, some pages could remain in a dirty state and they are lost from memory. This can lead to two possible situations: The transaction is completely described in the transaction log, the new content of the data page is lost from memory, and data pages are not changed in the data file The transaction was not completed at the moment SQL Server stopped, so the transaction cannot be completely described in the transaction log as well, data pages in memory were not in a stable state (because the transaction was not finished and SQL Server cannot know if COMMIT or ROLLBACK will occur), and the original version of data pages in data files is intact SQL Server decides these two situations when it's starting. If a transaction is complete in the transaction log but was not marked as checkpoint, SQL Server executes this transaction again with both phases of COMMIT. If the transaction was not complete in the transaction log when SQL Server stopped, SQL Server will never know what was the user's intention with the transaction and the incomplete transaction is erased from the transaction log as if it had never started. The aforementioned described recovery process ensures that every database is in the last known consistent state after SQL Server's startup. It's crucial for DBAs to understand write-ahead-logging when planning a backup strategy because when restoring the database, the administrator has to recognize if it's time to run the recovery process or not. To summarize, we introduced internal data handling as it is important not only during performance backups and restores but also for optimizing a database. If you are interested to know more about how to backup, recover and secure SQL Server, do checkout this book SQL Server 2017 Administrator's Guide.

0
0
5137

Introduction to Raspberry Pi Zero W Wireless

Internationalization and localization

How to compute Discrete Fourier Transform (DFT) using SciPy

How to use MapReduce with Mongo shell

Preparing the Spring Web Development Environment

Implementing Apache Spark K-Means Clustering method on digital breath test data for road safety

Introduction to Raspberry Pi Zero W Wirelessv

4 must-know levels in MongoDB security

Getting Started with Apache Mesos

Defining the business context

Trending Topics

6 index types in PostgreSQL 10 you should know

How to perform regression analysis using SAS

Getting started with the Confluent Platform: Apache Kafka for enterprise

Implementing Apache Spark MLlib Naive Bayes to classify digital breath test data for drunk driving

How SQL Server handles data under the hood