
How-To Tutorials - Data

1204 Articles

How to set up a Deep Learning System on Amazon Web Services (AWS)

Gebin George
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book Deep Learning Essentials written by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book covers popular Python libraries such as TensorFlow, Keras, and more, along with tips to train, deploy, and optimize deep learning models in the best possible manner.[/box]

Today, we will learn two different methods of setting up a deep learning system using Amazon Web Services (AWS).

Setup from scratch

We will illustrate how to set up a deep learning environment on an AWS EC2 GPU instance g2.2xlarge running Ubuntu Server 16.04 LTS. For this example, we will use a pre-baked Amazon Machine Image (AMI) which already has a number of software packages installed, making it easier to set up an end-to-end deep learning system. We will use the publicly available AMI image ami-b03ffedf, which has the following packages pre-installed:

CUDA 8.0
Anaconda 4.2.0 with Python 3
Keras / Theano

1. The first step to setting up the system is to set up an AWS account and spin up a new EC2 GPU instance using the AWS web console (http://console.aws.amazon.com/), as shown in figure Choose EC2 AMI.

2. We pick a g2.2xlarge instance type from the next page, as shown in figure Choose instance type.

3. After adding 30 GB of storage as shown in figure Choose storage, we now launch the instance and assign an EC2 key pair so that we can ssh and log in to the box using the provided key pair file.

4. Once the EC2 box is launched, the next step is to install the relevant software packages. To ensure proper GPU utilization, it is important to install the graphics drivers first. We will upgrade and install the NVIDIA drivers as follows:

$ sudo add-apt-repository ppa:graphics-drivers/ppa -y
$ sudo apt-get update
$ sudo apt-get install -y nvidia-375 nvidia-settings

While the NVIDIA driver ensures that the host GPU can be utilized by any deep learning application, it does not provide an easy interface for application developers to program the device. Various software libraries exist today that help achieve this task reliably; Open Computing Language (OpenCL) and CUDA are the most commonly used in industry. In this book, we use CUDA as the application programming interface for accessing the NVIDIA graphics drivers. To install the CUDA driver, we first SSH into the EC2 instance, download CUDA 8.0 to our $HOME folder, and install it from there:

$ wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
$ sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
$ sudo apt-get update
$ sudo apt-get install -y cuda nvidia-cuda-toolkit

Once the installation is finished, you can run the following command to validate the installation:

$ nvidia-smi

Now your EC2 box is fully configured for deep learning development. However, for someone who is not very familiar with deep learning implementation details, building a deep learning system from scratch can be a daunting task. To ease this development, a number of advanced deep learning software frameworks exist, such as Keras and Theano.
Both of these frameworks are based on a Python development environment, hence we first install a Python distribution on the box, such as Anaconda:

$ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
$ bash Anaconda3-4.2.0-Linux-x86_64.sh

Finally, Keras and Theano are installed using Python's package manager pip:

$ pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
$ pip install keras

Once the pip installation has completed successfully, the box is fully set up for deep learning development.

Setup using Docker

The previous section describes getting started from scratch, which can be tricky sometimes given continuous changes to software packages and changing links on the web. One way to avoid dependence on links is to use container technology like Docker. In this chapter, we will use the official NVIDIA-Docker image that comes pre-packaged with all the necessary packages and deep learning frameworks to get you quickly started with deep learning application development:

1. We first install the NVIDIA drivers:

$ sudo add-apt-repository ppa:graphics-drivers/ppa -y
$ sudo apt-get update
$ sudo apt-get install -y nvidia-375 nvidia-settings nvidia-modprobe

We now install Docker Community Edition as follows:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
$ sudo apt-get update
$ sudo apt-get install -y docker-ce

2. We then install NVIDIA-Docker and its plugin:

$ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
$ sudo dpkg -i /tmp/nvidia-docker_1.0.1-1_amd64.deb && rm /tmp/nvidia-docker_1.0.1-1_amd64.deb

3. To validate whether the installation happened correctly, we use the following command:

$ sudo nvidia-docker run --rm nvidia/cuda nvidia-smi

4. Once it is set up correctly, we can use the official TensorFlow or Theano Docker image:

$ sudo nvidia-docker run -it tensorflow/tensorflow:latest-gpu bash

5. We can run a simple Python program to check whether TensorFlow works properly:

import tensorflow as tf
a = tf.constant(5, tf.float32)
b = tf.constant(5, tf.float32)
with tf.Session() as sess:
    output = sess.run(tf.add(a, b))  # output is 10.0
    print("Output of graph computation is = ", output)

You should see the TensorFlow output on the screen now, as shown in figure Tensorflow sample output.

We saw how to set up a deep learning system on AWS from scratch and with Docker. If you found our post useful, do check out this book Deep Learning Essentials to optimize deep learning models for better performance output.
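As a quick sanity check after either setup, a minimal Python sketch along the following lines can confirm that the NVIDIA driver is visible and, if TensorFlow happens to be installed in the environment, which devices it can see. It only assumes that nvidia-smi is on the PATH; the TensorFlow check is optional and uses the TF 1.x device listing helper:

import shutil
import subprocess

# Check that the NVIDIA driver is installed and at least one GPU is visible.
if shutil.which("nvidia-smi"):
    print(subprocess.check_output(["nvidia-smi", "-L"]).decode())
else:
    print("nvidia-smi not found -- the NVIDIA driver is probably not installed")

# Optionally, ask TensorFlow which devices it can see (works with TensorFlow 1.x).
try:
    from tensorflow.python.client import device_lib
    for device in device_lib.list_local_devices():
        print(device.name, device.device_type)
except ImportError:
    print("TensorFlow is not installed in this environment")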


Implementing matrix operations using SciPy and NumPy

Pravin Dhandre
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book includes hands-on recipes for using different components of the SciPy Stack such as NumPy, SciPy, matplotlib, pandas, etc.[/box]

In this article, we will discuss how to leverage the power of SciPy and NumPy to perform numerous matrix operations and solve common challenges faced while proceeding with statistical analysis.

Matrix operations and functions on two-dimensional arrays

Basic matrix operations form the backbone of quite a few statistical analyses—for example, neural networks. In this section, we will be covering some of the most used operations and functions on 2D arrays:

Addition
Multiplication by scalar
Matrix arithmetic
Matrix-matrix multiplication
Matrix inversion
Matrix transposition

In the following sections, we will look into the methods of implementing each of them in Python using SciPy/NumPy.

How to do it…

Let's look at the different methods.

Matrix addition

In order to understand how matrix addition is done, we will first initialize two arrays:

# Initializing an array
x = np.array([[1, 1], [2, 2]])
y = np.array([[10, 10], [20, 20]])

Similar to what we saw in a previous chapter, we initialize a 2 x 2 array by using the np.array function. There are two methods by which we can add two arrays.

Method 1

A simple addition of the two arrays x and y can be performed as follows:

x+y

Note that x evaluates to:

[[1 1]
 [2 2]]

y evaluates to:

[[10 10]
 [20 20]]

The result of x+y would be equal to:

[[1+10 1+10]
 [2+20 2+20]]

Finally, this gets evaluated to:

[[11 11]
 [22 22]]

Method 2

The same preceding operation can also be performed by using the add function in the numpy package as follows:

np.add(x,y)

Multiplication by a scalar

Matrix multiplication by a scalar can be performed by multiplying the matrix with a number. We will perform the same using the following two steps:

1. Initialize a two-dimensional array.
2. Multiply the two-dimensional array with a scalar.

We perform the steps as follows. To initialize a two-dimensional array:

x = np.array([[1, 1], [2, 2]])

To multiply the two-dimensional array with the scalar k:

k*x

For example, if the scalar value is k = 2, then the value of k*x translates to:

2*x
array([[2, 2],
       [4, 4]])

Matrix arithmetic

Standard arithmetic operators can be applied to NumPy arrays too. The operations used most often are:

Addition
Subtraction
Multiplication
Division
Exponentials

The other major arithmetic operations are similar to the addition operation we performed on two matrices in the Matrix addition section earlier:

# subtraction
x-y
array([[ -9,  -9],
       [-18, -18]])

# multiplication
x*y
array([[10, 10],
       [40, 40]])

While performing multiplication here, there is an element-to-element multiplication between the two matrices, not a matrix multiplication (more on matrix multiplication in the next section):

# division
x/y
array([[ 0.1, 0.1],
       [ 0.1, 0.1]])

# exponential
x**y
array([[      1,       1],
       [1048576, 1048576]], dtype=int32)

Matrix-matrix multiplication

Matrix-to-matrix multiplication works in the following way: we have a set of two matrices, where matrix A has n rows and m columns and matrix B has m rows and p columns.
The matrix multiplication C = AB of A and B then has n rows and p columns, with each entry C[i, j] given by the sum over k of A[i, k] * B[k, j]. The matrix operation is performed by using the built-in dot function available in NumPy as follows:

1. Initialize the arrays:

x = np.array([[1, 1], [2, 2]])
y = np.array([[10, 10], [20, 20]])

2. Perform the matrix multiplication using the dot function in the numpy package:

np.dot(x,y)
array([[30, 30],
       [60, 60]])

The np.dot function does the multiplication in the following way:

array([[1*10 + 1*20, 1*10 + 1*20],
       [2*10 + 2*20, 2*10 + 2*20]])

Whenever matrix multiplication happens, the number of columns in the first matrix should be equal to the number of rows in the second matrix.

Matrix transposition

Matrix transposition is performed by using the transpose function available in the numpy package. The process to generate the transpose of a matrix is as follows:

1. Initialize a matrix:

A = np.array([[1,2],[3,4]])

2. Calculate the transpose of the matrix:

A.transpose()
array([[1, 3],
       [2, 4]])

The transpose of a matrix with m rows and n columns is a matrix with n rows and m columns.

Matrix inversion

While we performed most of the basic arithmetic operations on top of matrices earlier, we have not performed any specialist functions within scientific computing/analysis—for example, matrix inversion, transposition, ranking of a matrix, and so on. This is where the other functions available within the scipy package shine (over and above the previously discussed functions), in scenarios where more data manipulation is required apart from the standard ones. Matrix inversion can be performed by using the inv function available in scipy.linalg. The process to perform matrix inversion and its implementation in Python is as follows:

1. Import the relevant packages and classes/functions within a package:

from scipy import linalg

2. Initialize a matrix:

A = np.array([[1,2],[3,4]])

3. Pass the initialized matrix through the inverse function in the package:

linalg.inv(A)
array([[-2. ,  1. ],
       [ 1.5, -0.5]])

We saw how easily we can implement all the basic matrix operations with Python's scientific library, SciPy. You may check out this book SciPy Recipes to perform advanced computing tasks like Discrete Fourier Transform and K-means with the SciPy stack.
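Putting the above recipes together, a short self-contained script (a sketch assuming only NumPy and SciPy are installed) runs through addition, scalar multiplication, element-wise arithmetic, matrix multiplication, transposition, and inversion in one go:

import numpy as np
from scipy import linalg

x = np.array([[1, 1], [2, 2]])
y = np.array([[10, 10], [20, 20]])

print(x + y)          # element-wise addition: [[11 11] [22 22]]
print(np.add(x, y))   # same result via the add function
print(2 * x)          # multiplication by the scalar k = 2
print(x * y)          # element-to-element multiplication
print(np.dot(x, y))   # matrix-matrix multiplication: [[30 30] [60 60]]

A = np.array([[1, 2], [3, 4]])
print(A.transpose())  # transpose: [[1 3] [2 4]]
print(linalg.inv(A))  # inverse: [[-2. 1.] [1.5 -0.5]]

# A quick check: A multiplied by its inverse should give the identity matrix.
print(np.allclose(np.dot(A, linalg.inv(A)), np.eye(2)))  # True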


Implement Long Short-Term Memory (LSTM) with TensorFlow

Gebin George
06 Mar 2018
4 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials written by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book will help you get started with the essentials of deep learning and neural network modeling.[/box]

In today's tutorial, we will look at an example of using LSTM in TensorFlow to perform sentiment classification. The input to the LSTM will be a sentence or sequence of words. The output of the LSTM will be a binary value indicating a positive sentiment with 1 and a negative sentiment with 0. We will use a many-to-one LSTM architecture for this problem, since it maps multiple inputs onto a single output. Figure LSTM: Basic cell architecture shows this architecture in more detail. As shown there, the input takes a sequence of word tokens (in this case, a sequence of three words). Each word token is input at a new time step and is fed to the hidden state for the corresponding time step. For example, the word Book is input at time step t and is fed to the hidden state ht.

Sentiment analysis

To implement this model in TensorFlow, we need to first define a few variables as follows:

batch_size = 4
lstm_units = 16
num_classes = 2
max_sequence_length = 4
embedding_dimension = 64
num_iterations = 1000

As shown previously, batch_size dictates how many sequences of tokens we can input in one batch for training. lstm_units represents the total number of LSTM cells in the network. max_sequence_length represents the maximum possible length of a given sequence. Once defined, we now proceed to initialize TensorFlow-specific data structures for the input data as follows:

import tensorflow as tf
labels = tf.placeholder(tf.float32, [batch_size, num_classes])
raw_data = tf.placeholder(tf.int32, [batch_size, max_sequence_length])

Given we are working with word tokens, we would like to represent them using a good feature representation technique. Let us assume the word embedding representation takes a word token and projects it onto an embedding space of dimension embedding_dimension. The two-dimensional input data containing raw word tokens is now transformed into a three-dimensional word tensor, with the added dimension representing the word embedding. We also use pre-computed word embeddings, stored in a word_vectors data structure. We initialize the data structures as follows:

data = tf.Variable(tf.zeros([batch_size, max_sequence_length, embedding_dimension]), dtype=tf.float32)
data = tf.nn.embedding_lookup(word_vectors, raw_data)

Now that the input data is ready, we look at defining the LSTM model. As shown previously, we need to create lstm_units of a basic LSTM cell. Since we need to perform a classification at the end, we wrap the LSTM unit with a dropout wrapper. To perform a full temporal pass of the data on the defined network, we unroll the LSTM using the dynamic_rnn routine of TensorFlow.
We also initialize a random weight matrix and a constant value of 0.1 as the bias vector, as follows:

weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes]))
bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
wrapped_lstm_cell = tf.contrib.rnn.DropoutWrapper(cell=lstm_cell, output_keep_prob=0.8)
output, state = tf.nn.dynamic_rnn(wrapped_lstm_cell, data, dtype=tf.float32)

Once the output is generated by the dynamically unrolled RNN, we transpose its shape, multiply it by the weight vector, and add a bias vector to it to compute the final prediction value:

output = tf.transpose(output, [1, 0, 2])
last = tf.gather(output, int(output.get_shape()[0]) - 1)
prediction = (tf.matmul(last, weight) + bias)
weight = tf.cast(weight, tf.float64)
last = tf.cast(last, tf.float64)
bias = tf.cast(bias, tf.float64)

Since the initial prediction needs to be refined, we define an objective function with cross-entropy to minimize the loss as follows:

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)

After this sequence of steps, we have a trained, end-to-end LSTM network for sentiment classification of arbitrary-length sentences.

To summarize, we saw how effectively we can implement an LSTM network using TensorFlow. If you are interested to know more, check out this book Deep Learning Essentials, which will help you take your first steps in training efficient deep learning models and apply them in various practical scenarios.
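The excerpt above references a pre-computed word_vectors structure that is defined elsewhere in the book. As a self-contained sketch only (assuming TensorFlow 1.x, with a randomly initialized embedding matrix and random integer tokens standing in for a real sentiment dataset), the pieces can be assembled and run end to end as follows:

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x for placeholder/Session/contrib APIs

batch_size = 4
lstm_units = 16
num_classes = 2
max_sequence_length = 4
embedding_dimension = 64
vocabulary_size = 1000   # hypothetical vocabulary size for the random embedding matrix
num_iterations = 100

labels = tf.placeholder(tf.float32, [batch_size, num_classes])
raw_data = tf.placeholder(tf.int32, [batch_size, max_sequence_length])

# Random embedding matrix standing in for the book's pre-computed word_vectors.
word_vectors = tf.Variable(tf.random_uniform([vocabulary_size, embedding_dimension], -1.0, 1.0))
data = tf.nn.embedding_lookup(word_vectors, raw_data)

lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
wrapped_lstm_cell = tf.contrib.rnn.DropoutWrapper(cell=lstm_cell, output_keep_prob=0.8)
output, state = tf.nn.dynamic_rnn(wrapped_lstm_cell, data, dtype=tf.float32)

weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes]))
bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
output = tf.transpose(output, [1, 0, 2])                    # time-major: [time, batch, units]
last = tf.gather(output, int(output.get_shape()[0]) - 1)    # hidden output of the last time step
prediction = tf.matmul(last, weight) + bias

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_iterations):
        # Random tokens and one-hot labels stand in for a real sentiment dataset.
        batch_tokens = np.random.randint(0, vocabulary_size, (batch_size, max_sequence_length))
        batch_labels = np.eye(num_classes)[np.random.randint(0, num_classes, batch_size)]
        _, current_loss = sess.run([optimizer, loss],
                                   feed_dict={raw_data: batch_tokens, labels: batch_labels})
        if step % 20 == 0:
            print("step", step, "loss", current_loss)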


Logistic Regression Using TensorFlow

Packt
06 Mar 2018
9 min read
In this article by PKS Prakash and Achyutuni Sri Krishna Rao, authors of R Deep Learning Cookbook, we will learn how to perform logistic regression using TensorFlow. In this recipe, we will cover the application of TensorFlow in setting up a logistic regression model. The example will use a similar dataset to that used in the H2O model setup.

What is TensorFlow

TensorFlow is another open source library developed by the Google Brain Team to build numerical computation models using data flow graphs. The core of TensorFlow was developed in C++ with the wrapper in Python. The tensorflow package in R gives you access to the TensorFlow API, composed of Python modules to execute computation models. TensorFlow supports both CPU- and GPU-based computations. The tensorflow package in R calls the Python tensorflow API for execution, which is why it is essential to install the tensorflow package in both R and Python to make R work. The following are the dependencies for tensorflow:

Python 2.7 / 3.x
R (>3.2)
devtools package in R for installing TensorFlow from GitHub
TensorFlow in Python
pip

Getting ready

The code for this section is created on Linux but can be run on any operating system. To start modeling, load the tensorflow package in the environment. R loads the default TensorFlow environment variable and also the NumPy library from Python in the np variable:

library("tensorflow") # Load TensorFlow
np <- import("numpy") # Load numpy library

How to do it...

The data is imported using a standard function from R, as shown in the following code. The data is imported using read.csv and transformed into the matrix format, followed by selecting the features used to model as defined in xFeatures and yFeatures. The next step in TensorFlow is to set up a graph to run the optimization:

# Loading input and test data
xFeatures = c("Temperature", "Humidity", "Light", "CO2", "HumidityRatio")
yFeatures = "Occupancy"
occupancy_train <- as.matrix(read.csv("datatraining.txt", stringsAsFactors = T))
occupancy_test <- as.matrix(read.csv("datatest.txt", stringsAsFactors = T))

# Subset features for modeling and transform to numeric values
occupancy_train <- apply(occupancy_train[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)
occupancy_test <- apply(occupancy_test[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)

# Data dimensions
nFeatures <- length(xFeatures)
nRow <- nrow(occupancy_train)

Before setting up the graph, let's reset the graph using the following command:

# Reset the graph
tf$reset_default_graph()

Additionally, let's start an interactive session, as it will allow us to execute variables without referring to the session-to-session object:

# Starting session as interactive session
sess <- tf$InteractiveSession()

Define the logistic regression model in TensorFlow:

# Setting-up Logistic regression graph
x <- tf$constant(unlist(occupancy_train[, xFeatures]), shape=c(nRow, nFeatures), dtype=np$float32)
W <- tf$Variable(tf$random_uniform(shape(nFeatures, 1L)))
b <- tf$Variable(tf$zeros(shape(1L)))
y <- tf$matmul(x, W) + b

The input feature x is defined as a constant as it will be an input to the system. The weight W and bias b are defined as variables that will be optimized during the optimization process. The y is set up as a symbolic representation between x, W, and b. The weight W is initialized with a random uniform distribution and b is assigned the value zero.
The next step is to set up the cost function for logistic regression:

# Setting-up cost function and optimizer
y_ <- tf$constant(unlist(occupancy_train[, yFeatures]), dtype="float32", shape=c(nRow, 1L))
cross_entropy <- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=y_, logits=y, name="cross_entropy"))
optimizer <- tf$train$GradientDescentOptimizer(0.15)$minimize(cross_entropy)

# Start a session
init <- tf$global_variables_initializer()
sess$run(init)

Execute the gradient descent algorithm for the optimization of weights, using cross entropy as the loss function:

# Running optimization
for (step in 1:5000) {
  sess$run(optimizer)
  if (step %% 20 == 0)
    cat(step, "-", sess$run(W), sess$run(b), "==>", sess$run(cross_entropy), "\n")
}

How it works...

The performance of the model can be evaluated using AUC:

# Performance on Train
library(pROC)
ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))

# Performance on test
nRowt <- nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))

AUC can be visualized using the plot.roc function from the pROC package, as shown in the screenshot following this command. The performance for training and testing (holdout) is very similar:

plot.roc(roc_obj, col = "green", lty=2, lwd=2)
plot.roc(roc_objt, add=T, col="red", lty=4, lwd=2)

Performance of logistic regression using TensorFlow

Visualizing TensorFlow graphs

TensorFlow graphs can be visualized using TensorBoard. It is a service that utilizes TensorFlow event files to visualize TensorFlow models as graphs. Graph model visualization in TensorBoard is also used to debug TensorFlow models.

Getting ready

TensorBoard can be started using the following command in the terminal:

$ tensorboard --logdir home/log --port 6006

The following are the major parameters for TensorBoard:

--logdir: To map to the directory to load TensorFlow events
--debug: To increase log verbosity
--host: To define the host to listen to; localhost (127.0.0.1) by default
--port: To define the port to which TensorBoard will serve

The preceding command will launch the TensorFlow service on localhost at port 6006, as shown in the screenshot TensorBoard. The tabs on the TensorBoard capture relevant data generated during graph execution.

How to do it...

This section covers how to visualize TensorFlow models and output in TensorBoard. To visualize summaries and graphs, data from TensorFlow can be exported using the FileWriter command from the summary module.
A default session graph can be added using the following command:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

The graph for logistic regression developed using the preceding code is shown in the screenshot Visualization of the logistic regression graph in TensorBoard.

Similarly, other variable summaries can be added to the TensorBoard using the correct summaries, as shown in the following code:

# Adding histogram summary to weight and bias variable
w_hist = tf$histogram_summary("weights", W)
b_hist = tf$histogram_summary("biases", b)

Create a cross entropy evaluation for test. An example script to generate the cross entropy cost function for test and train is shown in the following command:

# Set-up cross entropy for test
nRowt <- nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- tf$nn$sigmoid(tf$matmul(xt, W) + b)
yt_ <- tf$constant(unlist(occupancy_test[, yFeatures]), dtype="float32", shape=c(nRowt, 1L))
cross_entropy_tst <- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=yt_, logits=ypredt, name="cross_entropy_tst"))

Add summary variables to be collected:

# Add summary ops to collect data
w_hist = tf$summary$histogram("weights", W)
b_hist = tf$summary$histogram("biases", b)
crossEntropySummary <- tf$summary$scalar("costFunction", cross_entropy)
crossEntropyTstSummary <- tf$summary$scalar("costFunction_test", cross_entropy_tst)

Open the writing object, log_writer. It writes the default graph to the location c:/log:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

Run the optimization and collect the summaries:

for (step in 1:2500) {
  sess$run(optimizer)
  # Evaluate performance on training and test data after every 50 iterations
  if (step %% 50 == 0) {
    ### Performance on Train
    ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
    roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))
    ### Performance on Test
    ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
    roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))
    cat("train AUC: ", auc(roc_obj), " Test AUC: ", auc(roc_objt), "\n")
    # Save summary of Bias and weights
    log_writer$add_summary(sess$run(b_hist), global_step=step)
    log_writer$add_summary(sess$run(w_hist), global_step=step)
    log_writer$add_summary(sess$run(crossEntropySummary), global_step=step)
    log_writer$add_summary(sess$run(crossEntropyTstSummary), global_step=step)
  }
}

Collect all the summaries into a single tensor using the merge_all command from the summary module:

summary = tf$summary$merge_all()

Write the summaries to the log file using the log_writer object:

log_writer = tf$summary$FileWriter('c:/log', sess$graph)
summary_str = sess$run(summary)
log_writer$add_summary(summary_str, step)
log_writer$close()

Summary

In this article, we have learned how to perform logistic regression using TensorFlow and covered the application of TensorFlow in setting up a logistic regression model.
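For readers working in Python rather than R, the same model can be expressed directly against the Python TensorFlow API. The sketch below is an illustration under stated assumptions: it uses TensorFlow 1.x and a small synthetic dataset in place of the occupancy CSV files used above, and reports training accuracy instead of the pROC-based AUC:

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

n_features = 5
n_rows = 1000

# Synthetic stand-in for the occupancy data: random features and a roughly linearly separable label.
rng = np.random.RandomState(0)
features = rng.randn(n_rows, n_features).astype(np.float32)
true_weights = rng.randn(n_features, 1).astype(np.float32)
labels = (features.dot(true_weights) + 0.1 * rng.randn(n_rows, 1) > 0).astype(np.float32)

x = tf.constant(features)
y_ = tf.constant(labels)
W = tf.Variable(tf.random_uniform([n_features, 1]))
b = tf.Variable(tf.zeros([1]))
logits = tf.matmul(x, W) + b

cross_entropy = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=logits))
optimizer = tf.train.GradientDescentOptimizer(0.15).minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1, 5001):
        sess.run(optimizer)
        if step % 1000 == 0:
            print(step, sess.run(cross_entropy))
    # Training accuracy as a rough stand-in for the AUC check done with pROC in R.
    predictions = sess.run(tf.nn.sigmoid(logits)) > 0.5
    print("train accuracy:", (predictions == labels.astype(bool)).mean())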


How to Compute Interpolation in SciPy

Pravin Dhandre
05 Mar 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides numerous recipes in mastering common tasks related to SciPy and associated libraries such as NumPy, pandas, and matplotlib.[/box]

In today's tutorial, we will see how to compute and solve polynomial and univariate interpolations using SciPy, with a detailed process and instructions. In this recipe, we will look at how to compute data polynomial interpolation by applying some important methods, which are discussed in detail in the coming How to do it... section.

Getting ready

We will need to follow some instructions and install the prerequisites.

How to do it…

Let's get started. In the following steps, we will explain how to compute a polynomial interpolation and the things we need to know. The grid-data interpolation routine (scipy.interpolate.griddata) requires the following parameters:

points: An ndarray of floats, shape (n, D), of data point coordinates. It can be either an array of shape (n, D) or a tuple of ndim arrays.
values: An ndarray of float or complex, shape (n,), of data values.
xi: A 2D ndarray of floats or a tuple of 1D arrays, shape (M, D). Points at which to interpolate data.
method: One of {'linear', 'nearest', 'cubic'}—an optional method of interpolation. nearest returns the value at the data point closest to the point of interpolation (see NearestNDInterpolator for more details). linear tessellates the input point set to n-dimensional simplices, and interpolates linearly on each simplex (see LinearNDInterpolator for more details). cubic (1D) returns the value determined from a cubic spline. cubic (2D) returns the value determined from a piecewise cubic, continuously differentiable (C1), and approximately curvature-minimizing polynomial surface (see CloughTocher2DInterpolator for more details).
fill_value: float; optional. The value used to fill in for requested points outside of the convex hull of the input points. If not provided, the default is nan. This option has no effect on the nearest method.
rescale: bool; optional. Rescale points to the unit cube before performing interpolation. This is useful if some of the input dimensions have non-commensurable units and differ by many orders of magnitude.

How it works…

One can see that the exact result is reproduced by all of the methods to some degree, but for this smooth function, the piecewise cubic interpolant gives the best results:

import matplotlib.pyplot as plt
import numpy as np

methods = [None, 'none', 'nearest', 'bilinear', 'bicubic', 'spline16',
           'spline36', 'hanning', 'hamming', 'hermite', 'kaiser', 'quadric',
           'catrom', 'gaussian', 'bessel', 'mitchell', 'sinc', 'lanczos']

# Fixing random state for reproducibility
np.random.seed(19680801)

grid = np.random.rand(4, 4)
fig, axes = plt.subplots(3, 6, figsize=(12, 6), subplot_kw={'xticks': [], 'yticks': []})
fig.subplots_adjust(hspace=0.3, wspace=0.05)
for ax, interp_method in zip(axes.flat, methods):
    ax.imshow(grid, interpolation=interp_method, cmap='viridis')
    ax.set_title(interp_method)
plt.show()

This is the result of the execution:

Univariate interpolation

In the next section, we will look at how to solve univariate interpolation.

Getting ready

We will need to follow some instructions and install the prerequisites.
How to do it…

The following table summarizes the different univariate interpolation modes coded in SciPy, together with the processes that we may use to resolve them.

Finding a cubic spline that interpolates a set of data

In this recipe, we will look at how to find a cubic spline that interpolates data with the main spline method.

Getting ready

We will need to follow some instructions and install the prerequisites.

How to do it…

We can use the cubic spline interpolator (scipy.interpolate.CubicSpline) to solve problems with these parameters:

x: array_like, shape (n,). A 1D array containing values of the independent variable. The values must be real, finite, and in strictly increasing order.
y: array_like. An array containing values of the dependent variable. It can have an arbitrary number of dimensions, but the length along axis must match the length of x. The values must be finite.
axis: int; optional. The axis along which y is assumed to be varying, meaning for x[i], the corresponding values are np.take(y, i, axis=axis). The default is 0.
bc_type: string or two-tuple; optional. Boundary condition type. Two additional equations, given by the boundary conditions, are required to determine all coefficients of polynomials on each segment. Refer to https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.interpolate.CubicSpline.html#r59. If bc_type is a string, the specified condition will be applied at both ends of the spline. The available conditions are:
not-a-knot (default): The first and second segments at a curve end are the same polynomial. This is a good default when there is no information about boundary conditions.
periodic: The interpolated function is assumed to be periodic with period x[-1] - x[0]. The first and last value of y must be identical: y[0] == y[-1]. This boundary condition will result in y'[0] == y'[-1] and y''[0] == y''[-1].
clamped: The first derivatives at the curve ends are zero. Assuming a 1D y, bc_type=((1, 0.0), (1, 0.0)) is the same condition.
natural: The second derivatives at the curve ends are zero. Assuming a 1D y, bc_type=((2, 0.0), (2, 0.0)) is the same condition.
If bc_type is a two-tuple, the first and the second value will be applied at the curve's start and end respectively. The tuple value can be one of the previously mentioned strings (except periodic) or a tuple (order, deriv_values), allowing us to specify arbitrary derivatives at the curve ends:
order: The derivative order; it is 1 or 2.
deriv_value: An array_like containing derivative values. The shape must be the same as y, excluding the axis dimension. For example, if y is 1D, then deriv_value must be a scalar. If y is 3D with shape (n0, n1, n2) and axis=2, then deriv_value must be 2D and have the shape (n0, n1).
extrapolate: {bool, 'periodic', None}; optional. If bool, determines whether to extrapolate to out-of-bounds points based on the first and last intervals, or to return NaNs. If 'periodic', periodic extrapolation is used. If None (the default), extrapolate is set to 'periodic' for bc_type='periodic' and to True otherwise.

How it works...
We have the following example:

%pylab inline
from scipy.interpolate import CubicSpline
import matplotlib.pyplot as plt

x = np.arange(10)
y = np.sin(x)
cs = CubicSpline(x, y)
xs = np.arange(-0.5, 9.6, 0.1)
plt.figure(figsize=(6.5, 4))
plt.plot(x, y, 'o', label='data')
plt.plot(xs, np.sin(xs), label='true')
plt.plot(xs, cs(xs), label="S")
plt.plot(xs, cs(xs, 1), label="S'")
plt.plot(xs, cs(xs, 2), label="S''")
plt.plot(xs, cs(xs, 3), label="S'''")
plt.xlim(-0.5, 9.5)
plt.legend(loc='lower left', ncol=2)
plt.show()

We can see the result here. We see the next example:

theta = 2 * np.pi * np.linspace(0, 1, 5)
y = np.c_[np.cos(theta), np.sin(theta)]
cs = CubicSpline(theta, y, bc_type='periodic')
print("ds/dx={:.1f} ds/dy={:.1f}".format(cs(0, 1)[0], cs(0, 1)[1]))
# ds/dx=0.0 ds/dy=1.0
xs = 2 * np.pi * np.linspace(0, 1, 100)
plt.figure(figsize=(6.5, 4))
plt.plot(y[:, 0], y[:, 1], 'o', label='data')
plt.plot(np.cos(xs), np.sin(xs), label='true')
plt.plot(cs(xs)[:, 0], cs(xs)[:, 1], label='spline')
plt.axes().set_aspect('equal')
plt.legend(loc='center')
plt.show()

In the following screenshot, we can see the final result:

Defining a B-spline for a given set of control points

In the next section, we will look at how to solve B-splines given some control data.

Getting ready

We need to follow some instructions and install the prerequisites.

How to do it…

A univariate spline in the B-spline basis can be written as

S(x) = Σ_{j=0}^{n-1} c_j B_{j,k;t}(x)

where the B_{j,k;t} are B-spline basis functions of degree k and knots t, and the c_j are the coefficients for the control points. We can use the following parameters.

How it works...

Here, we construct a quadratic spline function on the base interval 2 <= x <= 4 and compare it with the naive way of evaluating the spline:

from scipy import interpolate
import numpy as np
import matplotlib.pyplot as plt

# sampling
x = np.linspace(0, 10, 10)
y = np.sin(x)

# spline through all the sampled points
tck = interpolate.splrep(x, y)
x2 = np.linspace(0, 10, 200)
y2 = interpolate.splev(x2, tck)

# spline with all the middle points as knots (not working yet)
# knots = x[1:-1] # it should be something like this
knots = np.array([x[1]]) # not working with the above line; just seeing what this line does
weights = np.concatenate(([1], np.ones(x.shape[0]-2)*.01, [1]))
tck = interpolate.splrep(x, y, t=knots, w=weights)
x3 = np.linspace(0, 10, 200)
y3 = interpolate.splev(x2, tck)

# plot
plt.plot(x, y, 'go', x2, y2, 'b', x3, y3, 'r')
plt.show()

Note that outside of the base interval, results differ. This is because BSpline extrapolates the first and last polynomial pieces of the B-spline functions active on the base interval. This is the result of solving the problem:

We successfully performed numerical computation and found interpolation functions using the polynomial and univariate interpolation routines coded in SciPy. If you found this tutorial useful, do check out the book SciPy Recipes to get quick recipes for performing other mathematical operations like differential equations, K-means and Discrete Fourier Transform.
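The parameters listed at the start of this recipe belong to SciPy's grid-data interpolator. As a brief sketch (assuming only NumPy, SciPy, and matplotlib are installed), here is how they fit together for scattered 2D data, comparing the three method options side by side:

import numpy as np
from scipy.interpolate import griddata
import matplotlib.pyplot as plt

def f(x, y):
    # A smooth test function to sample at scattered points and then reconstruct.
    return np.sin(x * 2 * np.pi) * np.cos(y * 2 * np.pi)

# Scattered sample points (the 'points' and 'values' parameters).
rng = np.random.RandomState(0)
points = rng.rand(400, 2)
values = f(points[:, 0], points[:, 1])

# Regular grid at which to interpolate (the 'xi' parameter).
grid_x, grid_y = np.mgrid[0:1:100j, 0:1:100j]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, method in zip(axes, ['nearest', 'linear', 'cubic']):
    grid_z = griddata(points, values, (grid_x, grid_y), method=method, fill_value=np.nan)
    ax.imshow(grid_z.T, extent=(0, 1, 0, 1), origin='lower')
    ax.set_title(method)
plt.show()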


How to implement Reinforcement Learning with TensorFlow

Gebin George
05 Mar 2018
3 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials co-authored by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book will help you get to grips with the essentials of deep learning by leveraging the power of Python.[/box]

In today's tutorial, we will implement reinforcement learning with a TensorFlow-based Q-learning algorithm. We will look at a popular game, FrozenLake, which has an inbuilt environment in the OpenAI gym package. The idea behind the FrozenLake game is quite simple. It consists of 4 x 4 grid blocks, where each block can have one of the following four states:

S: Starting point/Safe state
F: Frozen surface/Safe state
H: Hole/Unsafe state
G: Goal/Safe or Terminal state

In each of the 16 cells, you can use one of the four actions, namely up/down/left/right, to move to a neighboring state. The goal of the game is to start from state S and end at state G. We will show how we can use a neural network-based Q-learning system to learn a safe path from state S to state G. First, we import the necessary packages and define the game environment:

import gym
import numpy as np
import random
import tensorflow as tf

env = gym.make('FrozenLake-v0')

Once the environment is defined, we can define the network structure that learns the Q-values. We will use a one-layer neural network with 16 input neurons and 4 output neurons as follows:

input_matrix = tf.placeholder(shape=[1,16],dtype=tf.float32)
weight_matrix = tf.Variable(tf.random_uniform([16,4],0,0.01))
Q_matrix = tf.matmul(input_matrix,weight_matrix)
prediction_matrix = tf.argmax(Q_matrix,1)
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Q_matrix))
train = tf.train.GradientDescentOptimizer(learning_rate=0.05)
model = train.minimize(loss)
init_op = tf.global_variables_initializer()

Now we can choose the action greedily:

ip_q = np.zeros(num_states)
ip_q[current_state] = 1
a,allQ = sess.run([prediction_matrix,Q_matrix],feed_dict={input_matrix: [ip_q]})
if np.random.rand(1) < sample_epsilon:
    a[0] = env.action_space.sample()
next_state, reward, done, info = env.step(a[0])
ip_q1 = np.zeros(num_states)
ip_q1[next_state] = 1
Q1 = sess.run(Q_matrix,feed_dict={input_matrix:[ip_q1]})
maxQ1 = np.max(Q1)
targetQ = allQ
targetQ[0,a[0]] = reward + y*maxQ1
_,W1 = sess.run([model,weight_matrix],feed_dict={input_matrix: [ip_q],nextQ:targetQ})

Figure RL with Q-learning example shows the sample output of the program when executed. You can see different values of the Q matrix as the agent moves from one state to another. You will also notice a reward value of 1 when the agent is in state 15.

To summarize, we saw how reinforcement learning can be practically implemented using TensorFlow. If you found this post useful, do check out the book Deep Learning Essentials which will help you fine-tune and optimize your deep learning models for better performance.
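The greedy action-selection snippet above refers to a surrounding training loop and variables (sess, num_states, current_state, sample_epsilon, y) that the excerpt does not show. A minimal runnable sketch of that loop, assuming TensorFlow 1.x and a gym version that still exposes FrozenLake-v0 with the four-value step API, could look like this:

import gym
import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

env = gym.make('FrozenLake-v0')
num_states = env.observation_space.n   # 16
num_actions = env.action_space.n       # 4

input_matrix = tf.placeholder(shape=[1, num_states], dtype=tf.float32)
weight_matrix = tf.Variable(tf.random_uniform([num_states, num_actions], 0, 0.01))
Q_matrix = tf.matmul(input_matrix, weight_matrix)
prediction_matrix = tf.argmax(Q_matrix, 1)
nextQ = tf.placeholder(shape=[1, num_actions], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Q_matrix))
model = tf.train.GradientDescentOptimizer(learning_rate=0.05).minimize(loss)

y = 0.99                # discount factor
sample_epsilon = 0.1    # exploration rate
num_episodes = 2000

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for episode in range(num_episodes):
        current_state = env.reset()
        done = False
        while not done:
            ip_q = np.zeros(num_states)
            ip_q[current_state] = 1
            a, allQ = sess.run([prediction_matrix, Q_matrix], feed_dict={input_matrix: [ip_q]})
            if np.random.rand(1) < sample_epsilon:
                a[0] = env.action_space.sample()
            next_state, reward, done, info = env.step(a[0])
            ip_q1 = np.zeros(num_states)
            ip_q1[next_state] = 1
            Q1 = sess.run(Q_matrix, feed_dict={input_matrix: [ip_q1]})
            targetQ = allQ
            targetQ[0, a[0]] = reward + y * np.max(Q1)
            sess.run(model, feed_dict={input_matrix: [ip_q], nextQ: targetQ})
            current_state = next_state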

How to use MapReduce with Mongo shell

Amey Varangaonkar
02 Mar 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering MongoDB 3.x authored by Alex Giamas. This book demonstrates the power of MongoDB to build high performance database solutions with ease.[/box]

MongoDB is one of the most popular NoSQL databases in the world and can be combined with various Big Data tools for efficient data processing. In this article we explore an interesting feature of MongoDB which has been underappreciated and not widely supported throughout the industry as yet: the ability to write MapReduce natively using the shell.

MapReduce is a data processing method for getting aggregate results from a large set of data. The main advantage is that it is inherently parallelizable, as evidenced by frameworks such as Hadoop. A simple example of MapReduce would be as follows, given that our input books collection is:

> db.books.find()
{ "_id" : ObjectId("592149c4aabac953a3a1e31e"), "isbn" : "101", "name" : "Mastering MongoDB", "price" : 30 }
{ "_id" : ObjectId("59214bc1aabac954263b24e0"), "isbn" : "102", "name" : "MongoDB in 7 years", "price" : 50 }
{ "_id" : ObjectId("59214bc1aabac954263b24e1"), "isbn" : "103", "name" : "MongoDB for experts", "price" : 40 }

And our map and reduce functions are defined as follows:

> var mapper = function() { emit(this.id, 1); };

In this mapper, we simply output a key of the id of each document with a value of 1:

> var reducer = function(id, count) { return Array.sum(count); };

In the reducer, we sum across all values (where each one has a value of 1):

> db.books.mapReduce(mapper, reducer, { out:"books_count" });
{
  "result" : "books_count",
  "timeMillis" : 16613,
  "counts" : { "input" : 3, "emit" : 3, "reduce" : 1, "output" : 1 },
  "ok" : 1
}
> db.books_count.find()
{ "_id" : null, "value" : 3 }

Our final output is a document with no ID, since we didn't output any value for id, and a value of 3, since there are three documents in the input dataset. Using MapReduce, MongoDB will apply map to each input document, emitting key-value pairs at the end of the map phase. Then each reducer will get key-value pairs with the same key as input, processing all of the multiple values. The reducer's output will be a single key-value pair for each key. Optionally, we can use a finalize function to further process the results of the mapper and reducer. MapReduce functions use JavaScript and run within the mongod process. MapReduce can output inline as a single document, subject to the 16 MB document size limit, or as multiple documents in an output collection. Input and output collections can be sharded.

MapReduce concurrency

MapReduce operations will place several short-lived locks that should not affect operations. However, at the end of the reduce phase, if we are outputting data to an existing collection, then output actions such as merge, reduce, and replace will take an exclusive global write lock for the whole server, blocking all other writes in the db instance. If we want to avoid that, we should invoke MapReduce in the following way:

> db.collection.mapReduce(mapper, reducer, { out: { merge/reduce: bookOrders, nonAtomic: true } })

We can apply nonAtomic only to the merge or reduce actions. replace will just replace the contents of documents in bookOrders, which would not take much time anyway. With the merge action, the new result is merged with the existing result if the output collection already exists. If an existing document has the same key as the new result, then it will overwrite that existing document.
With the reduce action, the new result is processed together with the existing result if the output collection already exists. If an existing document has the same key as the new result, it will apply the reduce function to both the new and the existing documents and overwrite the existing document with the result. Although MapReduce has been present since the early versions of MongoDB, it hasn't evolved as much as the rest of the database, resulting in its usage being less than that of specialized MapReduce frameworks such as Hadoop.

Incremental MapReduce

Incremental MapReduce is a pattern where we use MapReduce to aggregate to previously calculated values. An example would be counting non-distinct users in a collection for different reporting periods (that is, hour, day, month) without the need to recalculate the result every hour. To set up our data for incremental MapReduce we need to do the following:

1. Output our reduce data to a different collection
2. At the end of every hour, query only for the data that got into the collection in the last hour
3. With the output of our reduce data, merge our results with the calculated results from the previous hour

Following up on the previous example, let's assume that we have a published field in each of the documents, with our input dataset being:

> db.books.find()
{ "_id" : ObjectId("592149c4aabac953a3a1e31e"), "isbn" : "101", "name" : "Mastering MongoDB", "price" : 30, "published" : ISODate("2017-06-25T00:00:00Z") }
{ "_id" : ObjectId("59214bc1aabac954263b24e0"), "isbn" : "102", "name" : "MongoDB in 7 years", "price" : 50, "published" : ISODate("2017-06-26T00:00:00Z") }

Using our previous example of counting books, we would get the following:

var mapper = function() { emit(this.id, 1); };
var reducer = function(id, count) { return Array.sum(count); };
> db.books.mapReduce(mapper, reducer, { out: "books_count" })
{
  "result" : "books_count",
  "timeMillis" : 16700,
  "counts" : { "input" : 2, "emit" : 2, "reduce" : 1, "output" : 1 },
  "ok" : 1
}
> db.books_count.find()
{ "_id" : null, "value" : 2 }

Now we get a third book in our mongo_books collection with this document:

{ "_id" : ObjectId("59214bc1aabac954263b24e1"), "isbn" : "103", "name" : "MongoDB for experts", "price" : 40, "published" : ISODate("2017-07-01T00:00:00Z") }

> db.books.mapReduce( mapper, reducer, { query: { published: { $gte: ISODate('2017-07-01 00:00:00') } }, out: { reduce: "books_count" } } )
> db.books_count.find()
{ "_id" : null, "value" : 3 }

What happened here is that by querying for documents published in July 2017 we only got the new document out of the query, and then used its value to reduce with the already calculated value of 2 in our books_count document, adding 1 to reach the final sum of three documents. This example, as contrived as it is, shows a powerful attribute of MapReduce: the ability to re-reduce results to incrementally calculate aggregations over time.

Troubleshooting MapReduce

Throughout the years, one of the major shortcomings of MapReduce frameworks has been the inherent difficulty in troubleshooting as opposed to simpler non-distributed patterns. Most of the time, the most effective tool is debugging using log statements to verify that output values match our expected values. In the mongo shell, this being a JavaScript shell, it is as simple as outputting using the console.log() function. Diving deeper into MapReduce in MongoDB, we can debug both the map and the reduce phase by overloading the output values.
To debug the mapper phase, we can overload the emit() function to test what the output key values are:

> var emit = function(key, value) {
    print("debugging mapper's emit");
    print("key: " + key + " value: " + tojson(value));
  }

We can then call it manually on a single document to verify that we get back the key-value pair that we would expect:

> var myDoc = db.orders.findOne( { _id: ObjectId("50a8240b927d5d8b5891743c") } );
> mapper.apply(myDoc);

The reducer function is somewhat more complicated. A MapReduce reducer function must meet the following criteria:

It must be idempotent
It must be commutative
The order of values coming from the mapper function should not matter for the reducer's result
The reduce function must return the same type of result as the mapper function

We will dissect these requirements to understand what they really mean:

It must be idempotent: MapReduce by design may call the reducer multiple times for the same key with multiple values from the mapper phase. It also doesn't need to reduce single instances of a key, as it's just added to the set. The final value should be the same no matter the order of execution. This can be verified by writing our own "verifier" function forcing the reducer to re-reduce, or by executing the reducer many, many times:

reduce( key, [ reduce(key, valuesArray) ] ) == reduce( key, valuesArray )

It must be commutative: Again, because multiple invocations of the reducer may happen for the same key, if it has multiple values, the following should hold:

reduce( key, [ C, reduce(key, [ A, B ]) ] ) == reduce( key, [ C, A, B ] )

The order of values coming from the mapper function should not matter for the reducer's result: We can test that the order of values from the mapper doesn't change the output for the reducer by passing in documents to the mapper in a different order and verifying that we get the same results out:

reduce( key, [ A, B ] ) == reduce( key, [ B, A ] )

The reduce function must return the same type of result as the mapper function: Hand in hand with the first requirement, the type of object that the reduce function returns should be the same as the output of the mapper function.

We saw how MapReduce is useful when implemented on a data pipeline. Multiple MapReduce commands can be chained to produce different results. An example would be aggregating data by different reporting periods (hour, day, week, month, year), where we use the output of each more granular reporting period to produce a less granular report. If you found this article useful, make sure to check our book Mastering MongoDB 3.x to get more insights and information about MongoDB's vast data storage, management and administration capabilities.
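The same books_count example can also be driven from Python. The following sketch assumes a locally running mongod, a books collection populated as above, and PyMongo 3.x, whose Collection.map_reduce helper wraps the mapReduce command shown earlier (the helper was removed in later PyMongo releases, where the command would instead be issued through db.command):

from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017")  # assumes a locally running mongod
db = client.mongo_books

# The same mapper and reducer as in the shell example, passed as JavaScript code.
mapper = Code("function () { emit(this.id, 1); }")
reducer = Code("function (id, count) { return Array.sum(count); }")

# Equivalent of db.books.mapReduce(mapper, reducer, { out: "books_count" }).
result = db.books.map_reduce(mapper, reducer, "books_count")

for doc in result.find():
    print(doc)  # e.g. {'_id': None, 'value': 3.0}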


How to compute Discrete Fourier Transform (DFT) using SciPy

Pravin Dhandre
02 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides numerous recipes to tackle day-to-day challenges associated with scientific computing and data manipulation using the SciPy stack.[/box]

Today, we will compute the Discrete Fourier Transform (DFT) and the inverse DFT using the SciPy stack. In this article, we will focus majorly on the syntax and the application of DFT in SciPy, assuming you are well versed with the mathematics of this concept.

Discrete Fourier Transforms

A discrete Fourier transform transforms any signal from its time/space domain into a related signal in the frequency domain. This allows us to not only analyze the different frequencies of the data, but also enables faster filtering operations, when used properly. It is possible to turn a signal in the frequency domain back to its time/spatial domain, thanks to the inverse Fourier transform (IFT).

How to do it…

To follow with the example, we need to continue with the following steps:

1. The basic routines in the scipy.fftpack module compute the DFT and its inverse, for discrete signals in any dimension—fft, ifft (one dimension), fft2, ifft2 (two dimensions), and fftn, ifftn (any number of dimensions).
2. Verify that all these routines assume that the data is complex valued. If we know beforehand that a particular dataset is actually real-valued, and should offer real-valued frequencies, we use rfft and irfft instead, for a faster algorithm.
3. These routines are designed so that composition with their inverses always yields the identity. The syntax is the same in all cases, as follows:

fft(x[, n, axis, overwrite_x])

The first parameter, x, is always the signal in any array-like form. Note that fft performs one-dimensional transforms. This means that if x happens to be two-dimensional, for example, fft will output another two-dimensional array, where each row is the transform of each row of the original. We can use columns instead, with the optional axis parameter. The rest of the parameters are also optional; n indicates the length of the transform and overwrite_x gets rid of the original data to save memory and resources. We usually play with the n integer when we need to pad the signal with zeros or truncate it. For a higher dimension, n is substituted by shape (a tuple) and axis by axes (another tuple).

To better understand the output, it is often useful to shift the zero frequencies to the center of the output arrays with fftshift. The inverse of this operation, ifftshift, is also included in the module.

How it works…

The following code shows some of these routines in action when applied to a checkerboard:

import numpy
from scipy.fftpack import fft, fft2, fftshift
import matplotlib.pyplot as plt

B = numpy.ones((4,4)); W = numpy.zeros((4,4))
signal = numpy.bmat("B,W;W,B")
onedimfft = fft(signal, n=16)
twodimfft = fft2(signal, shape=(16,16))
plt.figure()
plt.gray()
plt.subplot(121, aspect='equal')
plt.pcolormesh(onedimfft.real)
plt.colorbar(orientation='horizontal')
plt.subplot(122, aspect='equal')
plt.pcolormesh(fftshift(twodimfft.real))
plt.colorbar(orientation='horizontal')
plt.show()

Note how the first four rows of the one-dimensional transform are equal (and so are the last four), while the two-dimensional transform (once shifted) presents a peak at the origin and nice symmetries in the frequency domain.
In the following screenshot, which has been obtained from the previous code, the image on the left is the fft and the one on the right is the fft2 of a 2 x 2 checkerboard signal:

Computing the discrete Fourier transform (DFT) of a data series using the FFT algorithm

In this section, we will see how to compute the discrete Fourier transform and some of its applications.

How to do it…

In the following table, we will see the parameters to create a data series using the FFT algorithm:

How it works…

This code computes a discrete Fourier transform with the FFT in the main part:

np.fft.fft(np.exp(2j * np.pi * np.arange(8) / 8))
array([ -3.44505240e-16 +1.14383329e-17j,
         8.00000000e+00 -5.71092652e-15j,
         2.33482938e-16 +1.22460635e-16j,
         1.64863782e-15 +1.77635684e-15j,
         9.95839695e-17 +2.33482938e-16j,
         0.00000000e+00 +1.66837030e-15j,
         1.14383329e-17 +1.22460635e-16j,
        -1.64863782e-15 +1.77635684e-15j])

In this example, real input has an FFT that is Hermitian, that is, symmetric in the real part and anti-symmetric in the imaginary part, as described in the numpy.fft documentation:

import matplotlib.pyplot as plt
t = np.arange(256)
sp = np.fft.fft(np.sin(t))
freq = np.fft.fftfreq(t.shape[-1])
plt.plot(freq, sp.real, freq, sp.imag)
plt.show()

The following screenshot shows how we represent the results:

Computing the inverse DFT of a data series

In this section, we will learn how to compute the inverse DFT of a data series.

How to do it…

In this section we will see how to compute the inverse Fourier transform. The returned complex array contains y(0), y(1), ..., y(n-1), where y(j) = (1/n) Σ_{k=0}^{n-1} x(k) exp(2πi j k / n).

How it works…

In this part, we compute the inverse DFT:

np.fft.ifft([0, 4, 0, 0])
array([ 1.+0.j, 0.+1.j, -1.+0.j, 0.-1.j])

Create and plot a band-limited signal with random phases:

import matplotlib.pyplot as plt
t = np.arange(400)
n = np.zeros((400,), dtype=complex)
n[40:60] = np.exp(1j*np.random.uniform(0, 2*np.pi, (20,)))
s = np.fft.ifft(n)
plt.plot(t, s.real, 'b-', t, s.imag, 'r--')
plt.legend(('real', 'imaginary'))
plt.show()

Then we represent it, as shown in the following screenshot.

We successfully explored how to transform signals from the time or space domain into the frequency domain and vice versa, allowing you to analyze frequencies in detail. If you found this tutorial useful, do check out the book SciPy Recipes to get hands-on recipes to perform various data science tasks with ease.
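As a small self-contained check of the forward/inverse relationship described above (a sketch assuming only NumPy and SciPy), the following confirms that ifft(fft(x)) recovers the original signal and that the spectrum of real-valued input is Hermitian, as noted earlier:

import numpy as np
from scipy.fftpack import fft, ifft

# A real-valued test signal: two sinusoids plus a little noise.
t = np.linspace(0, 1, 128, endpoint=False)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t) + 0.01 * np.random.randn(t.size)

X = fft(x)
x_back = ifft(X)

# Composition of fft and ifft yields the identity (up to floating-point error).
print(np.allclose(x, x_back.real))      # True
print(np.max(np.abs(x_back.imag)))      # tiny imaginary residue, on the order of 1e-16

# For real input, the spectrum is Hermitian: X[k] equals the conjugate of X[N-k].
print(np.allclose(X[1:], np.conj(X[1:][::-1])))  # True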


4 must-know levels in MongoDB security

Amey Varangaonkar
01 Mar 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering MongoDB 3.x written by Alex Giamas. It presents the techniques and essential concepts needed to tackle even the trickiest problems when it comes to working with and administering your MongoDB instance.[/box]

Security is a multifaceted goal in a MongoDB cluster. In this article, we will examine different attack vectors and how we can protect MongoDB against them.

1. Authentication in MongoDB

Authentication refers to verifying the identity of a client. This prevents impersonating someone else in order to gain access to our data. The simplest way to authenticate is using a username/password pair. This can be done via the shell in two ways:

> db.auth( <username>, <password> )

Passing in a comma-separated username and password will assume default values for the rest of the fields:

> db.auth( {
    user: <username>,
    pwd: <password>,
    mechanism: <authentication mechanism>,
    digestPassword: <boolean>
  } )

If we pass a document object, we can define more parameters than username/password. The (authentication) mechanism parameter can take several different values, with the default being SCRAM-SHA-1. The parameter value MONGODB-CR is used for backwards compatibility with versions earlier than 3.0.

MONGODB-X509 is used for TLS/SSL authentication. Users and internal replica set servers can be authenticated using SSL certificates, which are self-generated and signed, or come from a trusted third-party authority. This is set in the configuration file with:

security.clusterAuthMode / net.ssl.clusterFile

Or like this on the command line, using --clusterAuthMode and --sslClusterFile:

> mongod --replSet <name> --sslMode requireSSL --clusterAuthMode x509 --sslClusterFile <path to membership certificate and key PEM file> --sslPEMKeyFile <path to SSL certificate and key PEM file> --sslCAFile <path to root CA PEM file>

MongoDB Enterprise Edition, the paid offering from MongoDB Inc., adds two more options for authentication. The first added option is GSSAPI (Kerberos). Kerberos is a mature and robust authentication system that can be used, among others, for Windows-based Active Directory deployments. The second added option is PLAIN (LDAP SASL). LDAP is, just like Kerberos, a mature and robust authentication mechanism. The main consideration when using the PLAIN authentication mechanism is that credentials are transmitted in plaintext over the wire. This means that we should secure the path between client and server via VPN or a TLS/SSL connection to avoid a man in the middle stealing our credentials.

2. Authorization in MongoDB

After we have configured authentication to verify that users are who they claim to be when connecting to our MongoDB server, we need to configure the rights that each one of them will have in our database. This is the authorization aspect of permissions. MongoDB uses role-based access control to control permissions for different user classes. Every role has permissions to perform some actions on a resource. A resource can be a collection or a database, or any collections or any databases. The format is:

{ db: <database>, collection: <collection> }

If we specify "" (empty string) for either db or collection, it means any db or collection. For example:

{ db: "mongo_books", collection: "" }

This would apply our action to every collection in the database mongo_books.
Similar to the preceding, we can define: { db: "", collection: "" } We define this to apply our rule to all collections across all databases, except system collections of course. We can also apply rules across an entire cluster as follows: { resource: { cluster : true }, actions: [ "addShard" ] } The preceding example grants privileges for the addShard action (adding a new shard to our system) across the entire cluster. The cluster resource can only be used for actions that affect the entire cluster rather than a collection or database, as for example shutdown, replSetReconfig, appendOplogNote, resync, closeAllDatabases, and addShard. What follows is an extensive list of cluster specific actions and some of the most widely used actions. The list of most widely used actions are: find insert remove update bypassDocumentValidation viewRole / viewUser createRole / dropRole createUser / dropUser inprog killop replSetGetConfig / replSetConfigure / replSetStateChange / resync getShardMap / getShardVersion / listShards / moveChunk / removeShard / addShard dropDatabase / dropIndex / fsync / repairDatabase / shutDown serverStatus / top / validate Cluster-specific actions are: unlock authSchemaUpgrade cleanupOrphaned cpuProfiler inprog invalidateUserCache killop appendOplogNote replSetConfigure replSetGetConfig replSetGetStatus replSetHeartbeat replSetStateChange resync addShard flushRouterConfig getShardMap listShards removeShard shardingState applicationMessage closeAllDatabases connPoolSync fsync getParameter hostInfo logRotate setParameter shutdown touch connPoolStats cursorInfo diagLogging getCmdLineOpts getLog listDatabases netstat serverStatus top If this sounds too complicated that is because it is. The flexibility that MongoDB allows in configuring different actions on resources means that we need to study and understand the extensive lists as described previously. Thankfully, some of the most common actions and resources are bundled in built-in roles. We can use the built-in roles to establish the baseline of permissions that we will give to our users and then fine grain these based on the extensive list. User roles in MongoDB There are two different generic user roles that we can specify: read: A read-only role across non-system collections and the following system collections: system.indexes, system.js, and system.namespaces collections readWrite: A read and modify role across non-system collections and the system.js collection Database administration roles in MongoDB There are three database specific administration roles shown as follows: dbAdmin: The basic admin user role which can perform schema-related tasks, indexing, gathering statistics. A dbAdmin cannot perform user and role management. userAdmin: Create and modify roles and users. This is complementary to the dbAdmin role. dbOwner: Combining readWrite, dbAdmin, and userAdmin roles, this is the most powerful admin user role. Cluster administration roles in MongoDB These are the cluster wide administration roles available: hostManager: Monitor and manage servers in a cluster. clusterManager: Provides management and monitoring actions on the cluster. A user with this role can access the config and local databases, which are used in sharding and replication, respectively. clusterMonitor: Read-only access for monitoring tools provided by MongoDB such as MongoDB Cloud Manager and Ops Manager agent. clusterAdmin: Provides the greatest cluster-management access. 
This role combines the privileges granted by the clusterManager, clusterMonitor, and hostManager roles. Additionally, the role provides the dropDatabase action. Backup restore roles Role-based authorization roles can be defined in the backup restore granularity level as Well: backup: Provides privileges needed to back-up data. This role provides sufficient privileges to use the MongoDB Cloud Manager backup agent, Ops Manager backup agent, or to use mongodump. restore: Provides privileges needed to restore data with mongorestore without the --oplogReplay option or without system.profile collection data. Roles across all databases Similarly, here are the set of available roles across all databases: readAnyDatabase: Provides the same read-only permissions as read, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. readWriteAnyDatabase: Provides the same read and write permissions as readWrite, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. userAdminAnyDatabase: Provides the same access to user administration operations as userAdmin, except it applies to all but the local and config databases in the cluster. Since the userAdminAnyDatabase role allows users to grant any privilege to any user, including themselves, the role also indirectly provides superuser access. dbAdminAnyDatabase: Provides the same access to database administration operations as dbAdmin, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. Superuser Finally, these are the superuser roles available: root: Provides access to the operations and all the resources of the readWriteAnyDatabase, dbAdminAnyDatabase, userAdminAnyDatabase, clusterAdmin, restore, and backup combined. __internal: Similar to root user, any __internal user can perform any action against any object across the server. 3. Network level security Apart from MongoDB specific security measures, there are best practices established for network level security: Only allow communication between servers and only open the ports that are used for communicating between them. Always use TLS/SSL for communication between servers. This prevents man-inthe- middle attacks impersonating a client. Always use different sets of development, staging, and production environments and security credentials. Ideally, create different accounts for each environment and enable two-factor authentication in both staging and production environments. 4. Auditing security No matter how much we plan our security measures, a second or third pair of eyes from someone outside our organization can give a different view of our security measures and uncover problems that we may not have thought of or underestimated. Don't hesitate to involve security experts / white hat hackers to do penetration testing in your servers. Special cases Medical or financial applications require added levels of security for data privacy reasons. If we are building an application in the healthcare space, accessing users' personal identifiable information, we may need to get HIPAA certified. If we are building an application interacting with payments and managing cardholder information, we may need to become PCI/DSS compliant. 
The specifics of each certification are outside the scope of this book but it is important to know that MongoDB has use cases in these fields that fulfill the requirements and as such it can be the right tool with proper design beforehand. To sum up, in addition to the best practices listed above, developers and administrators must always use common sense so that security interferes only as much as needed with operational goals. If you found our article useful, make sure to check out this book Mastering MongoDB 3.x to master other MongoDB administration-related techniques and become a true MongoDB expert.  
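As a hands-on recap of the authentication and authorization sections above, the following shell sketch creates a custom role limited to a single collection and a user that is granted this role. The database, collection, role, and user names are purely illustrative and not taken from the excerpt:
use mongo_books
db.createRole({
  role: "catalogReader",                                      // hypothetical role name
  privileges: [
    { resource: { db: "mongo_books", collection: "books" },   // hypothetical collection
      actions: [ "find" ] }
  ],
  roles: []
})
db.createUser({
  user: "report_user",                                        // hypothetical user name
  pwd: "useAStrongPasswordHere",
  roles: [ { role: "catalogReader", db: "mongo_books" } ]
})
// The new user can now authenticate with the default SCRAM-SHA-1 mechanism:
db.auth("report_user", "useAStrongPasswordHere")
On the network side, the same hardening ideas are usually expressed in the mongod configuration file through settings such as net.bindIp (to restrict the interfaces the server listens on) and security.authorization (to enforce the role-based access control described above).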


Implementing Apache Spark K-Means Clustering method on digital breath test data for road safety

Savia Lobo
01 Mar 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Mastering Apache Spark 2.x - Second Edition written by Romeo Kienzler. In this book, you will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.[/box] In today’s tutorial, we have used the Road Safety test data from our previous article, to show how one can attempt to find clusters in data using K-Means algorithm with Apache Spark MLlib. Theory on Clustering The K-Means algorithm iteratively attempts to determine clusters within the test data by minimizing the distance between the mean value of cluster center vectors, and the new candidate cluster member vectors. The following equation assumes dataset members that range from X1 to Xn; it also assumes K cluster sets that range from S1 to Sk, where K <= n. K-Means in practice The K-Means MLlib functionality uses the LabeledPoint structure to process its data and so it needs numeric input data. As the same data from the last section is being reused, we will not explain the data conversion again. The only change that has been made in data terms in this section, is that processing in HDFS will now take place under the /data/spark/kmeans/ directory. Additionally, the conversion Scala script for the K-Means example produces a record that is all comma-separated. The development and processing for the K-Means example has taken place under the /home/hadoop/spark/kmeans directory to separate the work from other development. The sbt configuration file is now called kmeans.sbt and is identical to the last example, except for the project name: name := "K-Means" The code for this section can be found in the software package under chapter7K-Means. So, looking at the code for kmeans1.scala, which is stored under kmeans/src/main/scala, some similar actions occur. The import statements refer to the Spark context and configuration. This time, however, the K-Means functionality is being imported from MLlib. Additionally, the application class name has been changed for this example to kmeans1: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.clustering.{KMeans,KMeansModel} object kmeans1 extends App { The same actions are being taken as in the last example to define the data file--to define the Spark configuration and create a Spark context: val hdfsServer = "hdfs://localhost:8020" val hdfsPath      = "/data/spark/kmeans/" val dataFile     = hdfsServer + hdfsPath + "DigitalBreathTestData2013- MALE2a.csv" val sparkMaster = "spark://localhost:7077" val appName = "K-Means 1" val conf = new SparkConf() conf.setMaster(sparkMaster) conf.setAppName(appName) val sparkCxt = new SparkContext(conf) Next, the CSV data is loaded from the data file and split by comma characters into the VectorData variable: val csvData = sparkCxt.textFile(dataFile) val VectorData = csvData.map { csvLine => Vectors.dense( csvLine.split(',').map(_.toDouble)) } A KMeans object is initialized, and the parameters are set to define the number of clusters and the maximum number of iterations to determine them: val kMeans = new KMeans val numClusters       = 3 val maxIterations     = 50 Some default values are defined for the initialization mode, number of runs, and Epsilon, which we needed for the K-Means call but did not vary for the processing. 
Finally, these parameters were set against the KMeans object: val initializationMode = KMeans.K_MEANS_PARALLEL val numRuns     = 1 val numEpsilon       = 1e-4 kMeans.setK( numClusters ) kMeans.setMaxIterations( maxIterations ) kMeans.setInitializationMode( initializationMode ) kMeans.setRuns( numRuns ) kMeans.setEpsilon( numEpsilon ) We cached the training vector data to improve the performance and trained the KMeans object using the vector data to create a trained K-Means model: VectorData.cache val kMeansModel = kMeans.run( VectorData ) We have computed the K-Means cost and number of input data rows, and have output the results via println statements. The cost value indicates how tightly the clusters are packed and how separate the clusters are: val kMeansCost = kMeansModel.computeCost( VectorData ) println( "Input data rows : " + VectorData.count() ) println( "K-Means Cost  : " + kMeansCost ) Next, we have used the K-Means Model to print the cluster centers as vectors for each of the three clusters that were computed: kMeansModel.clusterCenters.foreach{ println } Finally, we use the K-Means model predict function to create a list of cluster membership predictions. We then count these predictions by value to give a count of the data points in each cluster. This shows which clusters are bigger and whether there really are three clusters: val clusterRddInt = kMeansModel.predict( VectorData ) val clusterCount = clusterRddInt.countByValue clusterCount.toList.foreach{ println } } // end object kmeans1 So, in order to run this application, it must be compiled and packaged from the kmeans subdirectory as the Linux pwd command shows here: [hadoop@hc2nn kmeans]$ pwd /home/hadoop/spark/kmeans [hadoop@hc2nn kmeans]$ sbt package Loading /usr/share/sbt/bin/sbt-launch-lib.bash [info] Set current project to K-Means (in build file:/home/hadoop/spark/kmeans/) [info] Compiling 2 Scala sources to /home/hadoop/spark/kmeans/target/scala-2.10/classes... [info] Packaging /home/hadoop/spark/kmeans/target/scala-2.10/k- means_2.10-1.0.jar ... [info] Done packaging. [success] Total time: 20 s, completed Feb 19, 2015 5:02:07 PM Once this packaging is successful, we check HDFS to ensure that the test data is ready. As in the last example, we convert our data to numeric form using the convert.scala file, provided in the software package. We will process the DigitalBreathTestData2013- MALE2a.csv data file in the HDFS directory, /data/spark/kmeans, as follows: [hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/kmeans Found 3 items -rw-r--r--   3 hadoop supergroup 24645166 2015-02-05 21:11 /data/spark/kmeans/DigitalBreathTestData2013-MALE2.csv -rw-r--r--   3 hadoop supergroup 5694226 2015-02-05 21:48 /data/spark/kmeans/DigitalBreathTestData2013-MALE2a.csv drwxr-xr-x - hadoop supergroup   0 2015-02-05 21:46 /data/spark/kmeans/result The spark-submit tool is used to run the K-Means application. The only change in this command is that the class is now kmeans1: spark-submit --class kmeans1 --master spark://localhost:7077 --executor-memory 700M --total-executor-cores 100 /home/hadoop/spark/kmeans/target/scala-2.10/k-means_2.10-1.0.jar The output from the Spark cluster run is shown to be as follows: Input data rows : 467054 K-Means Cost  : 5.40312223450789E7 The previous output shows the input data volume, which looks correct; it also shows the K- Means cost value. 
The cost is based on the Within Set Sum of Squared Errors (WSSSE), which basically gives a measure of how well the found cluster centroids match the distribution of the data points. The better they match, the lower the cost. The following link https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/ explains WSSSE and how to find a good value for k in more detail. Next come the three vectors, which describe the data cluster centers with the correct number of dimensions. Remember that these cluster centroid vectors will have the same number of columns as the original vector data: [0.24698249738061878,1.3015883142472253,0.005830116872250263,2.9173747788555207,1.156645130895448,3.4400290524342454] [0.3321793984152627,1.784137241326256,0.007615970459266097,2.5831987075928917,119.58366028156011,3.8379106085083468] [0.25247226760684494,1.702510963969387,0.006384899819416975,2.231404248000688,52.202897927594805,3.551509158139135] Finally, cluster membership is given for clusters 1 to 3 with cluster 1 (index 0) having the largest membership at 407539 member vectors: (0,407539) (1,12999) (2,46516) To summarize, we saw a practical example that shows how the K-Means algorithm is used to cluster data with the help of Apache Spark. If you found this post useful, do check out this book Mastering Apache Spark 2.x - Second Edition to learn about the latest enhancements in Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.
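Returning to the WSSSE discussion above, a common way to choose k is the elbow method: train the model for several candidate values of k and compare the resulting costs. The following Scala sketch reuses the VectorData RDD from the example; the range of k values is arbitrary and only meant as an illustration:
// Sketch: compute the K-Means cost (WSSSE) for a range of k values.
// The "elbow", where the cost stops dropping sharply, is a reasonable choice for k.
VectorData.cache
(2 to 10).foreach { k =>
  val candidateModel = new KMeans()
    .setK(k)
    .setMaxIterations(50)
    .run(VectorData)
  println(s"k = $k, WSSSE = ${candidateModel.computeCost(VectorData)}")
}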

6 index types in PostgreSQL 10 you should know

Sugandha Lahoti
28 Feb 2018
13 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering  PostgreSQL 10 written by Hans-Jürgen Schönig. This book will help you master the capabilities of PostgreSQL 10 to efficiently manage and maintain your database.[/box] In today’s post, we will learn about the different index types available for sorting in PostgreSQL and also understand how they function. What are index types and why you need them Data types can be sorted in a useful way. Just imagine a polygon. How would you sort these objects in a useful way? Sure, you can sort by the area covered, its length or so, but doing this won't allow you to actually find them using a geometric search. The solution to the problem is to provide more than just one index type. Each index will serve a special purpose and do exactly what is needed. The following six index types are available (as of PostgreSQL 10.0): test=# SELECT * FROM pg_am; amname  | amhandler   | amtype ---------+-------------+-------- btree | bthandler | i hash     | hashhandler | i GiST     | GiSThandler | i Gin | ginhandler   | i spGiST   | spghandler   | i brin | brinhandler | i (6 rows) A closer look at the 6 index types in PostgreSQL 10 The following sections will outline the purpose of each index type available in PostgreSQL. Note that there are some extensions that can be used on top of what you can see here. Additional index types available on the web are rum, vodka, and in the future, cognac. Hash indexes Hash indexes have been around for many years. The idea is to hash the input value and store it for later lookups. Having hash indexes actually makes sense. However, before PostgreSQL 10.0, it was not advised to use hash indexes because PostgreSQL had no WAL support for them. In PostgreSQL 10.0, this has changed. Hash indexes are now fully logged and are therefore ready for replication and are considered to be a 100% crash safe. Hash indexes are generally a bit larger than b-tree indexes. Suppose you want to index 4 million integer values. A btree will need around 90 MB of storage to do this. A hash index will need around 125 MB on disk. The assumption made by many people that a hash is super small on the disk is therefore, in many cases, just wrong. GiST indexes Generalized Search Tree (GiST) indexes are highly important index types because they are used for a variety of different things. GiST indexes can be used to implement R-tree behavior and it is even possible to act as b-tree. However, abusing GiST for b-tree indexes is not recommended. Typical use cases for GiST are as follows: Range types Geometric indexes (for example, used by the highly popular PostGIS extension) Fuzzy searching Understanding how GiST works To many people, GiST is still a black box. We will now discuss how GiST works internally. Consider the following diagram: Source: http://leopard.in.ua/assets/images/postgresql/pg_indexes/pg_indexes2.jpg   Take a look at the tree. You will see that R1 and R2 are on top. R1 and R2 are the bounding boxes containing everything else. R3, R4, and R5 are contained by R1. R8, R9, and R10 are contained by R3, and so on. A GiST index is therefore hierarchically organized. What you can see in the diagram is that some operations, which are not available in b-trees are supported. Some of those operations are overlaps, left of, right of, and so on. The layout of a GiST tree is ideal for geometric indexing. Extending GiST Of course, it is also possible to come up with your own operator classes. 
The following strategies are supported: Operation Strategy number Strictly  left  of 1 Does  not  extend  to  right  of 2 Overlaps 3 Does  not  extend  to  left  of 4 Strictly  right  of 5 Same 6 Contains 7 Contained  by 8 Does  not  extend  above 9 Strictly  below 10 Strictly  above 11 Does  not  extend  below 12 If you want to write operator classes for GiST, a couple of support functions have to be provided. In the case of a b-tree, there is only the same function - GiST indexes provide a lot more: Function Description Support function number consistent The functions determine whether a key satisfies the query qualifier. Internally, strategies are looked up and checked. 1 union Calculate the union of a set of keys. In case of numeric values, simply the upper and lower values or a range are computed. It is especially important to geometries. 2 compress Compute a compressed representation of a key or value. 3 decompress This is the counterpart of the compress function. 4   penalty During insertion, the cost of inserting into the tree will be calculated. The cost determines where the new entry will go inside the tree. Therefore, a good penalty function is key to the good overall performance of the index. 5 picksplit Determines where to move entries in case of a page split. Some entries have to stay on the old page while others will go to the new page being created. Having a good picksplit function is essential to a good index performance. 6 equal The equal function is similar to the same function you have already seen in b-trees. 7 distance Calculates the distance (a number) between a key and the query value. The distance function is optional and is needed in case KNN search is supported. 8 fetch Determine the original representation of a compressed key. This function is needed to handle index only scans as supported by the recent version of PostgreSQL. 9 Implementing operator classes for GiST indexes is usually done in C. If you are interested in a good example, I advise you to check out the btree_GiST module in the contrib directory. It shows how to index standard data types using GiST and is a good source of information as well as inspiration. GIN indexes Generalized inverted (GIN) indexes are a good way to index text. Suppose you want to index a million text documents. A certain word may occur millions of times. In a normal b- tree, this would mean that the key is stored millions of times. Not so in a GIN. Each key (or word) is stored once and assigned to a document list. Keys are organized in a standard b- tree. Each entry will have a document list pointing to all entries in the table having the same key. A GIN index is very small and compact. However, it lacks an important feature found in the b-trees-sorted data. In a GIN, the list of item pointers associated with a certain key is sorted by the position of the row in the table and not by some arbitrary criteria. Extending GIN Just like any other index, GIN can be extended. The following strategies are available: Operation Strategy number Overlap 1 Contains 2 Is  contained  by 3 Equal 4 On top of this, the following support functions are available: Function Description Support function number compare The compare function is similar to the same function you have seen in b-trees. If two keys are compared, it returns -1 (lower), 0 (equal), or 1 (higher). 1 extractValue Extract keys from a value to be indexed. A value can have many keys. For example, a text value might consist of more than one word. 
2 extractQuery Extract keys from a query condition. 3 consistent Check whether a value matches a query condition. 4 comparePartial Compare a partial key from a query and a key from the index. Returns -1, 0, or 1 (similar to the same function supported by b-trees). 5 triConsistent Determine whether a value matches a query condition (ternary variant). It is optional if the consistent function is present. 6 If you are looking for a good example of how to extend GIN, consider looking at the btree_gin module in the PostgreSQL contrib directory. It is a valuable source of information and a good way to start your own implementation. SP-GiST indexes Space partitioned GiST (SP-GiST) has mainly been designed for in-memory use. The reason for this is an SP-GiST stored on disk needs a fairly high number of disk hits to function. Disk hits are way more expensive than just following a couple of pointers in RAM. The beauty is that SP-GiST can be used to implement various types of trees such as quad- trees, k-d trees, and radix trees (tries). The following strategies are provided: Operation Strategy number Strictly  left  of 1 Strictly  right  of 5 Same 6 Contained  by 8 Strictly  below 10 Strictly  above 11 To write your own operator classes for SP-GiST, a couple of functions have to be provided: Function Description Support function number config Provides information about the operator class in use 1 choose Figures out how to insert a new value into an inner tuple 2 picksplit Figures out how to partition/split a set of values 3 inner_consistent Determine which subpartitions need to be searched for a query 4 leaf_consistent Determine whether key satisfies the query qualifier 5 BRIN indexes Block range indexes (BRIN) are of great practical use. All indexes discussed until now need quite a lot of disk space. Although a lot of work has gone into shrinking GIN indexes and the like, they still need quite a lot because an index pointer is needed for each entry. So, if there are 10 million entries, there will be 10 million index pointers. Space is the main concern addressed by the BRIN indexes. A BRIN index does not keep an index entry for each tuple but will store the minimum and the maximum value of 128 (default) blocks of data (1 MB). The index is therefore very small but lossy. Scanning the index will return more data than we asked for. PostgreSQL has to filter out these additional rows in a later step. The following example demonstrates how small a BRIN index really is: test=# CREATE INDEX idx_brin ON t_test USING brin(id); CREATE INDEX test=# di+ idx_brin List of relations Schema | Name    | Type   | Owner | Table | Size --------+----------+-------+-------+--------+-------+------------- public | idx_brin | index | hs | t_test | 48 KB (1 row) In my example, the BRIN index is 2,000 times smaller than a standard b-tree. The question naturally arising now is, why don't we always use BRIN indexes? To answer this kind of question, it is important to reflect on the layout of BRIN; the minimum and maximum value for 1 MB are stored. If the data is sorted (high correlation), BRIN is pretty efficient because we can fetch 1 MB of data, scan it, and we are done. However, what if the data is shuffled? In this case, BRIN won't be able to exclude chunks of data anymore because it is very likely that something close to the overall high and the overall low is within 1 MB of data. Therefore, BRIN is mostly made for highly correlated data. In reality, correlated data is quite likely in data warehousing applications. 
Often, data is loaded every day and therefore dates can be highly correlated. Extending BRIN indexes BRIN supports the same strategies as a b-tree and therefore needs the same set of operators. The code can be reused nicely: Operation Strategy number Less  than 1 Less  than  or  equal 2 Equal 3 Greater  than  or  equal 4 Greater  than 5 The support functions needed by BRIN are as follows: Function Description Support function number opcInfo Provide internal information about the indexed columns 1 add_value Add an entry to an existing summary tuple 2 consistent Check whether a value matches a condition 3 union Calculate the union of two summary entries (minimum/maximum values) 4 Adding additional indexes Since PostgreSQL 9.6, there has been an easy way to deploy entirely new index types as extensions. This is pretty cool because if those index types provided by PostgreSQL are not enough, it is possible to add additional ones serving precisely your purpose. The instruction to do this is CREATE ACCESS  METHOD: test=# h CREATE ACCESS METHOD Command: CREATE ACCESS METHOD Description: define a new access method Syntax: CREATE ACCESS METHOD name TYPE access_method_type HANDLER handler_function Don't worry too much about this command—just in case you ever deploy your own index type, it will come as a ready-to-use extension. One of these extensions implements bloom filters. Bloom filters are probabilistic data structures. They sometimes return too many rows but never too few. Therefore, a bloom filter is a good method to pre-filter data. How does it work? A bloom filter is defined on a couple of columns. A bitmask is calculated based on the input values, which is then compared to your query. The upside of a bloom filter is that you can index as many columns as you want. The downside is that the entire bloom filter has to be read. Of course, the bloom filter is smaller than the underlying data and so it is, in many cases, very beneficial. To use bloom filters, just activate the extension, which is a part of the PostgreSQL contrib package: test=# CREATE EXTENSION bloom; CREATE EXTENSION As stated previously, the idea behind a bloom filter is that it allows you to index as many columns as you want. In many real-world applications, the challenge is to index many columns without knowing which combinations the user will actually need at runtime. In the case of a large table, it is totally impossible to create standard b-tree indexes on, say, 80 fields or more. A bloom filter might be an alternative in this case: test=# CREATE TABLE t_bloom (x1 int, x2 int, x3 int, x4 int, x5 int, x6 int, x7 int); CREATE TABLE Creating the index is easy: test=# CREATE INDEX idx_bloom ON t_bloom USING bloom(x1, x2, x3, x4, x5, x6, x7); CREATE INDEX If sequential scans are turned off, the index can be seen in action: test=# SET enable_seqscan TO off; SET test=# explain SELECT * FROM t_bloom WHERE x5 = 9 AND x3 = 7; QUERY PLAN ------------------------------------------------------------------------- Bitmap Heap Scan on t_bloom (cost=18.50..22.52 rows=1 width=28) Recheck Cond: ((x3 = 7) AND (x5 = 9)) -> Bitmap Index Scan on idx_bloom (cost=0.00..18.50 rows=1 width=0) Index Cond: ((x3 = 7) AND (x5 = 9)) Note that I have queried a combination of random columns; they are not related to the actual order in the index. The bloom filter will still be beneficial. If you are interested in bloom filters, consider checking out the website: https://en.wikipedia.org/wiki/Bloom_filter. 
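One index type from the walkthrough above that was not shown with a SQL example is GIN, so here is a minimal full-text search sketch; the table and column names are made up for illustration:
test=# CREATE TABLE t_doc (id serial PRIMARY KEY, payload text);
CREATE TABLE
test=# CREATE INDEX idx_gin_payload ON t_doc USING gin (to_tsvector('english', payload));
CREATE INDEX
test=# SELECT id FROM t_doc WHERE to_tsvector('english', payload) @@ to_tsquery('english', 'index & type');
Note that the WHERE clause has to use the same to_tsvector('english', payload) expression as the index definition, otherwise the planner cannot use the expression index.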
We learnt how to use the indexing features in PostgreSQL and fine-tune the performance of our queries. If you liked our article, check out the book Mastering  PostgreSQL 10 to implement advanced administrative tasks such as server maintenance and monitoring, replication, recovery, high availability, etc in PostgreSQL 10.  


Getting to know SQL Server options for disaster recovery

Sunith Shetty
27 Feb 2018
10 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Marek Chmel and Vladimír Mužný titled SQL Server 2017 Administrator's Guide. This book will help you learn to implement and administer successful database solutions with SQL Server 2017.[/box] Today, we will explore the disaster recovery basics to understand the common terms in high availability and disaster recovery. We will then discuss SQL Server offering for HA/DR options. Disaster recovery basics Disaster recovery (DR) is a set of tools, policies, and procedures, which help us during the recovery of your systems after a disastrous event. Disaster recovery is just a subset of a more complex discipline called business continuity planning, where more variables come in place and you expect more sophisticated plans on how to recover the business operations. With careful planning, you can minimize the effects of the disaster, because you have to keep in mind that it's nearly impossible to completely avoid disasters. The main goal of a disaster recovery plan is to minimize the downtime of our service and to minimize the data loss. To measure these objectives, we use special metrics: Recovery Point and Time Objectives. Recovery Time Objective (RTO) is the maximum time that you can use to recover the system. This time includes your efforts to fix the problem without starting the disaster recovery procedures, the recovery itself, proper testing after the disaster recovery, and the communication to the stakeholders. Once a disaster strikes, clocks are started to measure the disaster recovery actions and the Recovery Time Actual (RTA) metric is calculated. If you manage to recover the system within the Recovery Time Objective, which means that RTA < RTO, then you have met the metrics with a proper combination of the plan and your ability to restore the system. Recovery Point Objective (RPO) is the maximum tolerable period for acceptable data loss. This defines how much data can be lost due to disaster. The Recovery Point Objective has an impact on your implementation of backups, because you plan for a recovery strategy that has specific requirements for your backups. If you can avoid to lose one day of work, you can properly plan your backup types and the frequency of the backups that you need to take. The following image is an illustration of the very concepts that we discussed in the preceding paragraph: When we talk about system availability, we usually use a percentage of the availability time. This availability is a calculated uptime in a given year or month (any date metric that you need) and is usually compared to a following table of "9s". Availability also expresses a tolerable downtime in a given time frame so that the system still meets the availability metric. 
In the following table, we'll see some basic availability options with tolerable downtime a year and a day: Availability % Downtime a year Downtime a day 90% 36.5 days 2.4 hours 98% 7.3 days 28.8 minutes 99% 3.65 days 14.4 minutes 99.9% 8.76 hours 1.44 minutes 99.99% 52.56 minutes 8.64 seconds 99.999% 5.26 minutes less than 1 second This tolerable downtime consists of the unplanned downtime and can be caused by many factors: Natural Disasters Hardware failures Human errors (accidental deletes, code breakdowns, and so on) Security breaches Malware For these, we can have a mitigation plan in place that will help us reduce the downtime to a tolerable range, and we usually deploy a combination of high availability solutions and disaster recovery solutions so that we can quickly restore the operations. On the other hand, there's a reasonable set of events that require a downtime on your service due to the maintenance and regular operations, which does not affect the availability on your system. These can include the following: New releases of the software Operating system patching SQL Server patching Database maintenance and upgrades Our goal is to have the database online as much as possible, but there will be times when the database will be offline and, from the perspective of the management and operation, we're talking about several keywords such as uptime, downtime, time to repair, and time between failures, as you can see in the following image: It's really critical not only to have a plan for disaster recovery, but also to practice the disaster recovery itself. Many companies follow the procedure of proper disaster recovery plan testing with different types of exercise where each and every aspect of the disaster recovery is carefully evaluated by teams who are familiar with the tools and procedures for a real disaster event. This exercise may have different scope and frequency, as listed in the following points: Tabletop exercises usually involve only a small number of people and focus on a specific aspect of the DR plan. This would be a DBA team drill to recover a single SQL Server or a small set of servers with simulated outage. Medium-sized exercises will involve several teams to practice team communication and interaction. Complex exercises usually simulate larger events such as data center loss, where a new virtual data center is built and all new servers and services are provisioned by the involved teams. Such exercises should be run on a periodic basis so that all the teams and team personnel are up to speed with the disaster recovery plans. SQL Server options for high availability and disaster recovery SQL Server has many features that you can put in place to implement a HA/DR solution that will fit your needs. These features include the following: Always On Failover Cluster Always On Availability Groups Database mirroring Log shipping Replication In many cases, you will combine more of the features together, as your high availability and disaster recovery needs will overlap. HA/DR does not have to be limited to just one single feature. In complex scenarios, you'll plan for a primary high availability solution and secondary high availability solution that will work as your disaster recovery solution at the same time. Always On Failover Cluster An Always On Failover Cluster (FCI) is an instance-level protection mechanism, which is based on top of a Windows Failover Cluster Feature (WFCS). 
SQL Server instance will be installed across multiple WFCS nodes, where it will appear in the network as a single computer. All the resources that belong to one SQL Server instance (disk, network, names) can be owned by one node of the cluster and, during any planned or unplanned event like a failure of any server component, these can be moved to another node in the cluster to preserve operations and minimize downtime, as shown in the following image: Always On Availability Groups Always On Availability Groups were introduced with SQL Server 2012 to bring a database-level protection to the SQL Server. As with the Always On Failover Cluster, Availability Groups utilize the Windows Failover Cluster feature, but in this case, single SQL Server is not installed as a clustered instance but runs independently on several nodes. These nodes can be configured as Always On Availability Group nodes to host a database, which will be synchronized among the hosts. The replica can be either synchronous or asynchronous, so Always On Availability Groups are a good fit either as a solution for one data center or even distant data centers to keep your data safe. With new SQL Server versions, Always On Availability Groups were enhanced and provide many features for database high availability and disaster recovery scenarios. You can refer to the following image for a better understanding: Database mirroring Database mirroring is an older HA/DR feature available in SQL Server, which provides database-level protection. Mirroring allows synchronizing the databases between two servers, where you can include one more server as a witness server as a failover quorum. Unlike the previous two features, database mirroring does not require any special setup such as Failover Cluster and the configuration can be achieved via SSMS using a wizard available via database properties. Once a transaction occurs on the primary node, it's copied to the second node to the mirrored database. With proper configuration, database mirroring can provide failover options for high availability with automatic client redirection. Database mirroring is not preferred solution for HA/DR, since it's marked as a deprecated feature from SQL Server 2012 and is replaced by Basic Availability Groups on current versions. Log shipping Log shipping configuration, as the name suggests, is a mechanism to keep a database in sync by copying the logs to the remote server. Log shipping, unlike mirroring, is not copying each single transaction, but copies the transactions in batches via transaction log backup on the primary node and log restore on the secondary node. Unlike all previously mentioned features, log shipping does not provide an automatic failover option, so it's considered more as a disaster recovery option than a high availability one. Log shipping operates on regular intervals where three jobs have to run: Backup job to backup the transaction log on the primary system Copy job to copy the backups to the secondary system Restore job to restore the transaction log backup on the secondary system Log shipping supports multiple standby databases, which is quite an advantage compared to database mirroring. One more advantage is the standby configuration for log shipping, which allows read-only access to the secondary database. This is mainly used for many reporting scenarios, where the reporting applications use read-only access and such configuration allows performance offload to the secondary system. 
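To make the log shipping mechanics more tangible, here is a T-SQL sketch of the three steps performed manually for a hypothetical SalesDB database; in a real deployment these would be the scheduled backup, copy, and restore jobs, and all paths and file names are illustrative:
-- Step 1: on the primary server, back up the transaction log
BACKUP LOG SalesDB TO DISK = N'\\backupshare\logship\SalesDB_20180227.trn';
-- Step 2: copy the .trn file to the secondary server (normally done by the copy job)
-- Step 3: on the secondary server, restore the log backup;
-- WITH STANDBY keeps the database readable (read-only) between restores
RESTORE LOG SalesDB
    FROM DISK = N'D:\logship\SalesDB_20180227.trn'
    WITH STANDBY = N'D:\logship\SalesDB_undo.dat';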
Replication Replication is a feature for data movement from one server to another that allows many different scenarios and topologies. Replication uses a model of publisher/subscriber, where the Publisher is the server offering the content via a replication article and subscribers are getting the data. The configuration is more complex compared to mirroring and log shipping features, but allows you much more variety in the configuration for security, performance, and topology. Replication has many benefits and a few of them are as follows: Works on the object level (whereas other features work on database or instance level) Allows merge replication, where more servers synchronize data between each other Allows bi-directional synchronization of data Allows other than SQL Server partners (Oracle, for example) There are several different replication types that can be used with SQL Server, and you can choose them based on the needs for HA/DR options and the data availability requirements on the secondary servers. These options include the following: Snapshot replication Transactional replication Peer-to-peer replication Merge replication We introduced the disaster recovery discipline with the whole big picture of business continuity on SQL Server. Disaster recovery is not only about having backups, but more about the ability to bring the service back to operation after severe failures. We have seen several options that can be used to implement part of disaster recovery on SQL Server--log shipping, replication, and mirroring. To know more about how to design and use an optimal database management strategy, do checkout the book SQL Server 2017 Administrator's Guide.  


How to win Kaggle competition with Apache SparkML

Savia Lobo
27 Feb 2018
11 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Mastering Apache Spark 2.x - Second Edition written by Romeo Kienzler. The book will introduce you to Project Tungsten and Catalyst, two of the major advancements of Apache Spark 2.x.[/box] In today’s tutorial we will show how to take advantage of Apache SparkML to win a Kaggle competition. We'll use an archived competition offered by BOSCH, a German multinational engineering and electronics company, on production line performance data. The data for this competition represents measurement of parts as they move through Bosch's production line. Each part has a unique Id. The goal is to predict which part will fail quality control (represented by a 'Response' = 1). For more details on the competition data you may visit the website: https://www.kaggle.com/c/bosch-production-line-p erformance/data. Data preparation The challenge data comes in three ZIP packages but we only use two of them. One contains categorical data, one contains continuous data, and the last one contains timestamps of  measurements, which we will ignore for now. If you extract the data, you'll get three large CSV files. So the first thing that we want to do is re-encode them into parquet in order to be more space-efficient: def convert(filePrefix : String) = { val basePath = "yourBasePath" var df = spark .read .option("header",true) .option("inferSchema", "true") .csv("basePath+filePrefix+".csv") df = df.repartition(1) df.write.parquet(basePath+filePrefix+".parquet") } convert("train_numeric") convert("train_date") convert("train_categorical") First, we define a function convert that just reads the .csv file and rewrites it as a .parquet  file. As you can see, this saves a lot of space: Now we read the files in again as DataFrames from the parquet files : var df_numeric = spark.read.parquet(basePath+"train_numeric.parquet") var df_categorical = spark.read.parquet(basePath+"train_categorical.parquet") Here is the output of the same: This is very high-dimensional data; therefore, we will take only a subset of the columns for this illustration: df_categorical.createOrReplaceTempView("dfcat") var dfcat = spark.sql("select Id, L0_S22_F545 from dfcat") In the following picture, you can see the unique categorical values of that column: Now let's do the same with the numerical dataset: df_numeric.createOrReplaceTempView("dfnum") var dfnum = spark.sql("select Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,Response from dfnum") Here is the output of the same: Finally, we rejoin these two relations: var df = dfcat.join(dfnum,"Id") df.createOrReplaceTempView("df") Then we have to do some NA treatment: var df_notnull = spark.sql(""" select Response as label, case when L0_S22_F545 is null then 'NA' else L0_S22_F545 end as L0_S22_F545, case when L0_S0_F0 is null then 0.0 else L0_S0_F0 end as L0_S0_F0, case when L0_S0_F2 is null then 0.0 else L0_S0_F2 end as L0_S0_F2, case when L0_S0_F4 is null then 0.0 else L0_S0_F4 end as L0_S0_F4 from df """) Feature engineering Now it is time to run the first transformer (which is actually an estimator). It is StringIndexer and needs to keep track of an internal mapping table between strings and indexes. 
Therefore, it is not a transformer but an estimator: import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer} var indexer = new StringIndexer() .setHandleInvalid("skip") .setInputCol("L0_S22_F545") .setOutputCol("L0_S22_F545Index") var indexed = indexer.fit(df_notnull).transform(df_notnull) indexed.printSchema As we can see clearly in the following image, an additional column called L0_S22_F545Index has been created: Finally, let's examine some content of the newly created column and compare it with the source column. We can clearly see how the category string gets transformed into a float index: Now we want to apply OneHotEncoder, which is a transformer, in order to generate better features for our machine learning model: var encoder = new OneHotEncoder() .setInputCol("L0_S22_F545Index") .setOutputCol("L0_S22_F545Vec") var encoded = encoder.transform(indexed) As you can see in the following figure, the newly created column L0_S22_F545Vec contains org.apache.spark.ml.linalg.SparseVector objects, which is a compressed representation of a sparse vector: Note: Sparse vector representations: The OneHotEncoder, like many other algorithms, returns a sparse vector of the org.apache.spark.ml.linalg.SparseVector type as, according to the definition, only one element of the vector can be one, the rest has to remain zero. This gives a lot of opportunity for compression as only the position of the elements that are non-zero has to be known. Apache Spark uses a sparse vector representation in the following format: (l,[p],[v]), where l stands for length of the vector, p for position (this can also be an array of positions), and v for the actual values (this can be an array of values). So if we get (13,[10],[1.0]), as in our earlier example, the actual sparse vector looks like this: (0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0). So now that we are done with our feature engineering, we want to create one overall sparse vector containing all the necessary columns for our machine learner. This is done using VectorAssembler: import org.apache.spark.ml.feature.VectorAssembler import org.apache.spark.ml.linalg.Vectors var vectorAssembler = new VectorAssembler() .setInputCols(Array("L0_S22_F545Vec", "L0_S0_F0", "L0_S0_F2","L0_S0_F4")) .setOutputCol("features") var assembled = vectorAssembler.transform(encoded) We basically just define a list of column names and a target column, and the rest is done for us: As the view of the features column got a bit squashed, let's inspect one instance of the feature field in more detail: We can clearly see that we are dealing with a sparse vector of length 16 where positions 0, 13, 14, and 15 are non-zero and contain the following values: 1.0, 0.03, -0.034, and -0.197. Done! Let's create a Pipeline out of these components. Testing the feature engineering pipeline Let's create a Pipeline out of our transformers and estimators: import org.apache.spark.ml.Pipeline import org.apache.spark.ml.PipelineModel //Create an array out of individual pipeline stages var transformers = Array(indexer, encoder, vectorAssembler) var pipeline = new Pipeline().setStages(transformers).fit(df_notnull) var transformed = pipeline.transform(df_notnull) Note that the setStages method of Pipeline just expects an array of transformers and estimators (the pipeline stages), which we had created earlier, so we pass in the stage objects indexer, encoder, and vectorAssembler rather than the already transformed DataFrames. As parts of the Pipeline contain estimators, we have to run fit on our DataFrame first.
The obtained Pipeline object takes a DataFrame in the transform method and returns the results of the transformations: As expected, we obtain the very same DataFrame as we had while running the stages individually in a sequence. Training the machine learning model Now it's time to add another component to the Pipeline: the actual machine learning algorithm, RandomForest: import org.apache.spark.ml.classification.RandomForestClassifier var rf = new RandomForestClassifier() .setLabelCol("label") .setFeaturesCol("features") var model = new Pipeline().setStages(transformers :+ rf).fit(df_notnull) var result = model.transform(df_notnull) This code is very straightforward. First, we have to instantiate our algorithm and obtain it as a reference in rf. We could have set additional parameters to the model but we'll do this later in an automated fashion in the CrossValidation step. Then, we just add the stage to our Pipeline, fit it, and finally transform. The fit method, apart from running all upstream stages, also calls fit on the RandomForestClassifier in order to train it. The trained model is now contained within the Pipeline and the transform method actually creates our predictions column: As we can see, we've now obtained an additional column called prediction, which contains the output of the RandomForestClassifier model. Of course, we've only used a very limited subset of available features/columns and have also not yet tuned the model, so we don't expect to do very well; however, let's take a look at how we can evaluate our model easily with Apache SparkML. Model evaluation Without evaluation, a model is worth nothing as we don't know how accurately it performs. Therefore, we will now use the built-in BinaryClassificationEvaluator in order to assess prediction performance and a widely used measure called areaUnderROC (going into detail here is beyond the scope of this book): import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator val evaluator = new BinaryClassificationEvaluator() import org.apache.spark.ml.param.ParamMap var evaluatorParamMap = ParamMap(evaluator.metricName -> "areaUnderROC") var aucTraining = evaluator.evaluate(result, evaluatorParamMap) As we can see, there is a built-in class called org.apache.spark.ml.evaluation.BinaryClassificationEvaluator and there are some other classes for other prediction use cases such as RegressionEvaluator or MulticlassClassificationEvaluator. The evaluator takes a parameter map--in this case, we are telling it to use the areaUnderROC metric--and finally, the evaluate method evaluates the result: As we can see, areaUnderROC is 0.5424418446501833. An ideal classifier would return a score of one. So we are only doing a bit better than random guesses but, as already stated, the number of features that we are looking at is fairly limited. Note: In the previous example we are using the areaUnderROC metric, which is used for the evaluation of binary classifiers. There exists an abundance of other metrics used for different disciplines of machine learning, such as accuracy, precision, recall, and F1 score. The following provides a good overview: http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf This areaUnderROC is in fact a very bad value. Let's see if choosing better parameters for our RandomForest model increases this a bit in the next section.
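Before moving on to tuning, one caveat about the number above: it was measured on the same DataFrame the model was trained on, which tends to overestimate performance. A small sketch, not part of the original example, of scoring against a held-out split instead (the split ratio and seed are arbitrary, and the stages are listed explicitly):
// Sketch: hold back 20% of the data and evaluate on it rather than on the training set
val Array(trainDF, testDF) = df_notnull.randomSplit(Array(0.8, 0.2), seed = 42L)
val heldOutModel = new Pipeline()
  .setStages(Array(indexer, encoder, vectorAssembler, rf))
  .fit(trainDF)
val aucTest = evaluator.evaluate(heldOutModel.transform(testDF), evaluatorParamMap)
println("areaUnderROC on held-out data: " + aucTest)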
CrossValidation and hyperparameter tuning As explained before, a common step in machine learning is cross-validating your model using testing data against training data and also tweaking the knobs of your machine learning algorithms. Let's use Apache SparkML in order to do this for us, fully automated! First, we have to configure the parameter map and CrossValidator: import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder} var paramGrid = new ParamGridBuilder() .addGrid(rf.numTrees, 3 :: 5 :: 10 :: 30 :: 50 :: 70 :: 100 :: 150 :: Nil) .addGrid(rf.featureSubsetStrategy, "auto" :: "all" :: "sqrt" :: "log2" :: "onethird" :: Nil) .addGrid(rf.impurity, "gini" :: "entropy" :: Nil) .addGrid(rf.maxBins, 2 :: 5 :: 10 :: 15 :: 20 :: 25 :: 30 :: Nil) .addGrid(rf.maxDepth, 3 :: 5 :: 10 :: 15 :: 20 :: 25 :: 30 :: Nil) .build() var crossValidator = new CrossValidator() .setEstimator(new Pipeline().setStages(transformers :+ rf)) .setEstimatorParamMaps(paramGrid) .setNumFolds(5) .setEvaluator(evaluator) var crossValidatorModel = crossValidator.fit(df_notnull) var newPredictions = crossValidatorModel.transform(df_notnull) The org.apache.spark.ml.tuning.ParamGridBuilder is used in order to define the hyperparameter space where the CrossValidator has to search and finally, the org.apache.spark.ml.tuning.CrossValidator takes our Pipeline, the hyperparameter space of our RandomForest classifier, and the number of folds for the CrossValidation as parameters. Now, as usual, we just need to call fit and transform on the CrossValidator and it will basically run our Pipeline multiple times and return a model that performs the best. Do you know how many different models are trained? Well, we have five folds on CrossValidation and five-dimensional hyperparameter space cardinalities between two and eight, so let's do the math: 5 * 8 * 5 * 2 * 7 * 7 = 19600 times! Using the evaluator to assess the quality of the cross-validated and tuned model Now that we've optimized our Pipeline in a fully automatic fashion, let's see how our best model can be obtained: var bestPipelineModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel] var stages = bestPipelineModel.stages import org.apache.spark.ml.classification.RandomForestClassificationModel val rfStage = stages(stages.length-1).asInstanceOf[RandomForestClassificationModel] rfStage.getNumTrees rfStage.getFeatureSubsetStrategy rfStage.getImpurity rfStage.getMaxBins rfStage.getMaxDepth The crossValidatorModel.bestModel code basically returns the best Pipeline. Now we use bestPipelineModel.stages to obtain the individual stages and obtain the tuned RandomForestClassificationModel using stages(stages.length 1).asInstanceOf[RandomForestClassificationModel]. Note that stages.length-1 addresses the last stage in the Pipeline, which is our RandomForestClassifier. So now, we can basically run evaluator using the best model and see how it performs: You might have noticed that 0.5362224872557545 is less than 0.5424418446501833, as we've obtained before. So why is this the case? Actually, this time we used cross-validation, which means that the model is less likely to over fit and therefore the score is a bit lower. So let's take a look at the parameters of the best model: Note that we've limited the hyperparameter space, so numTrees, maxBins, and maxDepth have been limited to five, and bigger trees will most likely perform better. So feel free to play around with this code and add features, and also use a bigger hyperparameter space, say, bigger trees. 
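Once the CrossValidator has found the best model, it is usually worth persisting it so that scoring can happen later without repeating the thousands of training runs. A short sketch, with an illustrative path:
// Sketch: persist the winning PipelineModel and load it back for scoring
bestPipelineModel.write.overwrite().save("/tmp/bosch_best_pipeline")
val reloadedModel = PipelineModel.load("/tmp/bosch_best_pipeline")
val scored = reloadedModel.transform(df_notnull)
scored.select("label", "prediction").show(5)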
Finally, we've applied the concepts that we discussed on a real dataset from a Kaggle competition, which is a good starting point for your own machine learning project with Apache SparkML. If you found our post useful, do check out this book Mastering Apache Spark 2.x - Second Edition to know more about advanced analytics on your Big Data with latest Apache Spark 2.x.    

How SQL Server handles data under the hood

Sunith Shetty
27 Feb 2018
11 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Marek Chmel and Vladimír Mužný titled SQL Server 2017 Administrator's Guide. In this book, you will learn the skills needed to successfully create, design, and deploy databases using SQL Server 2017.[/box]

Today, we will explore how SQL Server handles data, as it is of utmost importance to understand what, when, and why data should be backed up.

Data structures and transaction logging

We can think of a database as a physical structure consisting of tables and indexes. However, this is just a human point of view. From SQL Server's perspective, a database is a set of precisely structured files, described by metadata that is itself saved in database structures. A conceptual picture of how every database works is very helpful when the database has to be backed up correctly.

How data is stored

Every database on SQL Server must have at least two files:

The primary data file, with the usual suffix mdf
The transaction log file, with the usual suffix ldf

For lots of databases, this minimal set of files is not enough. When the database contains large amounts of data, such as historical tables, or the database has heavy data contention, such as production tracking systems, it's good practice to design more data files. Another situation in which the basic set of files is not sufficient arises when documents or pictures are saved along with relational data. SQL Server is still able to store all of our data in the basic file set, but this can lead to performance bottlenecks and management issues. That's why we need to know all the possible storage types useful for different deployment scenarios. A complete structure of files is depicted in the following image:

Database

A relational database is defined as a complex data type consisting of tables with a given number of columns, where each column has its domain, which is actually a data type (such as an integer or a date), optionally complemented by some constraints. From SQL Server's perspective, the database is a record written in metadata containing the name of the database, the properties of the database, and the names and locations of all files or folders representing storage for the database. This is the same for user databases as well as for system databases. System databases are created automatically during SQL Server installation and are crucial for the correct running of SQL Server. There are five system databases.

Database master

The master database is crucial for the correct running of the SQL Server service. It stores data about logins, all databases and their files, instance configurations, linked servers, and so on. SQL Server finds this database at startup via two startup parameters, -d and -l, followed by the paths to its mdf and ldf files. These parameters are very important when the administrator wants to move the master database's files to a different location. Changing their values is possible in SQL Server Configuration Manager, in the SQL Server service Properties dialog, on the Startup Parameters tab.

Database msdb

The msdb database serves the SQL Server Agent service, Database Mail, and Service Broker. It stores job definitions, operators, and other objects needed for administration automation. This database also stores some logs, such as backup and restore events for each database. If this database is corrupted or missing, SQL Server Agent cannot start.
Database model

The model database can be understood as a template for every new database as it is created. During database creation (see the CREATE DATABASE statement on MSDN), files are created on the defined paths, and all objects, data, and properties of the model database are copied into the new database. This database must always exist on the instance, because when it's corrupted, the tempdb database cannot be created at instance startup!

Database tempdb

Even though the tempdb database seems to be a regular database like many others, it plays a very special role in every SQL Server instance. This database is used by SQL Server itself, as well as by developers, to save temporary data such as table variables or static cursors. As this database is intended for a short lifespan (temporary data only, which is kept during the execution of a stored procedure or until the session is disconnected), SQL Server clears this database, by truncating all data from it or by dropping and recreating it, every time it is started. As the tempdb database will never contain durable data, it has some special internal behavior, which is the reason why accessing data in this database is several times faster than accessing durable data in other databases. If this database is corrupted, restart SQL Server.

Database resourcedb

The resourcedb is the fifth in our enumeration and consists of definitions for all system objects of SQL Server, for example, sys.objects. This database is hidden and we don't need to care about it that much. It is not configurable and we don't use regular backup strategies for it. It is always placed in the installation path of SQL Server (in the binn directory) and is backed up within the filesystem backup. In case of an accident, it is recovered as a part of the filesystem as well.

Filegroup

A filegroup is an organizational metadata object containing one or more data files. A filegroup does not have its own representation in the filesystem; it's just a group of files. When any database is created, a filegroup called primary is always created, and this primary filegroup always contains the primary data file. Filegroups can be divided into the following:

Row storage filegroups: These filegroups can contain data files (mdf or ndf).
Filestream filegroups: This kind of filegroup contains not files but folders, to store binary data.
In-memory filegroup: Only one instance of this kind of filegroup can be created in a database. Internally, it is a special case of the filestream filegroup and is used by SQL Server to persist data from in-memory tables.

Every filegroup has three simple properties:

Name: This is a descriptive name of the filegroup. The name must fulfill the naming convention criteria.
Default: In a set of filegroups of the same type, one of these filegroups has this option set to on. This means that when a new table or index is created without explicitly specifying which filegroup it has to store data in, the default filegroup is used. By default, the primary filegroup is the default one.
Read-only: Every filegroup, except the primary filegroup, can be set to read-only. Let's say that a filegroup is created for last year's history. When data is moved from the current period to tables created in this historical filegroup, the filegroup can be set to read-only, and then it does not need to be backed up again and again.

It is a very good approach to divide the database into smaller parts: filegroups with more files.
It helps in distributing data across more physical storage and also makes the database more manageable; backups can be done part by part in shorter times, which fit better into a service window.

Data files

Every database must have at least one data file, called the primary data file. This file is always bound to the primary filegroup. It holds all the metadata of the database, such as structure descriptions (which can be seen through views such as sys.objects, sys.columns, and others), users, and so on. If the database does not have other data files (in the same or other filegroups), all user data is also stored in this file, but this approach is good enough only for smaller databases. Considering how the volume of data in a database grows over time, it is good practice to add more data files. These files are called secondary data files. Secondary data files are optional and contain user data only.

Both types of data files have the same internal structure. Every file is divided into small 8 KB parts called data pages. SQL Server maintains several types of data pages, such as data pages, index pages, index allocation map (IAM) pages to locate the data pages of tables or indexes, and global allocation map (GAM) and shared global allocation map (SGAM) pages to address objects in the database, and so on. Regardless of the type of a certain data page, SQL Server uses the data page as the smallest unit of I/O operations between hard disk and memory. Let's describe some common properties:

A data page never contains data of several objects
Data pages don't know about each other (and that's why SQL Server uses IAMs to allocate all pages of an object)
Data pages don't have any special physical ordering
A data row must always fit within a data page

These properties could seem useless, but when we know them, we can better optimize and manage our databases. Did you know that a data page is the smallest storage unit that can be restored from backup?

As a data page is quite a small storage unit, SQL Server groups data pages into bigger logical units called extents. An extent is a logical allocation unit containing eight contiguous data pages. When SQL Server requests data from disk, extents are read into memory. This is the reason why 64 KB NTFS clusters are recommended when formatting disk volumes for data files. Extents can be uniform or mixed. A uniform extent contains data pages belonging to one object only; a mixed extent, on the other hand, contains data pages of several objects.

Transaction log

When SQL Server processes any transaction, it works in a way called two-phase commit. When a client starts a transaction by sending a single DML request or by calling the BEGIN TRAN command, SQL Server requests data pages from disk into an area of memory called the buffer cache and makes the requested changes in these data pages in memory. When the DML request is fulfilled or the COMMIT command comes from the client, the first phase of the commit is finished, but the data pages in memory differ from their original versions in the data file on disk. Such a data page in memory is in a state called dirty. While a transaction runs, the transaction log file is used by SQL Server to keep a very detailed chronological description of every single action done during the transaction. This description is called write-ahead logging, WAL for short, and it is one of the oldest processes known on SQL Server.
The second phase of the commit usually does not depend on the client's request and is an internal process called checkpoint. Checkpoint is a periodic action that:

searches for dirty pages in the buffer cache,
saves the dirty pages to their original data file location,
marks these data pages as clean (or drops them out of memory to free memory space),
marks the transaction as checkpointed, or inactive, in the transaction log.

Write-ahead logging is needed by SQL Server during the recovery process. The recovery process is started on every database every time the SQL Server service starts. When the SQL Server service stops, some pages could remain in a dirty state and are lost from memory. This can lead to two possible situations:

The transaction is completely described in the transaction log, the new content of the data page is lost from memory, and the data pages are not changed in the data file
The transaction was not completed at the moment SQL Server stopped, so the transaction cannot be completely described in the transaction log either, the data pages in memory were not in a stable state (because the transaction was not finished and SQL Server cannot know whether COMMIT or ROLLBACK will occur), and the original version of the data pages in the data files is intact

SQL Server decides between these two situations when it starts. If a transaction is complete in the transaction log but was not marked as checkpointed, SQL Server executes the transaction again with both phases of the commit. If the transaction was not complete in the transaction log when SQL Server stopped, SQL Server will never know what the user's intention with the transaction was, and the incomplete transaction is erased from the transaction log as if it had never started. The recovery process described here ensures that every database is in the last known consistent state after SQL Server's startup.

It's crucial for DBAs to understand write-ahead logging when planning a backup strategy, because when restoring the database, the administrator has to recognize whether it's time to run the recovery process or not.

To summarize, we introduced internal data handling, as it is important not only for performing backups and restores but also for optimizing a database. If you are interested in knowing more about how to back up, recover, and secure SQL Server, do check out this book SQL Server 2017 Administrator's Guide.

Implementing Apache Spark MLlib Naive Bayes to classify digital breath test data for drunk driving

Savia Lobo
27 Feb 2018
13 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering Apache Spark 2.x - Second Edition written by Romeo Kienzler. In this book, you will understand how memory management, binary processing, cache-aware computation, and code generation are used to speed things up dramatically.[/box]

This article provides a working example of the Apache Spark MLlib Naive Bayes algorithm on the Road Safety - Digital Breath Test Data 2013. It will describe the theory behind the algorithm and provide a step-by-step example in Scala to show how the algorithm may be used.

Theory on Classification

In order to use the Naive Bayes algorithm to classify a dataset, the data must be linearly divisible; that is, the classes within the data must be linearly divisible by class boundaries. The following figure visually explains this with three datasets and two class boundaries shown via the dotted lines:

Naive Bayes assumes that the features (or dimensions) within a dataset are independent of one another; that is, they have no effect on each other. The following example considers the classification of e-mails as spam. If you have 100 e-mails, then consider the following:

60% of emails are spam
80% of spam emails contain the word buy
20% of spam emails don't contain the word buy
40% of emails are not spam
10% of non spam emails contain the word buy
90% of non spam emails don't contain the word buy

Let's convert this example into conditional probabilities so that a Naive Bayes classifier can pick it up:

P(Spam) = the probability that an email is spam = 0.6
P(Not Spam) = the probability that an email is not spam = 0.4
P(Buy|Spam) = the probability that an email that is spam has the word buy = 0.8
P(Buy|Not Spam) = the probability that an email that is not spam has the word buy = 0.1

What is the probability that an e-mail that contains the word buy is spam? Well, this would be written as P(Spam|Buy). Naive Bayes says that it is described by the following equation:

P(Spam|Buy) = ( P(Buy|Spam) * P(Spam) ) / ( P(Buy|Spam) * P(Spam) + P(Buy|Not Spam) * P(Not Spam) )

So, using the previous percentage figures, we get the following:

P(Spam|Buy) = ( 0.8 * 0.6 ) / ( ( 0.8 * 0.6 ) + ( 0.1 * 0.4 ) ) = ( .48 ) / ( .48 + .04 ) = .48 / .52 = .923

This means that an e-mail that contains the word buy is 92 percent likely to be spam. That was a look at the theory; now it's time to try a real-world example using the Apache Spark MLlib Naive Bayes algorithm.

Naive Bayes in practice

The first step is to choose some data that will be used for classification. We have chosen some data from the UK Government data website at http://data.gov.uk/dataset/road-accidents-safety-data. The dataset is called Road Safety - Digital Breath Test Data 2013, which downloads as a zipped text file called DigitalBreathTestData2013.txt. This file contains around half a million rows. The data looks as follows:

Reason,Month,Year,WeekType,TimeBand,BreathAlcohol,AgeBand,Gender
Suspicion of Alcohol,Jan,2013,Weekday,12am-4am,75,30-39,Male
Moving Traffic Violation,Jan,2013,Weekday,12am-4am,0,20-24,Male
Road Traffic Collision,Jan,2013,Weekend,12pm-4pm,0,20-24,Female

In order to classify the data, we have modified both the column layout and the number of columns. We have simply used Excel, given the data volume. However, if our data size had been in the big data range, we would have had to run some Scala code on top of Apache Spark for ETL (Extract Transform Load).
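Such a Spark-based ETL step isn't needed for this small example, but as a rough sketch of what it could look like, the snippet below reorders and trims the raw columns. The input and output paths, the use of local[*], and the object name are assumptions for illustration only, not part of the original example:

import org.apache.spark.{SparkConf, SparkContext}

// A hypothetical ETL job: reorder the raw columns into
// Gender,Reason,WeekType,TimeBand,BreathAlcohol,AgeBand and drop Month/Year,
// mirroring the layout of the modified CSV used in the rest of this example.
object etlReorder extends App {

  val conf = new SparkConf().setAppName("ETL reorder").setMaster("local[*]")
  val sc   = new SparkContext(conf)

  // Assumed input location of the raw download.
  val raw = sc.textFile("hdfs://localhost:8020/data/spark/nbayes/DigitalBreathTestData2013.txt")

  // Raw column order: Reason,Month,Year,WeekType,TimeBand,BreathAlcohol,AgeBand,Gender
  val header = raw.first()

  val reordered = raw.filter(_ != header).map { line =>
    val c = line.split(',')
    Seq(c(7), c(0), c(3), c(4), c(5), c(6)).mkString(",")
  }

  // Assumed output location.
  reordered.saveAsTextFile("hdfs://localhost:8020/data/spark/nbayes/etl-output")
}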
As the following commands show, the data now resides in HDFS in the directory named /data/spark/nbayes. The file is called DigitalBreathTestData2013-MALE2.csv. The line count from the Linux wc command shows that there are around 467,000 rows. Finally, the following data sample shows that we have selected the columns Gender, Reason, WeekType, TimeBand, BreathAlcohol, and AgeBand to classify. We will try to classify on the Gender column using the other columns as features:

[hadoop@hc2nn ~]$ hdfs dfs -cat /data/spark/nbayes/DigitalBreathTestData2013-MALE2.csv | wc -l
467054
[hadoop@hc2nn ~]$ hdfs dfs -cat /data/spark/nbayes/DigitalBreathTestData2013-MALE2.csv | head -5
Male,Suspicion of Alcohol,Weekday,12am-4am,75,30-39
Male,Moving Traffic Violation,Weekday,12am-4am,0,20-24
Male,Suspicion of Alcohol,Weekend,4am-8am,12,40-49
Male,Suspicion of Alcohol,Weekday,12am-4am,0,50-59
Female,Road Traffic Collision,Weekend,12pm-4pm,0,20-24

The Apache Spark MLlib classification function uses a data structure called LabeledPoint, which is a general purpose data representation defined at http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint and https://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point. This structure only accepts double values, which means that the text values in the previous data need to be converted into numeric values. Luckily, all of the columns in the data will convert to numeric categories, and we have provided a program in the software package with this book, under the chapter2naive bayes directory, to do just that. It is called convert.scala. It takes the contents of the DigitalBreathTestData2013-MALE2.csv file and converts each record into a double vector.

The directory structure and files for an sbt Scala-based development environment have already been described earlier. We are developing our Scala code on the Linux server using the Linux account hadoop. Next, the Linux pwd and ls commands show our top-level nbayes development directory with the bayes.sbt configuration file, whose contents have already been examined:

[hadoop@hc2nn nbayes]$ pwd
/home/hadoop/spark/nbayes
[hadoop@hc2nn nbayes]$ ls
bayes.sbt  target  project  src

The Scala code to run the Naive Bayes example is in the src/main/scala subdirectory under the nbayes directory:

[hadoop@hc2nn scala]$ pwd
/home/hadoop/spark/nbayes/src/main/scala
[hadoop@hc2nn scala]$ ls
bayes1.scala  convert.scala

We will examine the bayes1.scala file later, but first, the text-based data on HDFS must be converted into numeric double values. This is where the convert.scala file is used. The code is as follows:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

These lines import classes for the Spark context, the connection to the Apache Spark cluster, and the Spark configuration. The object that is being created is called convert1. It is an application, as it extends the App class:

object convert1 extends App {

The next line creates a function called enumerateCsvRecord. It has a parameter called colData, which is an array of Strings, and it returns a String:

def enumerateCsvRecord( colData:Array[String]): String = {

The function then enumerates the text values in each column so that, for instance, Male becomes 0.
These numeric values are stored in values such as colVal1:

val colVal1 = colData(0) match {
  case "Male"    => 0
  case "Female"  => 1
  case "Unknown" => 2
  case _         => 99
}

val colVal2 = colData(1) match {
  case "Moving Traffic Violation" => 0
  case "Other"                    => 1
  case "Road Traffic Collision"   => 2
  case "Suspicion of Alcohol"     => 3
  case _                          => 99
}

val colVal3 = colData(2) match {
  case "Weekday" => 0
  case "Weekend" => 0
  case _         => 99
}

val colVal4 = colData(3) match {
  case "12am-4am" => 0
  case "4am-8am"  => 1
  case "8am-12pm" => 2
  case "12pm-4pm" => 3
  case "4pm-8pm"  => 4
  case "8pm-12pm" => 5
  case _          => 99
}

val colVal5 = colData(4)

val colVal6 = colData(5) match {
  case "16-19" => 0
  case "20-24" => 1
  case "25-29" => 2
  case "30-39" => 3
  case "40-49" => 4
  case "50-59" => 5
  case "60-69" => 6
  case "70-98" => 7
  case "Other" => 8
  case _       => 99
}

A comma-separated string called lineString is created from the numeric column values and is then returned, and the function closes with the final brace character. Note that the data line created next starts with a label value in column one, followed by a vector that represents the data. The vector is space-separated, while the label is separated from the vector by a comma. Using these two separator types allows us to process both the label and the vector in two simple steps:

val lineString = colVal1+","+colVal2+" "+colVal3+" "+colVal4+" "+colVal5+" "+colVal6
return lineString
}

The main script defines the HDFS server name and path. It defines the input file and the output path in terms of these values. It uses the Spark URL and application name to create a new configuration.
It then creates a new context, or connection, to Spark using these details:

val hdfsServer = "hdfs://localhost:8020"
val hdfsPath   = "/data/spark/nbayes/"
val inDataFile  = hdfsServer + hdfsPath + "DigitalBreathTestData2013-MALE2.csv"
val outDataFile = hdfsServer + hdfsPath + "result"

val sparkMaster = "spark://localhost:7077"
val appName = "Convert 1"

val sparkConf = new SparkConf()
sparkConf.setMaster(sparkMaster)
sparkConf.setAppName(appName)

val sparkCxt = new SparkContext(sparkConf)

The CSV-based raw data file is loaded from HDFS using the Spark context textFile method. Then, a data row count is printed:

val csvData = sparkCxt.textFile(inDataFile)
println("Records in : "+ csvData.count() )

The CSV raw data is passed line by line to the enumerateCsvRecord function. The returned string-based numeric data is stored in the enumRddData variable:

val enumRddData = csvData.map {
  csvLine =>
    val colData = csvLine.split(',')
    enumerateCsvRecord(colData)
}

Finally, the number of records in the enumRddData variable is printed, and the enumerated data is saved to HDFS:

println("Records out : "+ enumRddData.count() )
enumRddData.saveAsTextFile(outDataFile)

} // end object

In order to run this script as an application against Spark, it must be compiled. This is carried out with the sbt package command, which also compiles the code. The following command is run from the nbayes directory:

[hadoop@hc2nn nbayes]$ sbt package
Loading /usr/share/sbt/bin/sbt-launch-lib.bash
....
[info] Done packaging.
[success] Total time: 37 s, completed Feb 19, 2015 1:23:55 PM

This causes the compiled classes that are created to be packaged into a JAR library, as shown here:

[hadoop@hc2nn nbayes]$ pwd
/home/hadoop/spark/nbayes
[hadoop@hc2nn nbayes]$ ls -l target/scala-2.10
total 24
drwxrwxr-x 2 hadoop hadoop  4096 Feb 19 13:23 classes
-rw-rw-r-- 1 hadoop hadoop 17609 Feb 19 13:23 naive-bayes_2.10-1.0.jar

The convert1 application can now be run against Spark using the application name, the Spark URL, and the full path to the JAR file that was created. Some extra parameters specify memory and the maximum number of cores that are supposed to be used:

spark-submit --class convert1 --master spark://localhost:7077 --executor-memory 700M --total-executor-cores 100 /home/hadoop/spark/nbayes/target/scala-2.10/naive-bayes_2.10-1.0.jar

This creates a directory on HDFS under /data/spark/nbayes/ called result, which contains part files with the processed data:

[hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/nbayes
Found 2 items
-rw-r--r--   3 hadoop supergroup 24645166 2015-01-29 21:27 /data/spark/nbayes/DigitalBreathTestData2013-MALE2.csv
drwxr-xr-x   - hadoop supergroup        0 2015-02-19 13:36 /data/spark/nbayes/result
[hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/nbayes/result
Found 3 items
-rw-r--r--   3 hadoop supergroup       0 2015-02-19 13:36 /data/spark/nbayes/result/_SUCCESS
-rw-r--r--   3 hadoop supergroup 2828727 2015-02-19 13:36 /data/spark/nbayes/result/part-00000
-rw-r--r--   3 hadoop supergroup 2865499 2015-02-19 13:36 /data/spark/nbayes/result/part-00001

In the following HDFS cat command, we concatenate the part file data into a file called DigitalBreathTestData2013-MALE2a.csv. We then examine the top five lines of the file using the head command to show that it is numeric.
Finally, we load it into HDFS with the put command:

[hadoop@hc2nn nbayes]$ hdfs dfs -cat /data/spark/nbayes/result/part* > ./DigitalBreathTestData2013-MALE2a.csv
[hadoop@hc2nn nbayes]$ head -5 DigitalBreathTestData2013-MALE2a.csv
0,3 0 0 75 3
0,0 0 0 0 1
0,3 0 1 12 4
0,3 0 0 0 5
1,2 0 3 0 1
[hadoop@hc2nn nbayes]$ hdfs dfs -put ./DigitalBreathTestData2013-MALE2a.csv /data/spark/nbayes

The following HDFS ls command now shows the numeric data file stored on HDFS in the nbayes directory:

[hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/nbayes
Found 3 items
-rw-r--r--   3 hadoop supergroup 24645166 2015-01-29 21:27 /data/spark/nbayes/DigitalBreathTestData2013-MALE2.csv
-rw-r--r--   3 hadoop supergroup  5694226 2015-02-19 13:39 /data/spark/nbayes/DigitalBreathTestData2013-MALE2a.csv
drwxr-xr-x   - hadoop supergroup        0 2015-02-19 13:36 /data/spark/nbayes/result

Now that the data has been converted into a numeric form, it can be processed with the MLlib Naive Bayes algorithm; this is what the Scala file, bayes1.scala, does. This file imports the same configuration and context classes as before. It also imports MLlib classes for Naive Bayes, vectors, and the LabeledPoint structure. The application class that is created this time is called bayes1:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object bayes1 extends App {

The HDFS data file is again defined, and a Spark context is created as before:

val hdfsServer = "hdfs://localhost:8020"
val hdfsPath   = "/data/spark/nbayes/"
val dataFile   = hdfsServer+hdfsPath+"DigitalBreathTestData2013-MALE2a.csv"

val sparkMaster = "spark://localhost:7077"
val appName = "Naive Bayes 1"

val conf = new SparkConf()
conf.setMaster(sparkMaster)
conf.setAppName(appName)

val sparkCxt = new SparkContext(conf)

The raw CSV data is loaded and split by the separator characters. The first column becomes the label (Male/Female) that the data will be classified on. The final columns, separated by spaces, become the classification features:

val csvData = sparkCxt.textFile(dataFile)

val ArrayData = csvData.map {
  csvLine =>
    val colData = csvLine.split(',')
    LabeledPoint(colData(0).toDouble,
                 Vectors.dense(colData(1)
                   .split(' ')
                   .map(_.toDouble)
                 )
    )
}

The data is then randomly divided into training (70%) and testing (30%) datasets:

val divData = ArrayData.randomSplit(Array(0.7, 0.3), seed = 13L)

val trainDataSet = divData(0)
val testDataSet  = divData(1)

The Naive Bayes MLlib function can now be trained using the previous training set. The trained Naive Bayes model, held in the nbTrained variable, can then be used to predict the Male/Female result labels against the testing data:

val nbTrained = NaiveBayes.train(trainDataSet)
val nbPredict = nbTrained.predict(testDataSet.map(_.features))

Given that all of the data already contained labels, the original and predicted labels for the test data can be compared. An accuracy figure can then be computed to determine how accurate the predictions were, by comparing the original labels with the prediction values:

val predictionAndLabel = nbPredict.zip(testDataSet.map(_.label))
val accuracy = 100.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / testDataSet.count()
println( "Accuracy : " + accuracy );

So, this explains the Scala Naive Bayes code example.
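As an illustrative addition that is not part of the original example, the predictionAndLabel RDD built above can also be fed to MLlib's MulticlassMetrics to look beyond a single accuracy figure, for example at the confusion matrix and per-label precision and recall:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// predictionAndLabel is the RDD[(Double, Double)] of (prediction, label)
// pairs created in bayes1.scala above.
val metrics = new MulticlassMetrics(predictionAndLabel)

// Rows are actual labels, columns are predicted labels (0.0 = Male, 1.0 = Female).
println(metrics.confusionMatrix)

metrics.labels.foreach { label =>
  println(s"label $label precision = ${metrics.precision(label)}, recall = ${metrics.recall(label)}")
}

This makes it easier to see whether a low overall accuracy comes from one class dominating the predictions.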
It's now time to run the compiled bayes1 application using spark-submit and determine the classification accuracy. The parameters are the same; it's just the class name that has changed:

spark-submit --class bayes1 --master spark://hc2nn.semtech-solutions.co.nz:7077 --executor-memory 700M --total-executor-cores 100 /home/hadoop/spark/nbayes/target/scala-2.10/naive-bayes_2.10-1.0.jar

The resulting accuracy given by the Spark cluster is just 43 percent, which seems to imply that this data is not suitable for Naive Bayes:

Accuracy: 43.30

We have seen how, with the help of Apache Spark MLlib, one can perform classification using the Naive Bayes algorithm. If you found this post useful, do check out this book Mastering Apache Spark 2.x - Second Edition to learn about the latest enhancements to Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.