
Tech Guides - Data

281 Articles

Containerized Data Science with Docker

Darwin Corn
03 Jul 2016
4 min read
So, you're itching to begin your journey into data science but you aren't sure where to start. Well, I'm glad you've found this post, because I'm going to walk through, step by step, how I circumvented the unnecessarily large technological barrier to entry and got my feet wet, so to speak.

Containerization in general, and Docker in particular, have taken the IT world by storm in the last couple of years by making LXC containers more than just VM alternatives for the enterprising sysadmin. Even if you're coming at this post from a world devoid of IT, the odds are good that you've heard of Docker and its cute whale mascot. Of course, now that Microsoft is on board the containerization bandwagon and a consortium of bickering stakeholders has formed, you know that container tech is here to stay. I know, FreeBSD has had the concept of 'jails' for almost two decades now. But thanks to Docker, container tech is now usable across the big three of Linux, Windows and Mac (if a bit hack-y in the case of the latter two), and today we're going to put it to work in an exploration of the world of data science.

Now that I have your interest piqued, you're wondering where the two intersect. Well, if you're like me, you've looked at the footprint of RStudio and the nightmare maze of dependencies of IPython and "noped" right out of there. Thanks to containers, these problems are solved! With Docker, you can limit the amount of memory available to the container, and the way containers are constructed ensures that you never have to deal with troubleshooting broken dependencies on update ever again.

So let's install Docker, which is as straightforward as using your package manager on Linux, or downloading Docker Toolbox and running the installer if you're using a Mac or Windows PC. The instructions that follow are tailored to a Linux installation, but are easily adapted to Windows or Mac as well. On those two platforms, you can even bypass these CLI commands and use Kitematic, or so I hear.

Now that you have Docker installed, let's look at some use cases for how to use it to facilitate our journey into data science. First, we are going to pull the Jupyter Notebook container so that you can work with that language-agnostic tool.

# docker run --rm -it -p 8888:8888 -v "$(pwd):/notebooks" jupyter/notebook

The -v "$(pwd):/notebooks" flag mounts the current directory to the /notebooks directory in the container, allowing you to save your work outside the container. This is important because you'll be using the container as a temporary working environment. The --rm flag ensures that the container is destroyed when it exits; if you rerun the command to get back to work after turning off your computer, for instance, the container will be replaced with an entirely new one. The -v flag gives that new container access to the same folder on the local filesystem, ensuring that your work survives the casually disposable nature of development containers.

Now go ahead and navigate to http://localhost:8888, and let's get to work. You did bring a dataset to analyze in a notebook, right? The actual nuts and bolts of data science are beyond the scope of this post, but for a quick intro to data and learning materials, I've found Kaggle to be a great resource.
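To make the interplay between --rm and -v concrete, here is an illustrative session; the notebook file name is hypothetical, and the docker command is the same one shown above:

$ ls
# (an empty project directory on the host)
$ docker run --rm -it -p 8888:8888 -v "$(pwd):/notebooks" jupyter/notebook
# ... work in the browser at http://localhost:8888, save a notebook, then stop the container ...
$ ls
analysis.ipynb
# The container itself is gone (--rm), but the notebook persists on the host (-v),
# so rerunning the same command picks up right where you left off.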
While we're at it, you should look at that other issue I mentioned previously: that of the application footprint. Recently a friend of mine convinced me to use R, and I was enjoying working with the language until I got my hands on some real data and immediately felt the pain of an application not designed for endpoint use. I ran a regression and it locked up my computer for minutes! Fortunately, you can use a container to isolate it and feed it only limited resources to keep the rest of the computer happy.

# docker run -m 1g -ti --rm r-base

This command drops you into an interactive R CLI that should keep even the leanest of modern computers humming along without a hiccup. Of course, you can also use the -c and --blkio-weight flags to restrict access to CPU and disk I/O resources respectively, if limiting the container to a gigabyte of RAM wasn't enough.

So, a program installation and a command or two (or a couple of clicks in the Kitematic GUI), and we're off and running, doing data science with none of the typical headaches.

About the Author

Darwin Corn is a systems analyst for the Consumer Direct Care Network. He is a mid-level professional with diverse experience in the information technology world.


TensorFlow: Next-Gen Machine Learning

Ariel Scarpinelli
01 Jun 2016
7 min read
Last November, Google open sourced its shiny Machine Intelligence package, promising a simpler way to develop deep learning algorithms that can be deployed anywhere, from your phone to a big cluster, without hassle. It can even take advantage of GPUs for better performance.

Let's Give It a Shot!

First things first, let's install it:

# Ubuntu/Linux 64-bit, CPU only (the GPU-enabled version requires more deps):
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl

# Mac OS X, CPU only:
$ sudo easy_install --upgrade six
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp27-none-any.whl

We are going to play with the well-known iris dataset, where we will train a neural network to take the dimensions of the sepals and petals of an iris plant and classify it as one of three types of iris: Iris setosa, Iris versicolour, or Iris virginica. You can download the training CSV dataset from here.

Reading the Training Data

Because TensorFlow is prepared for cluster-sized data, it allows you to define an input by feeding it a queue of filenames to process (think of MapReduce output shards). In our simple case, we are going to just hardcode the path to our only file:

import tensorflow as tf

def inputs():
    filename_queue = tf.train.string_input_producer(["iris.data"])

We then need to set up the Reader, which will work with the file contents. In our case, it's a TextLineReader that will produce a tensor for each line of text in the dataset:

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

Then we are going to parse each line into the feature tensor of each sample in the dataset, specifying the data types (in our case, they are all floats except the iris class, which is a string).

# decode_csv will convert a Tensor from type string (the text line) into
# a tuple of tensor columns with the specified defaults, which also
# sets the data type for each column
sepal_length, sepal_width, petal_length, petal_width, label = tf.decode_csv(
    value, record_defaults=[[0.0], [0.0], [0.0], [0.0], [""]])

# we could work with each column separately if we wanted to, but here we
# simply want to process a single feature vector containing all the
# data for each sample
features = tf.pack([sepal_length, sepal_width, petal_length, petal_width])

Finally, the samples in our data file are actually sorted by iris type. This would lead to bad performance of the model and make it inconvenient for splitting between training and evaluation sets, so we are going to shuffle the data before returning it, using a tensor queue designed for that purpose. All the buffering parameters can be set to 1500 because that is the exact number of samples in the data, so it will be stored completely in memory. The batch size also sets the number of rows we pack into a single tensor for applying operations in parallel:

return tf.train.shuffle_batch([features, label], batch_size=100,
                              capacity=1500, min_after_dequeue=100)
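For reference, here is the inputs() function pieced together from the snippets above into a single block. This is just the article's own code assembled in one place, against the TensorFlow 0.7-era API it uses (tf.pack, string_input_producer and friends):

import tensorflow as tf

def inputs():
    # queue of input files to read; we only have one
    filename_queue = tf.train.string_input_producer(["iris.data"])

    # read the file one CSV line at a time
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)

    # parse each line: four float features plus the string class label
    sepal_length, sepal_width, petal_length, petal_width, label = tf.decode_csv(
        value, record_defaults=[[0.0], [0.0], [0.0], [0.0], [""]])

    # pack the four columns into a single feature vector per sample
    features = tf.pack([sepal_length, sepal_width, petal_length, petal_width])

    # shuffle, since the file is sorted by class, and batch 100 rows at a time
    return tf.train.shuffle_batch([features, label], batch_size=100,
                                  capacity=1500, min_after_dequeue=100)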
Converting the Data

Our label field in the training dataset is a string that holds the three possible values of the iris class. To make it friendly to the neural network output, we need to convert this data to a three-column vector, one for each class, where the value should be 1 (100% probability) when the sample belongs to that class. This is a typical transformation you may need to apply to input data.

def string_label_as_probability_tensor(label):
    is_setosa = tf.equal(label, ["Iris-setosa"])
    is_versicolor = tf.equal(label, ["Iris-versicolor"])
    is_virginica = tf.equal(label, ["Iris-virginica"])
    return tf.to_float(tf.pack([is_setosa, is_versicolor, is_virginica]))

The Inference Model (Where the Magic Happens)

We are going to use a single-layer network with a softmax activation function. The variables (the learned parameters of our model) will only be the matrix of weights applied to the different features of each input sample, plus a bias.

# model: inferred_label = softmax(Wx + b)
# where x is the features vector of each data example
W = tf.Variable(tf.zeros([4, 3]))
b = tf.Variable(tf.zeros([3]))

def inference(features):
    # we need x as a single-row matrix for the multiplication
    x = tf.reshape(features, [1, 4])
    inferred_label = tf.nn.softmax(tf.matmul(x, W) + b)
    return inferred_label

Notice that we left the model parameters as variables outside of the scope of the function. That is because we want to use those same variables both while training and when evaluating and using the model.

Training the Model

We train the model using backpropagation, trying to minimize cross entropy, which is the usual way to train a softmax network. At a high level, this means that for each data sample we compare the output of the inference with the real value and calculate the error (how far off we are). We then use the error value to adjust the learning parameters in a way that minimizes that error. We also have to set the learning factor, which determines, for each sample, how much of the computed error is applied to correct the parameters. There has to be a balance between the learning factor, the number of learning loop cycles, and the number of samples we pack together in the same tensor as a batch: the bigger the batch, the smaller the factor and the higher the number of cycles.

def train(features, tensor_label):
    inferred_label = inference(features)
    cross_entropy = -tf.reduce_sum(tensor_label * tf.log(inferred_label))
    train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)
    return train_step

Evaluating the Model

We are going to evaluate our model using accuracy, which is the ratio of cases where our network identifies the right iris class over the total number of evaluation samples.

def evaluate(evaluation_features, evaluation_labels):
    inferred_label = inference(evaluation_features)
    correct_prediction = tf.equal(tf.argmax(inferred_label, 1),
                                  tf.argmax(evaluation_labels, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    return accuracy

Running the Model

All that is left is to connect our graph and run it in a session, where the defined operations actually process the data. We also split our input data between training and evaluation, roughly 70%:30%, and run a training loop with it 1,000 times.

features, label = inputs()
tensor_label = string_label_as_probability_tensor(label)
train_step = train(features[0:69, 0:4], tensor_label[0:69, 0:3])
evaluate_step = evaluate(features[70:99, 0:4], tensor_label[70:99, 0:3])

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())

    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(1000):
        sess.run(train_step)

    print sess.run(evaluate_step)

    # should print 0 => setosa
    print sess.run(tf.argmax(inference([[5.0, 3.6, 1.4, 0.2]]), 1))
    # should print 1 => versicolor
    print sess.run(tf.argmax(inference([[5.5, 2.4, 3.8, 1.1]]), 1))
    # should print 2 => virginica
    print sess.run(tf.argmax(inference([[6.9, 3.1, 5.1, 2.3]]), 1))

    coord.request_stop()
    coord.join(threads)
    sess.close()

If you run this, it should print an accuracy value close to 1. This means our network correctly classifies the samples in almost 100% of the cases, and it also gives the right answers for the manual samples we feed to the model.

Conclusion

Our example was very simple, but TensorFlow actually allows you to do much more complicated things with similar ease, such as working with voice recognition and computer vision. It may not look much different from using any other deep learning or math package, but the key is the ability to run the expressed model in parallel. Google is trying to create a mainstream DSL for expressing data algorithms focused on machine learning, and it may well succeed in doing so. For instance, although Google has not yet open sourced the distributed version of the engine, a tool capable of running TensorFlow-modeled graphs directly over an Apache Spark cluster was just presented at the Spark Summit, which shows that the community is interested in expanding its usage.

About the author

Ariel Scarpinelli is a senior Java developer at VirtualMind and is a passionate developer with more than 15 years of professional experience. He can be found on Twitter as @triforcexp.


Picking up TensorFlow can now pay dividends sooner

Sam Abrahams
23 May 2016
9 min read
It's been nearly four months since TensorFlow, Google's computation graph machine learning library, was open sourced, and the momentum from its launch is still going strong. In that time, both Microsoft and Baidu have released their own deep-learning libraries (CNTK and warp-ctc, respectively), and the machine learning arms race has escalated even further with Yahoo open sourcing CaffeOnSpark. Google hasn't been idle, however, and with the recent releases of TensorFlow Serving and the long-awaited distributed runtime, now is the time for businesses and individual data scientists to ask: is it time to commit to TensorFlow?

TensorFlow's most appealing features

There are a lot of machine learning libraries available today, so what makes TensorFlow stand out in this crowded space?

1. Flexibility without headaches

TensorFlow heavily borrows concepts from the more tenured machine learning library Theano. Many models written for research papers were built in Theano, and its composable, node-by-node writing style translates well when implementing a model whose graph was drawn by hand first. TensorFlow's API is extremely similar. Both Theano and TensorFlow feature a Python API for defining the computation graph, which then hooks into high-performance C/C++ implementations of mathematical operations. Both are able to automatically differentiate their graphs with respect to their inputs, which facilitates learning on complicated neural network structures, and both integrate tightly with NumPy for defining tensors (n-dimensional arrays). However, one of the biggest advantages TensorFlow currently has over Theano (at least when comparing features both libraries have) is its compile time. As of the time of writing, Theano's compile times can be quite lengthy, and although there are options to speed up compilation for experimentation, they come at the cost of a slower output model. TensorFlow's compilation is much faster, which leads to fewer headaches when trying out slightly different versions of models.

2. It's backed by Google (and the OSS community)

At first, this may sound more like brand recognition than a tangible advantage, but when I say it's 'backed' by Google, what I mean is that Google is seriously pouring tons of resources into making TensorFlow an awesome tool. There is an entire team at Google dedicated to maintaining and improving the software steadily and visibly, while simultaneously running a clinic on how to properly interact with and engage the open source community. Google has proved itself willing to adopt quality submissions from the community as well as flexible enough to adapt to public demands (such as moving the master contribution repository from Google's self-hosted Gerrit server to GitHub). These actions, combined with genuinely constructive feedback from Google's team on pull requests and issues, have helped make the community feel like this is a project worth supporting. The result? A continuous stream of little improvements and ideas from the community while the core Google team works on releasing larger features. Not only does TensorFlow receive the benefits of a larger contributor base because of this, it is also more likely to withstand user decay as more people invest time in making TensorFlow their own.

3. Easy visualizations and debugging with TensorBoard

TensorBoard was the shiny toy that shipped on release with the first open source version of TensorFlow, but it's much more than eye candy.
Not only can you use it as a guide to ensure that what you've coded matches your reference model, but you can also keep track of data flowing through your model. This is especially useful when debugging subsections of your graph, as you can go in and see where any hiccups may have occurred.

4. TensorFlow Serving cuts the development-deployment cycle by nearly half

The typical life cycle of machine learning models in the business world generally looks like this:

1. Research and develop a model that is more accurate/faster/more descriptive than the previous model
2. Write down the exact specifications of the finalized model
3. Recreate the model in C++/C/Java/some other fast, compiled language
4. Push the new model into deployment, replacing the old model
5. Repeat

On release, TensorFlow promised to "connect research and production." However, the community had to wait until just recently for that promise to come to fruition with TensorFlow Serving. This software runs as a server that can natively serve models built in TensorFlow, which makes the new life cycle look like this:

1. Research and develop a new model
2. Hook the new model into TensorFlow Serving
3. Repeat

While there is overhead in learning how to use TensorFlow Serving, the process of hooking up new models stays the same, whereas rewriting new models in a different language is time consuming and difficult.

5. Distributed learning out of the box

The distributed runtime is one of the newest features to be pushed to the TensorFlow repository, but it has been, by far, the most eagerly anticipated aspect of TensorFlow. Without having to incorporate any other libraries or software packages, TensorFlow is able to run distributed learning tasks on heterogeneous hardware with various CPUs and GPUs. This feature is absolutely brand new (it came out in the middle of writing this post!), so do your research on how to use it and how well it runs.

Areas to look for improvement

TensorFlow can't claim to be the best at everything, and there are several sticking points that should be addressed sooner rather than later. Luckily, Google has been making steady improvements to TensorFlow since it was released, and I would be surprised if most of these were not remedied within the next few months.

Runtime speed

Although the TensorFlow team promises deployment-worthy models from compiled TensorFlow code, at this time its single-machine training speed lags behind most other options. The team has made improvements in speed since its release, but there is still more work to be done. In-place operations, a more efficient node placement algorithm, and better compression techniques could help here. Distributed benchmarks are not available at this time; expect to see them after the next official TensorFlow release.

Pre-trained models

Libraries such as Caffe, Torch, and Theano have a good selection of pre-trained, state-of-the-art models implemented in their libraries. While Google did release a version of its Inception-v3 model in TensorFlow, it needs more options to provide a starting place for more types of problems.

Expanded distributed support

Yes, TensorFlow did push code for its distributed runtime, but it still needs better documentation as well as more examples. I'm incredibly excited that it's available to try out right now, but it's going to take some time for most people to put it into production.

Interested in getting up and running with TensorFlow? You'll need a primer on Python.
Luckily, our Python Fundamentals course in Mapt gives you an accessible yet comprehensive journey through Python - and this week it's completely free. Click here, login, then get stuck in...

The future

Most people want to use software that is going to last for more than a few months, so what does the future look like for TensorFlow? Here are my predictions about the medium-term future of the library.

Enterprise-level distributions

Just as Hadoop has commercial distributions of its software, I expect to see more and more companies offering supported suites that tie into TensorFlow. Whether they offer more pre-trained models built on top of Keras (which already supports a TensorFlow backend), or make TensorFlow work seamlessly with a distributed file system like Hadoop, I foresee a lot of demand for enterprise features and support with TensorFlow.

TensorFlow's speed will catch up (and most users won't need it)

As mentioned earlier, TensorFlow still lags behind many other libraries out there. However, with the improvements already made, it's clear that Google is determined to make TensorFlow as efficient as possible. That said, I believe most applications of TensorFlow won't desperately need the speed increase. Of course, it's nice to have your models run faster, but most businesses out there don't have petabytes of useful data to work with, which means that model training usually doesn't take the "weeks" that we often see claimed as training time.

TensorFlow is going to get easier, not more difficult, over time

While there are definitely going to be many new features in upcoming releases of TensorFlow, I expect to see the learning curve of the software go down as more resources, such as tutorials, examples, and books, are made available. The documentation's terminology has already changed in places to be more understandable, and navigation within the documentation should improve over time. Finally, while most of the latest features in TensorFlow don't have the friendliest APIs right now, I'd be shocked if more user-friendly versions of TensorFlow Serving and the distributed runtime weren't in the works right now.

Should I use TensorFlow?

TensorFlow appears primed to fulfil the promise that was made back in November: a distributed, flexible data flow graph library that excels at neural network composition. I leave it to you decision makers to figure out whether TensorFlow is the right move for your own machine learning tasks, but here is my overall impression: no other machine learning framework targeted at production-level tasks is as flexible or powerful as TensorFlow, nor improving as rapidly. While other frameworks may carry advantages over TensorFlow now, Google is putting in the effort to make consistent improvements, which bodes well for a community that is still in its infancy.

About the author

Sam Abrahams is a freelance data engineer and animator in Los Angeles, CA. He specializes in real-world applications of machine learning and is a contributor to TensorFlow. Sam runs a small tech blog, Memdump, and is an active member of the local hacker scene in West LA.


Bridging the gap between data science and DevOps with DataOps

Richard Gall
23 Mar 2016
5 min read
What's the real value of data science? Data science was hailed as the sexiest job of the 21st century just a few years ago, yet there are rumors that it's not quite proving its worth. Gianmario Spacagna, a data scientist for Barclays bank in London, told Computing magazine at Spark Summit Europe in October 2015 that, in many instances, there's not enough impact from data science teams - "It's not a playground. It's not academic," he said. His solution sounds simple. We need to build a bridge between data science and DevOps - and DataOps is perhaps the answer. He says:

"If you're a start-up, the smartest person you want to hire is your DevOps guy, not a data scientist. And you need engineers, machine learning specialists, mathematicians, statisticians, agile experts. You need to cover everything otherwise you have a very hard time to actually create proper applications that bring value."

This idea makes a lot of sense. It's become clear over the past few years that 'data' itself isn't enough; it might even be distracting for some organizations. Sometimes too much time is spent in spreadsheets and not enough time is spent actually doing stuff. Making decisions, building relationships, building things - that's where real value comes from. What Spacagna has identified is ultimately a strategic flaw in how data science is used in many organizations. There's often too much focus on what data we have and what we can get, rather than who can access it and what they can do with it. If data science isn't joining the dots, DevOps can help.

True, a large part of the problem is strategic, but DevOps engineers can also provide practical solutions by building dashboards and creating APIs. These sorts of things immediately give data additional value by making it more accessible and, put simply, more usable. Even in a modest, medium-sized business, data scientists and analysts will have minimal impact if they are not successfully integrated into the wider culture. While it's true that many organizations still struggle with this, Airbnb demonstrate how to do it incredibly effectively. Take a look at their Airbnb Engineering and Data Science publication on Medium. In this post, they talk about the importance of scaling knowledge effectively. Although they don't specifically refer to DevOps, it's clear that DevOps thinking has informed their approach. In the products they've built to scale knowledge, for example, the team demonstrate a very real concern for accessibility and efficiency. What they build is created so people can do exactly what they want and get what they need from data. It's a form of strict discipline that is underpinned by a desire for greater freedom.

If you keep reading Airbnb's publication, another aspect of 'DevOps thinking' emerges: a relentless focus on customer experience. By this, I don't simply mean that the work done by the Airbnb engineers is specifically informed by a desire to improve customer experiences; that's obvious. Instead, it's the sense that the tools through which internal collaboration and decision making take place should actually feel like a customer experience. They need to be elegant, engaging, and intuitive. This doesn't mean seeing every relationship as purely transactional, based on some perverse logic of self-interest, but rather having a deeper respect for how people interact and share ideas. If DevOps is an agile methodology that bridges the gap between development and operations, it can also help to bridge the gap between data and operations.
DataOps - bringing DevOps and data science together

This isn't a new idea. As much as I'd like to, I can't claim credit for inventing 'DataOps'. But there's not really much point in asserting that distinction. DataOps is simply another buzzword for the managerial class. And while some buzzwords have value, I'm not so sure that we need another one. More importantly, why create another gap between Data and Development? That gap doesn't make sense in the world we're building with software today. Even for web developers and designers, the products they are creating are so driven by data that separating the data from the dev is absurd.

Perhaps then, it's not enough to just ask more from our data science as Gianmario Spacagna does. DevOps offers a solution, but we're going to miss out on the bigger picture if we start asking for more DevOps engineers and some space for them to sit next to the data team. We also need to ask how data science can inform DevOps too. It's about opening up a dialogue between these different elements. While DevOps evangelists might argue that DevOps has already started that, the way forward is to push for more dialogue, more integration and more collaboration. As we look towards the future, with the API economy becoming more and more important to the success of both startups and huge corporations, the relationships between all these different areas are going to become more and more complex. If we want to build better and build smarter, we're going to have to talk more. DevOps and DataOps both offer us a good place to start the conversation, but it's important to remember it's just the start.


Packt Explains... Deep Learning in 90 seconds

Packt Publishing
01 Mar 2016
1 min read
If you've been looking into the world of Machine Learning lately you might have heard about a mysterious thing called “Deep Learning”. But just what is Deep Learning, and what does it mean for the world of Machine Learning as a whole? Take less than two minutes out of your day to find out and fully realize the awesome potential Deep Learning has with this video today.


Packt Explains... Deep Learning

Packt Publishing
29 Feb 2016
1 min read
If you've been looking into the world of Machine Learning lately you might have heard about a mysterious thing called “Deep Learning”. But just what is Deep Learning, and what does it mean for the world of Machine Learning as a whole? Take less than two minutes out of your day to find out and fully realize the awesome potential Deep Learning has with this video today. Plus, if you’re already in love with Deep Learning, or want to finally start your Deep Learning journey then be sure to pick up one of our recommendations below and get started right now.

This Year in Machine Learning

Owen Roberts
22 Jan 2016
5 min read
The world of data has really boomed in the last few years. When I first joined Packt, Hadoop was The Next Big Thing on the horizon, and what people are now doing with all the data available to us was unthinkable back then. Even in the first few weeks of 2016 we're already seeing machine learning being used in ways we probably wouldn't have thought about even a few years ago - from discovering a supernova that was 570 billion times brighter than the sun to attempting to predict this year's Super Bowl winners based on past results. So what else can we expect in the next year for machine learning, and how will it affect us? Based on what we've seen over the last three years, here are a few predictions about what we can expect to happen in 2016 (with maybe a little wishful thinking mixed in too!).

Machine Learning becomes the new Cloud

Not too long ago every business started noticing the cloud, and with it came a shift in how companies were structured. Infrastructure was radically adapted to take full advantage of the benefits that the cloud offers, and it doesn't look to be slowing down, with Microsoft recently promising to spend over $1 billion in providing free cloud resources for non-profits. Starting this year, it's plausible that we'll see a new drive to also bake machine learning into the infrastructure. Why? Because every company will want to jump on that machine learning bandwagon! The benefits and boons to every company are pretty enticing - ML offers everything from grandiose artificial intelligence to the much more mundane, such as improvements to recommendation engines and targeted ads; so don't be surprised if this year everyone attempts to work out what ML can do for them and starts investing in it.

The growth of MLaaS

Last year we saw Machine Learning as a Service appear on the market in bigger numbers. Amazon, Google, IBM, and Microsoft all have their own algorithms available to customers. It's a pretty logical move, and one that's not at all surprising. Why? Well, for one thing, data scientists are still as rare as unicorns. Sure, universities are creating new courses and training has become more common, but the fact remains we won't be seeing the benefits of these initiatives for a few years. Second, setting up everything for your own business is going to be expensive. Lots of smaller companies simply don't have the money to invest in their own personal machine learning systems right now, or the time needed to fine-tune them. This is where sellers are going to be putting their investments this year - into the smaller companies who can't afford a full ML experience without outside help.

Smarter Security with better protection

The next logical step in security is tech that can sense when there are holes in its own defenses and adapt to them before trouble strikes. ML has been used in one form or another for several years in fraud prevention, but in the IT sector we've been relying on static rules to detect attack patterns. Imagine if systems could detect irregular behavior accurately, or set up risk scores dynamically in order to ensure users had the best protection they could at any time. We're a long way from this being fool-proof unfortunately, but as the year progresses we can expect to see the foundations of this start being laid. After all, we're already starting to talk about it.
Machine Learning and the Internet of Things combine

We're already nearly there, but with the rise in interest in the IoT, we can expect these two powerhouses to finally combine. The perfect dream for IoT hobbyists has always been something out of the Jetsons or Wallace and Gromit - when you pass that sensor by the frame of your door in the morning, your kettle suddenly springs to life so you're able to have that morning coffee without waiting like the rest of us primals; but in truth the Internet of Things has the potential to be so much more than just making the lives of hobbyists easier. By 2020 it is expected that over 25 billion 'Things' will be connected to the internet, and each one will be collating reams and reams of data. For a business with the capacity to process this data, the insight it could collect is a huge boon for everything from new products to marketing strategy. For IoT to really live up to the dreams we have for it, we need a system that can recognize and collate relevant data, which is where an ML system is sure to take center stage.

Big things are happening in the world of machine learning, and I wouldn't be surprised if something incredibly left-field happens in the data world that takes us all by surprise. But what do you think is next for ML? If you're looking to either start getting into the art of machine learning or boost your skills to the next level, then be sure to give our Machine Learning tech page a look; it's filled with our latest and greatest ML books and videos out right now, along with the titles we're releasing soon, available to preorder in your format of choice.


Is Your Machine Learning Plotting To Kill You?

Sam Wood
21 Jan 2016
4 min read
Artificial Intelligence is just around the corner. Of course, it's been just around the corner for decades, but in part that's down to our own tendency to move the goalposts about what 'intelligence' is. Once, playing chess was one of the smartest things you could do. Now that a computer can easily beat a Grand Master, we've reclassified it as just standard computation, not requiring proper thinking skills. With the rise of deep learning and the proliferation of machine learning analytics, we edge ever closer to the moment when a computer system will be able to accomplish anything and everything better than a human can. So should we start worrying about SkyNet? Yes and no.

Rule of the Human Overlords

Early use of artificial intelligence will probably look a lot like how we use machine learning today. We'll see 'AI-empowered humans' acting as the Human Overlords to their robot servants. These AI are smart enough to come up with the 'best options' to address human problems, but haven't been given the capability to execute them. Think about Google Maps - there, an extremely 'intelligent' artificial program comes up with the quickest route for you to take to get from point A to point B. But it doesn't force you to take it - you get to decide from the options offered which one will best suit your needs. This is likely what working alongside the first AI will look like.

Rise of the Driverless Car

The problem is that we are almost certainly going to see the power of AI increase exponentially - and any human greenlighting will become an increasingly inefficient part of the system. In much the same way that we'll let the Google Maps AI start to make decisions for us when we let it drive our driverless cars, we'll likely start turning more and more of our decisions over for AI to take responsibility for. Super-smart AI will also likely be able to comprehend things that humans just can't understand. The mass of data it has analysed will be beyond any one human to judge effectively. Even today, financial algorithms are making instantaneous choices about the stock market - with humans just clicking 'yes' because the computer knows best. We've already seen electronic trading glitches leading to economic crises - six years ago! Just how much responsibility might we start turning over to smart machines?

The Need to Solve Ethics

If we've given power to an AI to make decisions for us, we'll want to ensure it has our best interests at heart, right? It's vital to program some sort of ethical system into our AI - the problem is, humans aren't very good at deciding what is and isn't ethical! Think about a simple and seemingly universal rule like 'Don't kill people'. Now think about all the ways we disagree about when it's okay to break that rule - in self-defence, in executing dangerous criminals, to end suffering, in combat. Imagine trying to code all of that into an AI, for every different moral variation. Arguably, it might be beyond human capacity. And as for right and wrong, well, we've had thousands of years of debate about that and we still can't agree exactly what is and isn't ethical. So how can we hope to program a morality system we'd be happy to give to an increasingly powerful AI?

Avoiding SkyNet

It may seem a little ridiculous to start worrying about the existential threat of AI when your machine learning algorithms keep bugging out on you constantly.
And certainly, the possibilities offered by AI are amazing - more intelligence means faster, cheaper, and more effective solutions to humanity's problems. So despite the risk of being outpaced by alien machine minds that have no concept of our human value system, we must always balance that risk against the amazing potential rewards. Perhaps what's most important is just not to be blasé about what super-intelligence means for AI. And frankly, I can't remember how I lived before Google Maps.


Why an algorithm will never win a Pulitzer

Richard Gall
21 Jan 2016
6 min read
In 2012, a year which feels a lot like the very early years of the era of data, Wired published this article on Narrative Science, an organization based in Chicago that uses Machine Learning algorithms to write news articles. Its founder and CEO, Kris Hammond, is a man whose enthusiasm for algorithmic possibilities is unparalleled. When asked whether an algorithm would win a Pulitzer in the next 20 years, he went further, claiming that it could happen in the next 5 years.

Hammond's excitement at what his organization is doing is not unwarranted. But his optimism certainly is. Unless 2017 is a particularly poor year for journalism and literary nonfiction, a Pulitzer for one of Narrative Science's algorithms looks unlikely, to say the least. But there are a couple of problems with Hammond's enthusiasm. He fails to recognise the limitations of algorithms: the fact that the job of even the most intricate and complex Deep Learning algorithm is very specific, and quite literally determined by the people who create it. "We are humanising the machine," he's quoted as saying in a Guardian interview from June 2015. "Based on general ideas of what is important and a close understanding of who the audience is, we are giving it the tools to know how to tell us stories." It's important to notice here how he talks - it's all about what 'we're' doing. The algorithms that are central to Narrative Science's mission are things that are created by people, by data scientists. It's easy to read what's going on as a simple case of the machines taking over. True, perhaps there is cause for concern among writers when he suggests that in 25 years 90% of news stories will be created by algorithms, but in actual fact there's just a simple shift in where labour is focused.

It's time to rethink algorithms

We need to rethink how we view and talk about data science, Machine Learning and algorithms. We see algorithms, for example, as impersonal, blandly futuristic things. Although they might be crucial to our personalized online experiences, they are regarded as the hypermodern equivalent of the inauthentic handshake of a door-to-door salesman. Similarly, at the other end, the process of creating them is viewed as a feat of engineering: maths and statistics nerds tackling the complex interplay of statistics and machinery. Instead, we should think of algorithms as something creative, things that organize and present the world in a specific way, like a well-designed building. If an algorithm did indeed win a Pulitzer, wouldn't it really be the team behind it that deserves it? When Hammond talks, for example, about "general ideas of what is important and a close understanding of who the audience is," he is referring very much to a creative process. Sure, it's the algorithm that learns this, but it nevertheless requires the insight of a scientist, an analyst, to consider these factors, and to consider how their algorithm will interact with the irritating complexity and unpredictability of reality.

Machine Learning projects, then, are as much about designing algorithms as they are about programming them. There's a certain architecture, a politics, that informs them. It's all about prioritization and organization, and those two things aren't just obvious; they're certainly not things which can be identified and quantified. They are instead things that inform the way we quantify, the way we label. The very real fingerprints of human imagination, and indeed fallibility, are in the algorithms we experience every single day.
Algorithms are made by people

Perhaps we've all fallen for Hammond's enthusiasm. It's easy to see algorithms as the key to the future, and forget that really they're just things that are made by people. Indeed, it might well be that they're so successful that we forget they've been made by anyone - it's usually only when algorithms don't work that the human aspect emerges. The data team have done their job when no one realises they are there.

An obvious example: you can see it when Spotify recommends some bizarre songs that you would never even consider listening to. The problem here isn't simply a technical one; it's how different tracks or artists are tagged and grouped, and how they are made to fit within a particular dataset. It's an issue of context - to build a great Machine Learning system you need to be alive to the stories and ideas that permeate the world in which your algorithm operates; if you, as the data scientist, lack this awareness, so will your Machine Learning project.

But there have been more problematic and disturbing incidents, such as when Flickr auto-tagged people of color in pictures as apes, due to the way a visual recognition algorithm had been trained. In this case, the issue is a lack of sensitivity about the way in which an algorithm may work - the things it might run up against when it's faced with the messiness of the real world, with its conflicts, its identities, ideas and stories. The story of Solid Gold Bomb, too, is a reminder of the unintended consequences of algorithms. It's a reminder of the fact that we can be lazy with algorithms; instead of being designed with thought and care they become a surrogate for it. What's more, they always give us a get-out clause; we can blame the machine if something goes wrong.

If this all sounds like I'm simply down on algorithms, that I'm a technological pessimist, you're wrong. What I'm trying to say is that it's humans that are really in control. If an algorithm won a Pulitzer, what would that imply? It would mean the machines have won. It would mean we're no longer the ones doing the thinking, solving problems, finding new ones.

Data scientists are designers

As the economy becomes reliant on technological innovation, it's easy to remove ourselves, to underplay the creative thinking that drives what we do. That's what Hammond's doing, in his frenzied excitement about his company - he's forgetting that it's him and his team that are finding their way through today's stories. It might be easier to see creativity at work when we cast our eyes towards game development and web design, but data scientists are designers and creators too. We're often so keen to stress the technical aspects of these sorts of roles that we forget this important part of the data scientist skillset.


Data Science Is the New Alchemy

Erol Staveley
18 Jan 2016
7 min read
Every day I come into work and sit opposite Greg. Greg (in my humble opinion) is a complete badass. He directly turns information that we've had hanging around for years and years into actual currency. Single-handedly, he generates more direct revenue than any one individual in the business. When we were shuffling seating positions not too long ago (we now have room for that standing desk I've always wanted ❤), we were afraid to turn off his machine for fear of losing thousands upon thousands of dollars. I remember somebody saying "guys, we can't unplug Skynet". Nobody fully knows how it works. Nobody except Greg. We joked that by turning off his equipment, we'd ruin Greg's on-the-side Bitcoin mining gig that he was probably running off the back of the company network. We then all looked at one another in a brief moment of silence. We were all thinking the same thing - it wouldn't surprise any of us if Greg was actually doing this. We wouldn't know any better.

To many, what Greg does is like modern-day alchemy. In reality, Greg is a data scientist - an increasingly crucial role that helps businesses deliver more meaningful, relevant interactions with their customers. I like to think of them more as new-age alchemists, who wield keyboards instead of perfectly choreographed vials and alembics. This week - find out how to become a data alchemist with R. Save 50% on some of our top titles... or pick up any 5 for $50! Find them all here!

Content might have been king a few years back. Now, it's data. Everybody wants more - and the people who can actually make sense of it all. By surveying 20,000 developers, we found out just how valuable these roles are to businesses of all shapes and sizes. Let's take a look.

Every Kingdom Needs an Alchemist

Even within quite a technical business, Greg's work lends a fresh perspective on what it is other developers want from our content. Putting the value of direct revenue generation to one side, the insight we've derived from purchasing patterns and user behaviour is incredibly valuable. We're constantly challenging our own assumptions, and spending more time looking at what our customers are actually doing. We're not alone in taking this increasingly data-driven approach. In general, the highest data science salaries are paid by large enterprises. This isn't too surprising considering that's where the real troves of precious data reside. At such scale, the aggregation and management of data alone can warrant the recruitment of specialised teams. On average though, SMEs are not too far behind when it comes to how much they're willing to pay for top talent.

Average salary by company size.

Apache Spark was a particularly important focus going forward for folks in the Enterprise segment. What's clear is that data science isn't just for big businesses any more. It's for everybody. We can see that in the growth of data-related roles at SMEs. We're paying more attention to data because it represents the actions of our customers, but also because we've just got more of it lying around all over the place. Irrespective of company size, the range of industries we captured (and classified) was colossal. Seems like everybody needs an alchemist these days.

They Double as Snake Charmers

When supply is low and demand is high in a particular job market, we almost always see people move to fill the gap. It's a key driver of learning. After all, if you're trying to move to a new role, you're likely to be developing new skills.
It's no surprise that Python is the go-to choice for data science. It's an approachable language with some great introductory resources out there on the market, like Python for Secret Agents. It also has a fantastic ecosystem of data science libraries and documentation that can help you get up and running quite quickly.

Percentage of respondents who said they used a given technology.

When looking at roles in more detail, you see strong patterns between the technologies used. For example, those using Python were most likely to also be using R. When you dive deeper into the data you start to notice a lot of crossover between certain segments. It was at this point that we were also able to start seeing the relationships between certain technologies in specific segments. For example, the Financial sector was more likely to use R, and also paid (on average) higher salaries to those who had a more diverse technical background.

Alchemists Have Many Forms

Back at a higher level, what was really interesting is the natural technology groupings that started to emerge between four very distinct 'types' of data alchemist. "What are they?", I hear you ask.

The Visualizers

Those who bring data to life. They turn what would otherwise be a spreadsheet or a copy-and-paste pie chart into delightful infographics and informative dashboards. Welcome to the realm of D3.js and Tableau.

The Wranglers

The SME all-stars. They aggregate, clean and process data with Python whilst leveraging the functionality of libraries like pandas to their full potential. A jack of all trades, master of all.

The Builders

Those who use Hadoop and other open source tools to deploy and maintain large-scale data projects. They keep the world running by building robust, scalable data platforms.

The Architects

Those who harness the might of the enterprise toolchain. They co-ordinate large-scale Oracle and Microsoft deployments, the sheer scale of which would break the minds of mere mortals.

Download the Full Report

With 20,000 developers taking part overall, our most recent data science survey contains plenty of juicy information about real-world skills, salaries and trends. Packtpub.com

In a Land of Data, the Alchemist is King

We used to have our reports delivered in Excel. Now we have them as notebooks on Jupyter. If it really is a golden age for developers, data scientists must be having a hard time keeping their inboxes clear of all the recruitment spam. What's really interesting going forward is that the volume of information we have to deal with is only going to increase. Once IoT really kicks off and wearables become more commonly accepted (the sooner the better if you're Apple), businesses of all sizes will find dealing with data overload to be a key growing pain - regardless of industry. Plenty of web services and platforms are already popping up, promising to deliver 'actionable insight' to everybody who can spare the monthly fees. This is fine for standardised reporting and metrics like bounce rate and conversion, but not so helpful if you're working with a product that's unique to you. Greg's work doesn't just tell us how we can improve our SEO. It shows us how we can make our products better without having to worry about internal confirmation bias. It helps us better serve our customers. That's why present-day alchemists like Greg are heroes.

Level Up Your Company's Big Data With Resource Management

Timothy Chen
24 Dec 2015
4 min read
Big data was once one of the biggest technology hypes: tons of presentations and posts talked about how the new systems and tools allow large and complex data to be processed in ways that traditional tools weren't able to. While Big data was at the peak of its hype, most companies were still getting familiar with the new data processing frameworks such as Hadoop, and new databases such as HBase and Cassandra. Fast forward to now: Big data is still a popular topic, lots of companies have already jumped on the Big data bandwagon, and many are moving past first-generation Hadoop to evaluate newer tools such as Spark and newer databases such as Firebase, NuoDB or MemSQL. But most companies that run all of these tools also learn that deploying, operating and capacity planning for them is very hard and complicated. Although over time many of these tools have become more mature, they are still usually run in their own independent clusters. It's also not rare to find multiple Hadoop clusters in the same company, since multi-tenancy isn't built into many of these tools and you run the risk of a few non-critical big data jobs overloading the cluster.

Problems running independent Big data clusters

There are a lot of problems when you run many of these independent clusters. One of them is monitoring and visibility: each of these clusters has its own management tools, and integrating them with the company's shared monitoring and management tools is a huge challenge, especially when onboarding yet another framework with yet another cluster. Another problem is multi-tenancy. Although having independent clusters keeps one organization's jobs from overtaking another's cluster, it still doesn't solve the problem of a bug in a Hadoop application using up all the available resources, and the pain of debugging this is horrific. Yet another problem is utilization: a cluster is usually not 100% utilized, and all of these instances running in Amazon or in your datacenter are just racking up bills while doing no work. There are more major pain points that I don't have time to get into.

Hadoop v2

The Hadoop developers and operators saw this problem, and in the second generation of Hadoop they developed a separate resource management tool called YARN: a single management framework that manages all of the resources in the cluster for Hadoop, enforces the resource limits of jobs, integrates security into the workload, and even optimizes workloads by automatically placing jobs closer to the data. This solves a huge problem when operating a Hadoop cluster, and also makes it possible to consolidate all of the Hadoop clusters into one cluster, since it allows finer-grained control over the workload and improves the efficiency of the cluster.

Beyond Hadoop

Now, with the vast number of Big data technologies growing in the ecosystem, there is a need for a common resource management layer across all of the tools; without a single resource management system spanning all the frameworks, we run back into the same problems mentioned before. And when all these frameworks run under one resource management platform, a lot of options for optimization and resource scheduling become possible. Here are some examples of what could be possible with one resource management platform: the platform can understand the entire cluster workload and the available resources, and can automatically resize and scale jobs up and down based on workloads across all of these tools.
Beyond Hadoop

With the vast number of Big data technologies now growing in the ecosystem, there is a need for a common resource management layer across all of the tools; without a single resource management system spanning the frameworks, we run straight back into the problems mentioned before. And once all of these frameworks run under one resource management platform, a lot of options for optimization and resource scheduling become possible. Here are some examples of what one resource management platform makes possible. Because the platform understands the entire cluster workload and the available resources, it can automatically resize and scale workloads up and down across all of these tools. It can also resize jobs according to priority: the cluster can detect under-utilization in other jobs and offer the slack resources to Spark batch jobs without impacting your very important workloads from other frameworks, meeting the same business deadlines while saving a lot of cost.

In the next post I'll cover Mesos, one such resource management system, and the upcoming features in Mesos that make the optimizations I mentioned possible. For more Big Data tutorials and analysis, visit our dedicated Hadoop and Spark pages.

About the author

Timothy Chen is a distributed systems engineer and entrepreneur. He works at Mesosphere and can be found on GitHub @tnachen.

article-image-level-your-companys-big-data-mesos

Level Up Your Company's Big Data with Mesos

Timothy Chen
23 Dec 2015
5 min read
In my last post I talked about how using a resource management platform can make your Big data workloads more efficient with fewer resources. In this post I want to continue the discussion with a specific resource management platform: Mesos.

Introduction to Mesos

Mesos is an Apache top-level project that provides an abstraction over your datacenter resources, along with an API to program against those resources to launch and manage your workloads. Mesos can manage CPU, memory, disk, ports and any other resources that the user defines. Every application that wants to use datacenter resources to run tasks talks to Mesos and is called a scheduler. A scheduler uses the scheduler API to receive resource offers, and it can decide to use an offer, decline it and wait for future ones, or hold on to it for a period of time in order to combine resources. Mesos enforces fairness among multiple schedulers, so no single scheduler can take over all of the resources.

So how do your Big data frameworks benefit from using Mesos in your datacenter?

Autopilot your Big data frameworks

The first benefit of running your Big data frameworks on top of Mesos, which abstracts away resources and provides an API to program against your datacenter, is that it allows each Big data framework to manage itself with minimal human intervention. How does the Mesos scheduler API provide this self-management? First, it helps to understand what the scheduler API lets you do. It provides a set of callbacks that fire whenever the following events occur: new resources become available, a task changes status, a slave is lost, an executor is lost, the scheduler registers or disconnects, and so on. By reacting to each event with its own logic, a framework can deploy itself, handle failures, scale and more.

Using Spark as an example: when a new Spark job is launched, it starts a new scheduler that waits for resources from Mesos. When new resources become available, the scheduler deploys Spark executors to those nodes automatically, hands the Spark task information to the executors, and communicates the results back to the scheduler. If a task terminates unexpectedly for some reason, the Spark scheduler receives the notification and can automatically relaunch that task on another node and attempt to resume the job. If a machine crashes, the Spark scheduler is also notified and can relaunch that node's executors on other available resources. Moreover, since the Spark scheduler chooses where to launch tasks, it can pick the nodes that provide the most data locality for the data it is about to process, or deploy the Spark executors across different racks for higher availability if it's a long-running Spark Streaming job. As you can see, programming against an API gives the Big data frameworks a lot of flexibility and self-management, and saves a lot of the manual scripting and automation that would otherwise be needed.
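To make the callback model concrete, here is a minimal scheduler skeleton, sketched under the assumption that you are using the Python bindings Mesos shipped at the time (mesos.interface and mesos.native); it only logs events and declines every offer, and the master address is a placeholder.

from mesos.interface import Scheduler, mesos_pb2
import mesos.native


class LoggingScheduler(Scheduler):
    """A do-nothing scheduler that just reacts to the core callbacks."""

    def registered(self, driver, framework_id, master_info):
        print("Registered with framework id", framework_id.value)

    def resourceOffers(self, driver, offers):
        # New resources are available; a real framework would launch tasks here.
        for offer in offers:
            print("Offer received from", offer.hostname)
            driver.declineOffer(offer.id)  # decline and wait for future offers

    def statusUpdate(self, driver, update):
        # A task changed status; react by resuming, relaunching, and so on.
        print("Task", update.task_id.value, "changed state to", update.state)

    def slaveLost(self, driver, slave_id):
        # A machine went away; reschedule whatever was running there.
        print("Lost slave", slave_id.value)


if __name__ == "__main__":
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""  # let Mesos fill in the current user
    framework.name = "logging-example"

    driver = mesos.native.MesosSchedulerDriver(
        LoggingScheduler(), framework, "zk://localhost:2181/mesos")  # placeholder master
    driver.run()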
Manage your resources among frameworks and users

When multiple Big data frameworks share the same cluster, and each framework is shared by multiple users, a good policy for making sure that the important users and jobs get to run becomes very important. Mesos lets you specify roles, and multiple frameworks can belong to a role. Operators can then assign weights to these roles, and Mesos enforces fair sharing by handing out resources according to the weights specified. For example, you might give 70 percent of the resources to Spark and 30 percent to general tasks by weighting the roles accordingly. Mesos also allows a fixed amount of resources per agent to be reserved for a specific role, which ensures that your important workload is guaranteed to have enough resources to complete.

More features that help multi-tenancy are coming to Mesos. One, called Quota, reserves a certain amount of resources across the whole cluster rather than per agent. Another, called dynamic reservation, lets frameworks and operators reserve a certain amount of resources at runtime and unreserve them once they are no longer needed.

Optimize your resources among frameworks

Using Mesos also boosts utilization by allowing tasks from different frameworks to share the same cluster instead of running in separate ones. A number of features currently being worked on will push utilization even further. The first, called oversubscription, uses the runtime statistics of running tasks to estimate how many of their allocated resources are actually idle, and offers that slack to other schedulers so that more of the cluster is put to work. The oversubscription controller also monitors the original tasks, and if they start to suffer from the shared resources, it kills the tasks running on the slack so that the impact goes away. Another feature, called optimistic offers, allows multiple frameworks to compete for resources; it speeds up scheduling and gives the Mesos scheduler more input for deciding how best to place workloads in the future.

As you can see, Mesos allows your Big data frameworks to be self-managed and more efficient, and it enables optimizations that are only possible when everything shares the same resource management layer. If you're curious about getting started, head to the Mesos website, or to the Mesosphere website, which provides even simpler tools for working with your Mesos cluster.

Want more Big Data tutorials and insight? Both our Spark and Hadoop pages have got you covered.

About the author

Timothy Chen is a distributed systems engineer and entrepreneur. He works at Mesosphere and can be found on GitHub @tnachen.

article-image-biggest-big-data-and-business-intelligence-salary-and-skills-survey-2015

The biggest Big Data & Business Intelligence salary and skills survey of 2015

Packt Publishing
03 Aug 2015
1 min read
See the highlights from our comprehensive Skill Up IT industry salary reports, with data from over 20,000 IT professionals. Find out what trends are emerging in the world of data science and business intelligence and what skills you should be learning to further your career. Download the full size infographic here.    
article-image-reducing-cost-big-data-using-statistics-and-memory-technology-part-2

Reducing Cost in Big Data using Statistics and In-memory Technology - Part 2

Praveen Rachabattuni
06 Jul 2015
6 min read
In the first part of this two-part blog series, we learned that using statistical algorithms gives us a 95 percent accuracy rate for big data analytics, is faster, and is a lot more beneficial than waiting for the exact results. We also took a look at a few algorithms along with a quick introduction to Spark. Now let's take a look at two tools in depth that are used with statistical algorithms: Apache Spark and Apache Pig.

Apache Spark

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, and Python, as well as an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

At its core, Spark provides a general programming model that enables developers to write applications by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. This composition makes it easy to express a wide array of computations, including iterative machine learning, streaming, complex queries, and batch processing. In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory. This is the key to Spark's performance, as it allows applications to avoid costly disk accesses.

It would be wonderful to have one tool for everyone, and one architecture and language for investigative as well as operational analytics. Spark's ease of use comes from its general programming model, which does not constrain users to structure their applications into a bunch of map and reduce operations. Spark's parallel programs look very much like sequential programs, which makes them easier to develop and reason about. Finally, Spark allows users to easily combine batch, interactive, and streaming jobs in the same application. As a result, a Spark job can be up to 100 times faster than an equivalent Hadoop job while requiring 2 to 10 times less code.

Spark allows users and applications to explicitly cache a dataset by calling the cache() operation, as the short sketch at the end of this section shows. This means that your applications can access data from RAM instead of disk, which can dramatically improve the performance of iterative algorithms that access the same dataset repeatedly. This use case covers an important class of applications, as all machine learning and graph algorithms are iterative in nature.

When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you, so a scheduler tool such as Apache Oozie is often required to carefully construct the sequence. With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so the system has a complete picture of the execution graph. This approach allows the core scheduler to correctly map the dependencies across the different stages in the application and automatically parallelize the flow of operators without user intervention.

With a low-latency data analysis system at your disposal, it's natural to extend the engine towards processing live data streams. Spark has an API for working with streams that provides exactly-once semantics and full recovery of stateful operators. It also has the distinct advantage of giving you the same Spark APIs to process your streams, including reuse of your regular Spark application code.
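Before moving on to Pig, here is a small, hypothetical PySpark sketch of the points above: a pipeline composed from simple operators, with cache() keeping the parsed records in memory so that repeated passes over the same data avoid going back to disk. The input path and record layout are made up for illustration.

from pyspark import SparkContext

sc = SparkContext(appName="cache-example")

# Hypothetical tab-separated log files; parse once, then keep the result in memory.
records = (sc.textFile("hdfs:///data/events/*.log")
             .map(lambda line: line.split("\t"))
             .cache())

# The first action materializes the RDD and populates the cache...
print("total records:", records.count())

# ...so this second, different pass over the same data is served from RAM.
errors = records.filter(lambda fields: fields[1] == "ERROR").count()
print("error records:", errors)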
Pig on Spark

Pig on Spark combines the power and simplicity of Apache Pig with Apache Spark, making existing ETL pipelines 100 times faster than before. We do that via a unique mix of our operator toolkit, called DataDoctor, and Spark. The primary goals for the project are to:

Make data processing more powerful
Make data processing simpler
Make data processing 100 times faster than before

DataDoctor is a high-level operator DSL on top of Spark. It has frameworks for non-symmetric joins, sorting, grouping, and embedding native Spark functions. It hides a lot of complexity and makes it simple to implement data operators used in applications like Pig and Apache Hive on Spark.

Pig operates in a similar manner to big data applications like Hive and Cascading. It has a query language quite akin to SQL that allows analysts and developers to design and write data flows. The query language is translated into a "logical plan", which is further translated into a "physical plan" containing operators. Those operators are then run on the designated execution engine (MapReduce, Apache Tez, and now Spark). There are a whole bunch of details around tracking progress, handling errors, and so on that I will skip here.

Query planning on Spark varies significantly from MapReduce, as Spark handles data wrangling in a much more optimized way. Further query planning can benefit greatly from the ongoing effort on Catalyst inside Spark. At this moment, we have simply introduced a SparkPlanner that undertakes the conversion from a logical to a physical plan for Pig. Databricks is working actively to enable Catalyst to handle much of the operator optimization that will plug into SparkPlanner in the near future. Longer term, we plan to rely on Spark itself for logical plan generation. An early version of this integration has been prototyped in partnership with Databricks.

Execution proceeds as follows. Pig Core hands off Spark execution to SparkLauncher along with the physical plan. SparkLauncher creates a SparkContext, providing all the Pig dependency JAR files and Pig itself. SparkLauncher gets an MR plan object created from the physical plan. At this point, we override all the Pig operators with DataDoctor operators recursively throughout the whole plan. Two iterations are performed over the plan: one that looks at the store operations and recursively travels down the execution tree, and a second that does a breadth-first traversal over the plan and calls convert on each of the operators. The base class of converters in DataDoctor is the POConverter class, which defines the abstract method convert that is called during plan execution. More details of Pig on Spark can be found at PIG-4059.

As we merge with Apache Pig, we need to focus on the following enhancements to further improve the speed of Pig:

Cache operator: adding a new operator to explicitly tell Spark to cache certain datasets for faster execution
Storage hints: allowing the user to specify the storage location of datasets in Spark for better control of memory
YARN and Mesos support: adding resource manager support for broader deployment options

Conclusion

In many large-scale data applications, statistical perspectives provide us with fruitful analytics in many ways, including speed and efficiency.

About the author

Praveen Rachabattuni is a tech lead at Sigmoid Analytics, a company that provides a real-time streaming and ETL framework on Apache Spark. Praveen is also a committer to Apache Pig.

article-image-reducing-cost-big-data-using-statistics-and-memory-technology-part-1

Reducing Cost in Big Data using Statistics and In-memory Technology - Part 1

Praveen Rachabattuni
03 Jul 2015
4 min read
The world is shifting from private, dedicated data centers to on-demand computing in the cloud. This shift moves the onus of cost from the hands of IT companies into the hands of developers. As your data sizes start to rise, the computing cost grows linearly with them. We have found that using statistical algorithms gives us a 95 percent accuracy rate, is faster, and is a lot more beneficial than waiting for the exact results. The following are some common analytical queries that we have often come across in applications:

How many distinct elements are in the data set (that is, what is the cardinality of the data set)?
What are the most frequent elements (that is, the "heavy hitters" and "top elements")?
What are the frequencies of the most frequent elements?
Does the data set contain a particular element (search query)?
Can you filter data based upon a category?

Statistical algorithms for quicker analytics

Frequently, statistical algorithms avoid storing the original data, replacing it with hashes, which eliminates a lot of network traffic. Let's get into the details of some of these algorithms, which can help answer queries similar to those mentioned previously.

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is suitable in cases when we need to quickly filter items that are present in a set.

HyperLogLog is an approximate technique for computing the number of distinct entries in a set (its cardinality). It does this while using only a small amount of memory; for instance, to achieve 99 percent accuracy, it needs only 16 KB. When we need to count distinct elements in a dataset spread across a Hadoop cluster, we can compute the hashes on different machines, build the bit index on each, and combine the bit indexes to compute the overall number of distinct elements. This eliminates the need to move the data across the network and thus saves us a lot of time.

The Count-min sketch is a probabilistic, sub-linear-space streaming algorithm that can be used to summarize a data stream and obtain the frequency of its elements. It allocates a fixed amount of space to store count information, which does not vary over time even as more and more counts are updated. Nevertheless, it is able to provide useful estimated counts, because the accuracy scales with the total sum of all the counts stored.

Spark - a faster execution engine

Spark is a faster execution engine that provides 10 times the performance of MapReduce when combined with these statistical algorithms. Using Spark with statistical algorithms gives us a huge benefit in terms of both cost and time savings. Spark gets most of its speed by constructing Directed Acyclic Graphs (DAGs) out of the job operations and by using memory to save intermediate data, which makes reads faster. When using statistical algorithms, keeping the hashes in memory makes the algorithms work much faster.
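To get a feel for how this looks in practice, here is a small sketch that counts distinct IPs in log data two ways; newer Spark releases expose the HyperLogLog-based estimate as approx_count_distinct in pyspark.sql.functions, and the input path and log layout below are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("distinct-ips").getOrCreate()

# Hypothetical log location; assume the IP is the first whitespace-delimited field.
logs = spark.read.text("s3://my-bucket/userlogs/2015-06-01/")
ips = logs.select(F.regexp_extract("value", r"^(\S+)", 1).alias("ip"))

exact = ips.agg(F.countDistinct("ip")).first()[0]                       # exact, shuffles every distinct IP
estimate = ips.agg(F.approx_count_distinct("ip", rsd=0.01)).first()[0]  # HyperLogLog++, about 1% relative error
print("exact:", exact, "estimated:", estimate)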
Case study

Let's say we have a continuous stream of user log data coming in at a rate of 4.4 GB per hour, and we need to analyze the distinct IPs in the logs on a daily basis. At my old company, when MapReduce was used to process the data, it took about 6 hours to process one day's worth of data, a size of 106 GB. We had an AWS cluster consisting of 50 spot instances and 4 on-demand instances running to perform the analysis, at a cost of $150 per day. Our system was then shifted to use Spark and HyperLogLog, which brought the cost down to $16.50 per day. To summarize, we now process a 3.1 TB stream of data every month at a cost of $495, which was costing about $4,500 on the original MapReduce system without the statistical algorithm in place.

Further reading

In the second part of this two-part blog series, we will discuss two tools in depth: Apache Spark and Apache Pig. We will take a look at how Pig combined with Spark makes existing ETL pipelines 100 times faster, and we will further our understanding of how statistical perspectives positively affect data analytics.

About the author

Praveen Rachabattuni is a tech lead at Sigmoid Analytics, a company that provides a real-time streaming and ETL framework on Apache Spark. Praveen is also a committer to Apache Pig.