
How-To Tutorials


How to stream and store tweets in Apache Kafka

Fatema Patrawala
22 Dec 2017
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Ankit Jain titled Mastering Apache Storm. This book explores various real-time processing functionalities offered by Apache Storm such as parallelism, data partitioning, and more.[/box] Today, we are going to cover how to stream tweets from Twitter using the twitter streaming API. We are also going to explore how we can store fetched tweets in Kafka for later processing through Storm. Setting up a single node Kafka cluster Following are the steps to set up a single node Kafka cluster:   Download the Kafka 0.9.x binary distribution named kafka_2.10-0.9.0.1.tar.gz from http://apache.claz.org/kafka/0.9.0. or 1/kafka_2.10-0.9.0.1.tgz. Extract the archive to wherever you want to install Kafka with the following command: tar -xvzf kafka_2.10-0.9.0.1.tgz cd kafka_2.10-0.9.0.1   Change the following properties in the $KAFKA_HOME/config/server.properties file: log.dirs=/var/kafka- logszookeeper.connect=zoo1:2181,zoo2:2181,zoo3:2181 Here, zoo1, zoo2, and zoo3 represent the hostnames of the ZooKeeper nodes. The following are the definitions of the important properties in the server.properties file: broker.id: This is a unique integer ID for each of the brokers in a Kafka cluster. port: This is the port number for a Kafka broker. Its default value is 9092. If you want to run multiple brokers on a single machine, give a unique port to each broker. host.name: The hostname to which the broker should bind and advertise itself. log.dirs: The name of this property is a bit unfortunate as it represents not the log directory for Kafka, but the directory where Kafka stores the actual data sent to it. This can take a single directory or a comma-separated list of directories to store data. Kafka throughput can be increased by attaching multiple physical disks to the broker node and specifying multiple data directories, each lying on a different disk. It is not much use specifying multiple directories on the same physical disk, as all the I/O will still be happening on the same disk. num.partitions: This represents the default number of partitions for newly created topics. This property can be overridden when creating new topics. A greater number of partitions results in greater parallelism at the cost of a larger number of files. log.retention.hours: Kafka does not delete messages immediately after consumers consume them. It retains them for the number of hours defined by this property so that in the event of any issues the consumers can replay the messages from Kafka. The default value is 168 hours, which is 1 week. zookeeper.connect: This is the comma-separated list of ZooKeeper nodes in hostname:port form.    Start the Kafka server by running the following command: > ./bin/kafka-server-start.sh config/server.properties [2017-04-23 17:44:36,667] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) [2017-04-23 17:44:36,668] INFO Kafka version : 0.9.0.1 (org.apache.kafka.common.utils.AppInfoParser) [2017-04-23 17:44:36,668] INFO Kafka commitId : a7a17cdec9eaa6c5 (org.apache.kafka.common.utils.AppInfoParser) [2017-04-23 17:44:36,670] INFO [Kafka Server 0], started (kafka.server.KafkaServer) If you get something similar to the preceding three lines on your console, then your Kafka broker is up-and-running and we can proceed to test it. Now we will verify that the Kafka broker is set up correctly by sending and receiving some test messages. 
Collecting tweets

We are assuming you already have a Twitter account, and that a consumer key and access token have been generated for your application. You can refer to https://bdthemes.com/support/knowledge-base/generate-api-key-consumer-token-access-key-twitter-oauth/ to generate a consumer key and access token. Take the following steps:

1. Create a new Maven project with groupId com.stormadvance and artifactId kafka_producer_twitter.

2. Add the following dependencies to the pom.xml file. We are adding the Kafka and Twitter4J streaming Maven dependencies to support the Kafka producer and the streaming of tweets from Twitter:

    <dependencies>
      <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.10</artifactId>
        <version>0.9.0.1</version>
        <exclusions>
          <exclusion>
            <groupId>com.sun.jdmk</groupId>
            <artifactId>jmxtools</artifactId>
          </exclusion>
          <exclusion>
            <groupId>com.sun.jmx</groupId>
            <artifactId>jmxri</artifactId>
          </exclusion>
        </exclusions>
      </dependency>
      <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j-impl</artifactId>
        <version>2.0-beta9</version>
      </dependency>
      <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-1.2-api</artifactId>
        <version>2.0-beta9</version>
      </dependency>
      <!-- https://mvnrepository.com/artifact/org.twitter4j/twitter4j-stream -->
      <dependency>
        <groupId>org.twitter4j</groupId>
        <artifactId>twitter4j-stream</artifactId>
        <version>4.0.6</version>
      </dependency>
    </dependencies>

3. Now, we need to create a class, TwitterData, that contains the code to consume/stream data from Twitter and publish it to the Kafka cluster. We are assuming you already have a running Kafka cluster and a topic, twitterData, created in it; for the installation of the Kafka cluster and the creation of a Kafka topic, refer to the previous section. The class contains an instance of the twitter4j.conf.ConfigurationBuilder class; we need to set the access token and consumer keys in the configuration, as shown in the source code.

4. The twitter4j.StatusListener class returns the continuous stream of tweets inside its onStatus() method. We are using the Kafka producer code inside the onStatus() method to publish the tweets to Kafka. The following is the source code for the TwitterData class:
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import twitter4j.StallWarning;
    import twitter4j.Status;
    import twitter4j.StatusDeletionNotice;
    import twitter4j.StatusListener;
    import twitter4j.TwitterStream;
    import twitter4j.TwitterStreamFactory;
    import twitter4j.conf.ConfigurationBuilder;
    import twitter4j.json.DataObjectFactory;

    public class TwitterData {

        /** The actual Twitter stream. It's set up to collect raw JSON data */
        private TwitterStream twitterStream;

        static String consumerKeyStr = "r1wFskT3q";
        static String consumerSecretStr = "fBbmp71HKbqalpizIwwwkBpKC";
        static String accessTokenStr = "298FPfE16frABXMcRIn7aUSSnNneMEPrUuZ";
        static String accessTokenSecretStr = "1LMNZZIfrAimpD004QilV1pH3PYTvM";

        public void start() {
            ConfigurationBuilder cb = new ConfigurationBuilder();
            cb.setOAuthConsumerKey(consumerKeyStr);
            cb.setOAuthConsumerSecret(consumerSecretStr);
            cb.setOAuthAccessToken(accessTokenStr);
            cb.setOAuthAccessTokenSecret(accessTokenSecretStr);
            cb.setJSONStoreEnabled(true);
            cb.setIncludeEntitiesEnabled(true);

            // instance of TwitterStreamFactory
            twitterStream = new TwitterStreamFactory(cb.build()).getInstance();

            final Producer<String, String> producer =
                    new KafkaProducer<String, String>(getProducerConfig());

            // topicDetails: the twitterData topic is assumed to exist already, for example
            // CreateTopic("127.0.0.1:2181").createTopic("twitterData", 2, 1);

            /** Twitter listener **/
            StatusListener listener = new StatusListener() {
                public void onStatus(Status status) {
                    ProducerRecord<String, String> data = new ProducerRecord<String, String>(
                            "twitterData", DataObjectFactory.getRawJSON(status));
                    // send the data to kafka
                    producer.send(data);
                }

                public void onException(Exception arg0) {
                    System.out.println(arg0);
                }

                public void onDeletionNotice(StatusDeletionNotice arg0) {
                }

                public void onScrubGeo(long arg0, long arg1) {
                }

                public void onStallWarning(StallWarning arg0) {
                }

                public void onTrackLimitationNotice(int arg0) {
                }
            };

            /** Bind the listener **/
            twitterStream.addListener(listener);

            /** GOGOGO **/
            twitterStream.sample();
        }

        private Properties getProducerConfig() {
            Properties props = new Properties();
            // List of kafka brokers. The complete list of brokers is not required as
            // the producer will auto-discover the rest of the brokers.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("batch.size", 1);
            // Serializer used for sending data to kafka. Since we are sending strings,
            // we are using StringSerializer.
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("producer.type", "sync");
            return props;
        }

        public static void main(String[] args) throws InterruptedException {
            new TwitterData().start();
        }
    }

Use valid Kafka properties and Twitter credentials before executing TwitterData. After executing the preceding class, you will have a real-time stream of tweets in Kafka. In the next section, we are going to cover how we can use Storm to calculate the sentiments of the collected tweets.

To summarize, we covered how to install a single node Apache Kafka cluster and how to collect tweets from Twitter and store them in a Kafka cluster. (A short Python sketch for sanity-checking the stored tweets follows below.) If you enjoyed this post, check out the book Mastering Apache Storm to know more about the different types of real-time processing techniques used to create distributed applications.
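As a quick, hedged sanity check (not from the book), you can read the raw tweet JSON that TwitterData publishes to the twitterData topic with a few lines of Python. This assumes `pip install kafka-python`, a broker on localhost:9092, and that the tweets were stored as raw Twitter JSON as in the class above.

```python
# Read the raw tweet JSON from the "twitterData" topic and print each tweet's text.
# Assumptions: kafka-python is installed and the broker runs on localhost:9092.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "twitterData",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for record in consumer:
    tweet = record.value
    # "user"/"screen_name" and "text" are standard fields in the raw streaming JSON
    print(tweet.get("user", {}).get("screen_name"), "->", tweet.get("text"))
```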

How Data Science saved Christmas

Aaron Lazar
22 Dec 2017
9 min read
It’s the middle of December and it’s shivery cold in the North Pole at -20°C. A fat old man sits on a big brown chair beside the fireplace, stroking his long white beard. His face has a frown on it, quite unlike his usual self. Mr. Claus quips, “Ruddy mailman should have been here by now! He’s never this late to bring in the li'l ones’ letters.”

[Image: A nervous Santa Claus on Christmas Eve, sitting in an armchair and resting his head on his hands]

Santa gets up from his chair, his trouser buttons crying for help thanks to his massive belly. He waddles over to the window and looks out. He’s sad that he might not be able to get the children their gifts in time this year. Amidst the snow, he can see a glowing red light. “Oh, Rudolph!” he chuckles. All across the living room are pictures of little children beaming with joy, holding their presents in their hands. A small smile starts building and then suddenly, Santa gets a new-found determination to get the presents over to the children, come what may! An idea strikes him as he waddles over to his computer room.

Now, Mr. Claus may be old on the outside, but on the inside, he’s nowhere close! He recently set up a new rig, all by himself: six Nvidia GTX Titans coupled with sixteen gigs of RAM, a 40-inch curved monitor that he uses to keep an eye on who’s being naughty or nice, and a 1000-watt home theater system with surround sound, heavy on the bass. On the inside, he’s got a whole load of software, the likes of the Python language (not the Garden of Eden variety), OpenCV - his all-seeing eye on the kids - and, well, TensorFlow et al.

Now, you might wonder what an old man is doing with such heavy software and hardware. A few months ago, Santa caught wind that there’s a new and upcoming trend that involves working with tonnes of data - cleaning, processing, and making sense of it. The idea of crunching data somehow tickled the old man, and since then the jolly good master tinkerer and his army of merry elves have been experimenting away with data. Santa’s pretty much self-taught at whatever he does, be it driving a sleigh or learning something new. A few interesting books he picked up from Packt were Python Data Science Essentials - Second Edition, Hands-On Data Science and Python Machine Learning, and Python Machine Learning - Second Edition. After spending some time on the internet, he put together a list of things he needed to set up his rig and got them from Amazon.

[Image: Santa Claus using a laptop on a rooftop]

He quickly boots up the computer and starts up TensorFlow. He needs to come up with a list of probable things that each child would have wanted for Christmas this year. Now, there are over 2 billion children in the world, and finding each one’s wish is going to be more than a task! But nothing is too difficult for Santa! He gets to work, his big head buried in his keyboard, his long locks falling over his shoulder.
So, this was his plan. Considering that the kids might have shared their secret wish with someone, Santa plans to tackle the problem from different angles, to reach a higher probability of getting the right gifts:

- He plans to gather email and social media data from all the kids’ computers - all from the past month.
- It’s a good thing kids have started owning phones at such an early age now - he plans to analyze all incoming and outgoing phone calls that have happened over the course of the past month.
- He taps into every country's local police department records to stream security footage from all over the world.

[Image: A young boy wearing a red Christmas hat and red sweater writing a letter to Santa Claus, sitting at a wooden table in front of a Christmas tree]

If you’ve read this far, you’re probably wondering whether this article is about Mr. Claus or Mr. Bond. Yes, the equipment and strategy would have fit an MI6 or a CIA agent’s role. You never know, Santa might just be a retired agent. Do they ever retire? Hmm! Anyway, it takes a while before he can get all the data he needs. He trusts Spark to sort this data in order, and it is stored in a massive data center in his basement (he’s a bit cautious after all the news about data breaches). And he’s off to work! He sifts through the emails and messages, snorting from time to time at some of the hilarious ones. TensorFlow rips through the data, picking out keywords for Santa. It takes him a few hours to get done with the emails and social media data alone! By the time he has a list, it’s evening and time for supper. Santa calls it a day and prepares to continue the next day.

The next day, Santa gets up early and boots up his equipment as he brushes and flosses. He plonks himself in the huge swivel chair in front of the monitor, munching on freshly baked gingerbread. He starts tapping into all the phone company databases across the world, fetching all the data into his data center. Now, Santa can’t afford to spend the whole time analyzing voices himself, so he lets TensorFlow analyze the voices and segregate the keywords it picks up from the voice signals - every kid’s name mapped to a possible gift. Now, there were a lot of unmentionable things that got linked to several kids’ names. Santa almost fell off his chair when he saw the list. “These kids grow up way too fast these days!” It’s almost 7 PM when Santa realizes that there’s way too much data to process in a day.

A few days later, Santa returns to his tech abode to check up on the progress of the call data processing. There’s a huge list waiting in front of him. He thinks to himself, “This will need a lot of cleaning up!” He shakes his head, thinking, “I should have started with this!” He now has to munge through all that camera footage! Santa had never worked on so much data before, so he started to get a bit worried that he might be unable to analyze it in time. He started pacing around the room trying to think up a workaround. Time was flying by and he still did not know how to speed up the video analysis. Just when he’s about to give up, the door opens and Beatrice walks in. Santa almost trips as he runs to hug his wife! Beatrice is startled for a bit but then breaks into a smile. “What is it, dear? Did you miss me so much?” Santa replies, “You can’t imagine how much! I’ve been doing everything on my own and I really need your help!” Beatrice smiles and says, “Well, what are we waiting for?
Let’s get down to it!” Santa explains the problem to Beatrice in detail and tells her how far he’s reached in the analysis. Beatrice thinks for a bit and asks Santa, “Did you try using Keras on top of TensorFlow?” Santa, blank for a minute, nods his head. Beatrice continues, “Well, from my experience, Keras gives TensorFlow a boost of about 10%, which should help quicken the analysis.” Santa looks like he’s just made the best decision marrying Beatrice and hugs her again! “Bea, you’re a genius!” he cries out. “Yeah, and don’t forget to use Matplotlib!” she yells back as Santa hurries back to his abode.

He’s off to work again, this time saddling up Keras to work on top of TensorFlow. Hundreds and thousands of terabytes of video data flow into the machines. He channels the output through OpenCV and ties it with TensorFlow to add a hint of deep learning. He quickly types out some Python scripts to integrate both tools to create the optimal outcome. And then the wait begins. Santa keeps looking at his watch every half hour, hoping that the processing happens fast. The hardware has begun heating up quite a bit, and he quickly races across the room to bring over a cooler. While he waits for the videos to finish up, he starts sifting out the data from the text and audio. He remembers what Beatrice said and uses Matplotlib to visualize it. Soon he has a beautiful map of the world with all the children’s names and their possible gifts beside them.

Three days later, the video processing gets done. Keras truly worked wonders for TensorFlow! Santa now has another set of data to help him narrow down the gift list. A few hours later, he’s got his whole list visualized in Matplotlib.

[Image: Santa Claus riding his sleigh with a gift box, snow falling on a fir-tree forest]

There’s one last thing left to do! He suits up in red and races out the door to Rudolph and the other reindeer, unties them from the fence, and leads them over to the sleigh. Once they’re fastened, he loads an empty bag onto the sleigh and it magically gets filled up. He quickly checks that all is well and they’re off! It’s Christmas morning and all the kids are racing out of bed to rip their presents open! There are smiles all around and everyone’s got a gift, just as the saying goes! Even the ones who’ve been naughty have gotten gifts. Back in the North Pole, the old man is back in his abode, relaxing in an easy chair with his legs up on the table. The screen in front of him runs a real-time video feed of kids all over the world opening up their presents. A big smile on his face, Santa turns to look out the window at the glowing red light amongst the snow and takes a swig of brandy from his hip flask. Thanks to data science, this Christmas is the merriest yet!

2 ways to customize your deep learning models with Keras

Amey Varangaonkar
22 Dec 2017
8 min read
[box type="note" align="" class="" width=""]The following extract is taken from the book Deep Learning with Keras, co-authored by Antonio Gulli and Sujit Pal. [/box] Keras has a lot of built-in functionality for you to build all your deep learning models without much need for customization. In this article, the authors explain how your Keras models can be customized for better and more efficient deep learning. As you will recall, Keras is a high level API that delegates to either a TensorFlow or Theano backend for the computational heavy lifting. Any code you build for your customization will call out to one of these backends. In order to keep your code portable across the two backends, your custom code should use the Keras backend API (https://keras.io/backend/), which provides a set of functions that act like a facade over your chosen backend. Depending on the backend selected, the call to the backend facade will translate to the appropriate TensorFlow or Theano call. The full list of functions available and their detailed descriptions can be found on the Keras backend page. In addition to portability, using the backend API also results in more maintainable code, since Keras code is generally more high-level and compact compared to equivalent TensorFlow or Theano code. In the unlikely case that you do need to switch to using the backend directly, your Keras components can be used directly inside TensorFlow (not Theano though) code as described in this Keras blog (https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html) Customizing Keras typically means writing your own custom layer or custom distance function. In this section, we will demonstrate how to build some simple Keras layers. You will see more examples of using the backend functions to build other custom Keras components, such as objectives (loss functions), in subsequent sections. Keras example — using the lambda layer Keras provides a lambda layer; it can wrap a function of your choosing. For example, if you wanted to build a layer that squares its input tensor element-wise, you can say simply: model.add(lambda(lambda x: x ** 2)) You can also wrap functions within a lambda layer. For example, if you want to build a custom layer that computes the element-wise euclidean distance between two input tensors, you would define the function to compute the value itself, as well as one that returns the output shape from this function, like so: def euclidean_distance(vecs): x, y = vecs return K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True)) def euclidean_distance_output_shape(shapes): shape1, shape2 = shapes return (shape1[0], 1) You can then call these functions using the lambda layer shown as follows: lhs_input = Input(shape=(VECTOR_SIZE,)) lhs = dense(1024, kernel_initializer="glorot_uniform", activation="relu")(lhs_input) rhs_input = Input(shape=(VECTOR_SIZE,)) rhs = dense(1024, kernel_initializer="glorot_uniform", activation="relu")(rhs_input) sim = lambda(euclidean_distance, output_shape=euclidean_distance_output_shape)([lhs, rhs]) Keras example - building a custom normalization layer While the lambda layer can be very useful, sometimes you need more control. As an example, we will look at the code for a normalization layer that implements a technique called local response normalization. 
Keras example - building a custom normalization layer

While the Lambda layer can be very useful, sometimes you need more control. As an example, we will look at the code for a normalization layer that implements a technique called local response normalization. This technique normalizes the input over local input regions, but it has since fallen out of favor because it turned out not to be as effective as other regularization methods, such as dropout and batch normalization, together with better initialization methods.

Building custom layers typically involves working with the backend functions, so it involves thinking about the code in terms of tensors. As you will recall, working with tensors is a two-step process. First, you define the tensors and arrange them in a computation graph, and then you run the graph with actual data. So working at this level is harder than working in the rest of Keras. The Keras documentation has some guidelines for building custom layers (https://keras.io/layers/writing-your-own-keras-layers/), which you should definitely read.

One of the ways to make it easier to develop code against the backend API is to have a small test harness that you can run to verify that your code is doing what you want it to do. Here is a small harness adapted from the Keras source that runs your layer against some input and returns a result:

    import numpy as np
    from keras.models import Sequential

    def test_layer(layer, x):
        layer_config = layer.get_config()
        layer_config["input_shape"] = x.shape
        layer = layer.__class__.from_config(layer_config)
        model = Sequential()
        model.add(layer)
        model.compile("rmsprop", "mse")
        x_ = np.expand_dims(x, axis=0)
        return model.predict(x_)[0]

And here are some tests with layer objects provided by Keras to make sure that the harness runs okay:

    from keras.layers.core import Dropout, Reshape
    from keras.layers.convolutional import ZeroPadding2D
    import numpy as np

    x = np.random.randn(10, 10)
    layer = Dropout(0.5)
    y = test_layer(layer, x)
    assert(x.shape == y.shape)

    x = np.random.randn(10, 10, 3)
    layer = ZeroPadding2D(padding=(1, 1))
    y = test_layer(layer, x)
    assert(x.shape[0] + 2 == y.shape[0])
    assert(x.shape[1] + 2 == y.shape[1])

    x = np.random.randn(10, 10)
    layer = Reshape((5, 20))
    y = test_layer(layer, x)
    assert(y.shape == (5, 20))

Before we begin building our local response normalization layer, we need to take a moment to understand what it really does. This technique was originally used with Caffe, and the Caffe documentation (http://caffe.berkeleyvision.org/tutorial/layers/lrn.html) describes it as a kind of lateral inhibition that works by normalizing over local input regions. In ACROSS_CHANNEL mode, the local regions extend across nearby channels but have no spatial extent. In WITHIN_CHANNEL mode, the local regions extend spatially, but are in separate channels. We will implement the WITHIN_CHANNEL mode as follows. The formula for local response normalization in the WITHIN_CHANNEL mode is given by:
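(The original article shows this formula as an image, which is not reproduced here. A standard way to write within-channel local response normalization, consistent with the hyperparameters n, alpha, beta, and k used in the layer below, is the following.)

```latex
b_{r,c} \;=\; \frac{a_{r,c}}
{\left( k \;+\; \frac{\alpha}{n^{2}} \sum_{(i,j)\,\in\, N_{n\times n}(r,c)} a_{i,j}^{2} \right)^{\beta}}
```

Here a is the input feature map (per channel), N_{n×n}(r, c) is the n×n spatial neighborhood centered on position (r, c), and b is the normalized output.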
The code for the custom layer follows the standard structure. The __init__ method is used to set the application-specific parameters, that is, the hyperparameters associated with the layer. Since our layer only does a forward computation and doesn't have any learnable weights, all we do in the build method is set the input shape and delegate to the superclass's build method, which takes care of any necessary book-keeping. In layers where learnable weights are involved, this method is where you would set their initial values. The call method does the actual computation. Notice that we need to account for dimension ordering. Another thing to note is that the batch size is usually unknown at design time, so you need to write your operations so that the batch size is not explicitly invoked. The computation itself is fairly straightforward and follows the formula closely. The sum in the denominator can also be thought of as average pooling over the row and column dimensions with a padding size of (n, n) and a stride of (1, 1). Because the pooled data is already averaged, we no longer need to divide the sum by n. The last part of the class is the get_output_shape_for method. Since the layer normalizes each element of the input tensor, the output size is identical to the input size:

    from keras import backend as K
    from keras.engine.topology import Layer, InputSpec

    class LocalResponseNormalization(Layer):

        def __init__(self, n=5, alpha=0.0005, beta=0.75, k=2, **kwargs):
            self.n = n
            self.alpha = alpha
            self.beta = beta
            self.k = k
            super(LocalResponseNormalization, self).__init__(**kwargs)

        def build(self, input_shape):
            self.shape = input_shape
            super(LocalResponseNormalization, self).build(input_shape)

        def call(self, x, mask=None):
            if K.image_dim_ordering() == "th":
                _, f, r, c = self.shape
            else:
                _, r, c, f = self.shape
            squared = K.square(x)
            pooled = K.pool2d(squared, (self.n, self.n), strides=(1, 1),
                              padding="same", pool_mode="avg")
            if K.image_dim_ordering() == "th":
                summed = K.sum(pooled, axis=1, keepdims=True)
                averaged = self.alpha * K.repeat_elements(summed, f, axis=1)
            else:
                summed = K.sum(pooled, axis=3, keepdims=True)
                averaged = self.alpha * K.repeat_elements(summed, f, axis=3)
            denom = K.pow(self.k + averaged, self.beta)
            return x / denom

        def get_output_shape_for(self, input_shape):
            return input_shape

You can test this layer during development using the test harness described earlier. It is easier to run this than to build a whole network to put the layer into, or, worse, to wait until you have fully specified the layer before running it:

    x = np.random.randn(225, 225, 3)
    layer = LocalResponseNormalization()
    y = test_layer(layer, x)
    assert(x.shape == y.shape)

Now that you have a good idea of how to build a custom Keras layer, you might find it instructive to look at Keunwoo Choi's custom mel-spectrogram layer (https://keunwoochoi.wordpress.com/2016/11/18/for-beginners-writing-a-custom-keras-layer/). Though building custom Keras layers is fairly commonplace for experienced Keras developers, such layers may not be widely useful in a general context. Custom layers are usually built to serve a specific, narrow purpose, depending on the use case in question, and Keras gives you enough flexibility to do so with ease.

If you found our post useful, make sure to check out our best-selling title Deep Learning with Keras, for other intriguing deep learning concepts and their implementation using Keras.

How to perform Exploratory Data Analysis (EDA) with Spark SQL

Amarabha Banerjee
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]Below given post is a book excerpt taken from Learning Spark SQL written by Aurobindo Sarkar. This book will help you design, implement, and deliver successful streaming applications, machine learning pipelines and graph applications using Spark SQL API.[/box] Our article aims to give you an understanding of how exploratory data analysis is performed with Spark SQL. What is Exploratory Data Analysis (EDA) Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify the possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that using a more sophisticated model-based analysis is not justified. Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data or best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a "feel", for the Dataset as a result of such exploration. More specifically, the graphical techniques allow us to efficiently select and validate appropriate models, test our assumptions, identify relationships, select estimators, detect outliers, and so on. EDA involves a lot of trial and error, and several iterations. The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between the simple and the more accurate ones. Simple models may be much easier to interpret and understand. These models can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement. For example, you should plot simple histograms and scatter plots to quickly start developing an intuition for your data. Using Spark SQL for basic data analysis Interactively, processing and visualizing large data is challenging as the queries can take a long time to execute and the visual interface cannot accommodate as many pixels as data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java and R APIs, and libraries for distributed statistics and machine learning. For data that fits into a single computer, there are many good tools available, such as R, MATLAB, and others. 
However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data, then this section offers some good tools and techniques for data exploration.

In this section, we will go through some basic data exploration exercises to understand a sample dataset. We will use a dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We'll use the bank-additional-full.csv file, which contains 41,188 records and 20 input fields, ordered by date (from May 2008 to November 2010). The dataset was contributed by S. Moro, P. Cortez, and P. Rita, and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use the :paste command to paste the initial set of statements into your Spark shell session (use Ctrl+D to exit paste mode). After the DataFrame has been created, we first verify the number of records. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset. (The original post shows the corresponding Spark shell code and output as screenshots.)

Identifying missing data

Missing data can occur in datasets due to reasons ranging from negligence to a refusal on the part of respondents to provide a specific data point. However, in all cases, missing data is a common occurrence in real-world datasets. Missing data can create problems in data analysis and sometimes leads to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies to deal with it. In this section, we analyze the number of records with missing data fields in our sample dataset. In order to simulate missing data, we edit our sample dataset by replacing fields containing "unknown" values with empty strings, and then create a DataFrame/Dataset from the edited file. In the next section, we will compute some basic statistics for our sample dataset to improve our understanding of the data.

Computing basic statistics

Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a Dataset containing a subset of fields from our original DataFrame. In the following example, we choose some of the numeric fields and the outcome field, that is, the "term deposit subscribed" field. Next, we use describe() to compute the count, mean, stdev, min, and max values for the numeric columns in our Dataset. The describe() command gives a quick way to sanity-check your data: for example, whether the row counts of each of the selected columns match the total number of records in the DataFrame (no null or invalid rows), whether the average and range of values for the age column match your expectations, and so on. Based on the values of the means and standard deviations, you can select certain data elements for deeper analysis. For example, assuming a normal distribution, the mean and standard deviation values for age suggest that most values of age lie in the range of 30 to 50 years; for other columns, the standard deviation values may be indicative of a skew in the data (where the standard deviation is greater than the mean).
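The book performs these steps in Scala in the Spark shell and shows them as screenshots. For readers who want something concrete to run, here is a rough PySpark sketch of the same flow. It is not the book's code; it assumes Spark 2.x, that bank-additional-full.csv sits in the working directory, and that the file uses a header row and ';' as the delimiter.

```python
# Rough PySpark equivalent of the basic EDA steps described above (hedged sketch).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bank-eda").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")   # or supply an explicit schema instead
      .option("delimiter", ";")
      .csv("bank-additional-full.csv"))

print(df.count())                      # expected: 41,188 records

# Quick sanity check on a few numeric fields plus the outcome field.
df.select("age", "duration", "campaign", "pdays", "previous", "y") \
  .describe() \
  .show()

# Identify "missing" data: count empty or "unknown" values per column.
df.select([
    F.sum(F.when(F.col(c).cast("string").isin("unknown", ""), 1).otherwise(0)).alias(c)
    for c in df.columns
]).show()
```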
Identifying data outliers

An outlier or an anomaly is an observation that deviates significantly from the other observations in the dataset. Erroneous outliers can be due to errors in data collection or to variability in measurement. They can impact the results significantly, so it is imperative to identify them during the EDA process. Many outlier-detection techniques define outliers as points that do not lie in clusters: the user has to model the data points using statistical distributions, and the outliers are identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that, during EDA, the user typically does not have enough knowledge about the underlying data distribution.

EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. For example, we can apply clustering algorithms and visualize the results to detect outliers in a combination of columns. In the following example, we use the last contact duration in seconds (duration), the number of contacts performed during this campaign for this client (campaign), the number of days that have passed since the client was last contacted in a previous campaign (pdays), and the number of contacts performed before this campaign for this client (previous) to compute two clusters in our data by applying the k-means clustering algorithm. The original post shows the corresponding Spark code as a screenshot; a rough PySpark equivalent is sketched below.

If you liked this article, please be sure to check out Learning Spark SQL, which will help you learn more useful techniques on data extraction and data analysis.
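As referenced above, here is a hedged PySpark sketch of the clustering step. It is not the book's code; it assumes the DataFrame `df` from the previous sketch and uses Spark MLlib's DataFrame-based API (pyspark.ml).

```python
# Cluster a few numeric columns with k-means to spot candidate outlier groups.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

assembler = VectorAssembler(
    inputCols=["duration", "campaign", "pdays", "previous"],
    outputCol="features")
features_df = assembler.transform(df)

kmeans = KMeans(k=2, seed=1, featuresCol="features")
model = kmeans.fit(features_df)

# Cluster sizes; unusually small clusters are candidates for closer inspection.
model.transform(features_df).groupBy("prediction").count().show()
```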

Implementing Row-level Security in PostgreSQL

Amey Varangaonkar
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering PostgreSQL 9.6, authored by Hans-Jürgen Schönig. The book gives a comprehensive primer on different features and capabilities of PostgreSQL 9.6, and how you can leverage them efficiently to administer and manage your PostgreSQL database.[/box] In this article, we discuss the concept of row-level security and how effectively it can be implemented in PostgreSQL using a interesting example. Having the row-level security feature enables allows you to store data for multiple users in a single database and table. At the same time it sets restrictions on the row-level access, based on a particular user’s role or identity. What is Row-level Security? In usual cases, a table is always shown as a whole. When the table contains 1 million rows, it is possible to retrieve 1 million rows from it. If somebody had the rights to read a table, it was all about the entire table. In many cases, this is not enough. Often it is desirable that a user is not allowed to see all the rows. Consider the following real-world example: an accountant is doing accounting work for many people. The table containing tax rates should really be visible to everybody as everybody has to pay the same rates. However, when it comes to the actual transactions, you might want to ensure that everybody is only allowed to see his or her own transactions. Person A should not be allowed to see person B's data. In addition to that, it might also make sense that the boss of a division is allowed to see all the data in his part of the company. Row-level security has been designed to do exactly this and enables you to build multi-tenant systems in a fast and simple way. The way to configure those permissions is to come up with policies. The CREATE POLICY command is here to provide you with a means to write those rules: test=# h CREATE POLICY Command: CREATE POLICY Description: define a new row level security policy for a table Syntax: CREATE POLICY name ON table_name [ FOR { ALL | SELECT | INSERT | UPDATE | DELETE } ] [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ] [ USING ( using_expression ) ] [ WITH CHECK ( check_expression ) ] To show you how a policy can be written, I will first log in as superuser and create a table containing a couple of entries: test=# CREATE TABLE t_person (gender text, name text); CREATE TABLE test=# INSERT INTO t_person VALUES ('male', 'joe'), ('male', 'paul'), ('female', 'sarah'), (NULL, 'R2- D2'); INSERT 0 4 Then access is granted to the joe role: test=# GRANT ALL ON t_person TO joe; GRANT So far, everything is pretty normal and the joe role will be able to actually read the entire table as there is no RLS in place. But what happens if row-level security is enabled for the table? test=# ALTER TABLE t_person ENABLE ROW LEVEL SECURITY; ALTER TABLE There is a deny all default policy in place, so the joe role will actually get an empty table: test=> SELECT * FROM t_person; gender | name --------+------ (0 rows) Actually, the default policy makes a lot of sense as users are forced to explicitly set permissions. 
Now that the table is under row-level security control, policies can be written (as superuser):

    test=# CREATE POLICY joe_pol_1 ON t_person
        FOR SELECT TO joe USING (gender = 'male');
    CREATE POLICY

Logging in as the joe role and selecting all the data will return just two rows:

    test=> SELECT * FROM t_person;
     gender | name
    --------+------
     male   | joe
     male   | paul
    (2 rows)

Let us inspect the policy just created in a more detailed way. The first thing you see is that a policy actually has a name. It is also connected to a table and allows for certain operations (in this case, the SELECT clause). Then comes the USING clause. It basically defines what the joe role will be allowed to see. The USING clause is therefore a mandatory filter attached to every query, so that only the rows our user is supposed to see are selected.

Now suppose that, for some reason, it has been decided that the joe role is also allowed to see robots. There are two choices to achieve our goal. The first option is to simply use the ALTER POLICY clause to change the existing policy:

    test=> \h ALTER POLICY
    Command:     ALTER POLICY
    Description: change the definition of a row level security policy
    Syntax:
    ALTER POLICY name ON table_name RENAME TO new_name
    ALTER POLICY name ON table_name
        [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ]
        [ USING ( using_expression ) ]
        [ WITH CHECK ( check_expression ) ]

The second option is to create a second policy, as shown in the next example:

    test=# CREATE POLICY joe_pol_2 ON t_person
        FOR SELECT TO joe USING (gender IS NULL);
    CREATE POLICY

The beauty is that those policies are simply connected using an OR condition. Therefore, PostgreSQL will now return three rows instead of two:

    test=> SELECT * FROM t_person;
     gender | name
    --------+-------
     male   | joe
     male   | paul
            | R2-D2
    (3 rows)

The R2-D2 row is now also included in the result, as it matches the second policy. To show you how PostgreSQL runs the query, I have included an execution plan of the query:

    test=> explain SELECT * FROM t_person;
                            QUERY PLAN
    ----------------------------------------------------------
     Seq Scan on t_person  (cost=0.00..21.00 rows=9 width=64)
       Filter: ((gender IS NULL) OR (gender = 'male'::text))
    (2 rows)

As you can see, both USING clauses have been added as mandatory filters to the query. You might have noticed in the syntax definition that there are two types of clauses:

- USING: This clause filters rows that already exist. It is relevant to SELECT and UPDATE clauses, and so on.
- CHECK: This clause filters new rows that are about to be created, so it is relevant to INSERT and UPDATE clauses, and so on.

Here is what happens if we try to insert a row:

    test=> INSERT INTO t_person VALUES ('male', 'kaarel');
    ERROR:  new row violates row-level security policy for table "t_person"

As there is no policy for the INSERT clause, the statement naturally errors out. Here is a policy that allows insertions:

    test=# CREATE POLICY joe_pol_3 ON t_person
        FOR INSERT TO joe WITH CHECK (gender IN ('male', 'female'));
    CREATE POLICY

The joe role is now allowed to add males and females to the table, which is shown in the next listing:

    test=> INSERT INTO t_person VALUES ('female', 'maria');
    INSERT 0 1

However, there is also a catch; consider the following example:

    test=> INSERT INTO t_person VALUES ('female', 'maria') RETURNING *;
    ERROR:  new row violates row-level security policy for table "t_person"

Remember, there is only a policy allowing joe to select males.
The trouble here is that the statement would return a woman, which is not allowed because the joe role is under a male-only SELECT policy. Only for men will the RETURNING * clause actually work:

    test=> INSERT INTO t_person VALUES ('male', 'max') RETURNING *;
     gender | name
    --------+------
     male   | max
    (1 row)
    INSERT 0 1

If you don't want this behavior, you have to write a policy that actually contains a proper USING clause. A short application-side illustration of these policies follows below.

If you liked our post, make sure to check out our book Mastering PostgreSQL 9.6 - a comprehensive PostgreSQL guide covering all database administration and maintenance aspects.
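As referenced above, here is a small sketch (not from the book) showing how row-level security looks from application code, using the psycopg2 driver. It assumes a local database named "test", that the joe role can log in, and that the policies created above are in place.

```python
# Hedged psycopg2 sketch: connect as the restricted "joe" role and query the
# RLS-protected table. Only rows permitted by joe's SELECT policies come back.
import psycopg2

conn = psycopg2.connect(dbname="test", user="joe", host="localhost")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT gender, name FROM t_person")
        for gender, name in cur.fetchall():
            # Expect only the rows matching joe_pol_1 / joe_pol_2.
            print(gender, name)
finally:
    conn.close()
```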

How to Install Keras on Docker and Cloud ML

Amey Varangaonkar
20 Dec 2017
3 min read
[box type="note" align="" class="" width=""]The following extract is taken from the book Deep Learning with Keras, written by Antonio Gulli and Sujit Pal. It contains useful techniques to train effective deep learning models using the highly popular Keras library.[/box] Keras is a deep learning library which can be used on the enterprise platform, by deploying it on a container. In this article, we see how to install Keras on Docker and Google’s Cloud ML. Installing Keras on Docker One of the easiest ways to get started with TensorFlow and Keras is running in a Docker container. A convenient solution is to use a predefined Docker image for deep learning created by the community that contains all the popular DL frameworks (TensorFlow, Theano, Torch, Caffe, and so on). Refer to the GitHub repository at https://github.com/saiprashanths/dl-docker for the code files. Assuming that you already have Docker up and running (for more information, refer to https://www.docker.com/products/overview), installing it is pretty simple and is shown as follows: The following screenshot, says something like, after getting the image from Git, we build the Docker image: In this following screenshot, we see how to run it: From within the container, it is possible to activate support for Jupyter Notebooks (for more information, refer to http://jupyter.org/): Access it directly from the host machine on port: It is also possible to access TensorBoard (for more information, refer to https://www.tensorflow.org/how_tos/summaries_and_tensorboard/) with the help of the command in the screenshot that follows, which is discussed in the next section: After running the preceding command, you will be redirected to the following page: Installing Keras on Google Cloud ML Installing Keras on Google Cloud is very simple. First, we can install Google Cloud (for the downloadable file, refer to https://cloud.google.com/sdk/), a command-line interface for Google Cloud Platform; then we can use CloudML, a managed service that enables us to easily build machine, learning models with TensorFlow. Before using Keras, let's use Google Cloud with TensorFlow to train an MNIST example available on GitHub. The code is local and training happens in the cloud: In the following screenshot, you can see how to run a training session: We can use TensorBoard to show how cross-entropy decreases across iterations: In the next screenshot, we see the graph of cross-entropy: Now, if we want to use Keras on the top of TensorFlow, we simply download the Keras source from PyPI (for the downloadable file, refer to https://pypi.python.org/pypi/Keras/1.2.0 or later versions) and then directly use Keras as a CloudML package solution, as in the following example: Here, trainer.task2.py is an example script: from keras.applications.vgg16 import VGG16 from keras.models import Model from keras.preprocessing import image from keras.applications.vgg16 import preprocess_input import numpy as np # pre-built and pre-trained deep learning VGG16 model base_model = VGG16(weights='imagenet', include_top=True) for i, layer in enumerate(base_model.layers): print (i, layer.name, layer.output_shape) Thus we saw, how fairly easy it is to set up and run Keras on a Docker container and Cloud ML. If this article interested you, make sure to check out our book Deep Learning with Keras, where you can learn to install Keras on other popular platforms such as Amazon Web Services and Microsoft Azure.  

9 Useful R Packages for NLP & Text Mining

Amey Varangaonkar
18 Dec 2017
6 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering Text Mining with R, co-authored by Ashish Kumar and Avinash Paul. This book lists various techniques to extract useful and high-quality information from your textual data.[/box] There is a wide range of packages available in R for natural language processing and text mining. In the article below, we present some of the popular and widely used R packages for NLP: OpenNLP OpenNLP is an R package which provides an interface, Apache OpenNLP, which is a  machine-learning-based toolkit written in Java for natural language processing activities. Apache OpenNLP is widely used for most common tasks in NLP, such as tokenization, POS tagging, named entity recognition (NER), chunking, parsing, and so on. It provides wrappers for Maxent entropy models using the Maxent Java package. It provides functions for sentence annotation, word annotation, POS tag annotation, and annotation parsing using the Apache OpenNLP chunking parser. The Maxent Chunk annotator function computes the chunk annotation using the Maxent chunker provided by OpenNLP. The Maxent entity annotator function in R package utilizes the Apache OpenNLP Maxent name finder for entity annotation. Model files can be downloaded from http://opennlp.sourceforge.net/models-1.5/. These language models can be effectively used in R packages by installing the OpenNLPmodels.language package from the repository at http://datacube.wu.ac.at. Get the OpenNLP package here. Rweka The RWeka package in R provides an interface to Weka. Weka is an open source software developed by a machine learning group at the University of Wakaito, which provides a wide range of machine learning algorithms which can either be directly applied to a dataset or it can be called from a Java code. Different data-mining activities, such as data processing, supervised and unsupervised learning, association mining, and so on, can be performed using the RWeka package. For natural language processing, RWeka provides tokenization and stemming functions. RWeka packages provide an interface to Alphabetic, NGramTokenizers, and wordTokenizer functions, which can efficiently perform tokenization for contiguous alphabetic sequence, string-split to n-grams, or simple word tokenization, respectively. Get started with Rweka here. RcmdrPlugin.temis The RcmdrPlugin.temis package in R provides a graphical integrated text-mining solution. This package can be leveraged for many text-mining tasks, such as importing and cleaning a corpus, terms and documents count, term co-occurrences, correspondence analysis, and so on. Corpora can be imported from different sources and analysed using the importCorpusDlg function. The package provides flexible data source options to import corpora from different sources, such as text files, spreadsheet files, XML, HTML files, Alceste format and Twitter search. The Import function in this package processes the corpus and generates a term-document matrix. The package provides different functions to summarize and visualize the corpus statistics. Correspondence analysis and hierarchical clustering can be performed on the corpus. The corpusDissimilarity function helps analyse and create a crossdissimilarity table between term-documents present in the corpus. This package provides many functions to help the users explore the corpus. 
For example, frequentTerms lists the most frequent terms of a corpus, specificTerms lists the terms most associated with each document, and subsetCorpusByTermsDlg creates a subset of the corpus. Term frequency, term co-occurrence, term dictionaries, the temporal evolution of occurrences (term time series), term metadata variables, and corpus temporal evolution are among the other very useful functions available in this package for text mining. Download the package from its CRAN page.

tm

The tm package is a text-mining framework which provides some powerful functions that aid in text-processing steps. It has methods for importing data, handling corpora, metadata management, creation of term-document matrices, and preprocessing. For managing documents with the tm package, we create a corpus, which is a collection of text documents. There are two types of implementation: the volatile corpus (VCorpus) and the permanent corpus (PCorpus). A VCorpus is held completely in memory, and when the R object is destroyed the corpus is gone. A PCorpus is stored on the filesystem and is present even after the R object is destroyed; these corpora can be created using the VCorpus and PCorpus functions, respectively. The package provides a few predefined sources which can be used to import text, such as DirSource, VectorSource, or DataframeSource. The getSources method lists the available sources, and users can create their own sources. The tm package ships with several reader options: readPlain, readPDF, and readDOC. We can execute the getReaders method for an up-to-date list of available readers. To write a corpus to the filesystem, we can use writeCorpus. For inspecting a corpus, there are methods such as inspect and print. For transformations of the text, such as stop-word removal, stemming, whitespace removal, and so on, we can use the tm_map, content_transformer, tolower, and stopwords("english") functions. For metadata management, meta comes in handy. The tm package also provides various quantitative functions for text analysis, such as DocumentTermMatrix, findFreqTerms, findAssocs, and removeSparseTerms. Download the tm package here.

languageR

languageR provides data sets and functions for statistical analysis of text data. This package contains functions for vocabulary richness, vocabulary growth, frequency spectra, mixed-effects models, and so on. Simulation functions are available for simple regression, quasi-F factor, and Latin-square designs. Apart from that, this package can also be used for correlation, collinearity diagnostics, diagnostic visualization of logistic models, and so on.

koRpus

The koRpus package is a versatile tool for text mining which implements many functions for text readability and lexical variation. Apart from that, it can also be used for basic-level functions such as tokenization and POS tagging. You can find more information about its current version and dependencies here.

RKEA

The RKEA package provides an interface to KEA, a tool for keyword extraction from texts. RKEA requires a keyword extraction model, which can be created by manually indexing a small set of texts; using this model, it extracts keywords from documents.

maxent

The maxent package in R provides tools for a low-memory implementation of multinomial logistic regression, also called the maximum entropy model. This package is quite helpful for classification processes involving sparse term-document matrices, and for keeping memory consumption low on huge datasets. Download and get started with maxent.
lsa

Truncated singular value decomposition can help overcome the variability in a term-document matrix by deriving latent features statistically. The lsa package in R provides an implementation of latent semantic analysis.

The ease of use and efficiency of R packages can be very handy when carrying out even the trickiest of text mining tasks. As a result, they have grown to become very popular in the community. If you found this post useful, you should definitely refer to our book Mastering Text Mining with R. It will give you ample techniques for effective text mining and analytics using the above mentioned packages.


NIPS 2017 Special: Decoding the Human Brain for Artificial Intelligence to make smarter decisions

Amarabha Banerjee
18 Dec 2017
6 min read
Yael Niv has been an Associate Professor of Psychology at the Princeton Neuroscience Institute since 2007. Her preferred areas of research include human and animal reinforcement learning and decision making. At her Niv lab, she studies the day-to-day processes that animals and humans use to learn by trial and error, without explicit instructions, in order to predict future events and to act on the current environment so as to maximize reward and minimize damage. Our article aims to deliver the key points from Yael Niv's keynote presentation at NIPS 2017. She talks about how artificial intelligence systems can perform simple human-like tasks effectively by using state representations like those in the human brain. The talk also deconstructs the complex human decision-making process. Further, we explore how the human brain breaks down complex procedures into simple states and how these states determine our decision-making capabilities. This, in turn, gives valuable insights into the design and architecture of smart AI systems with decision-making capabilities.

Staying Simple is Complex

What do you think happens when a human being crosses a road, especially when it's a busy street and you constantly need to keep an eye on multiple checkpoints in order to be safe and sound? The answer is somewhat counterintuitive: the human brain breaks the complex process down into multiple simple blocks. These blocks can be termed states, and these states then determine decisions such as when to cross the road or at what speed to cross it. In other words, a state can be anything, from an estimate of the incoming traffic density to a running calculation of your walking speed. These states help the brain ignore other spurious or latent tasks in order to complete the priority task at hand. Hence, the computational power of the brain is optimized. The human brain possesses the capability to focus on the most important task at hand and then break it down into multiple simple tasks. The design of smarter AI systems with complex decision-making capabilities can take inspiration from this.

The Practical Human Experiment

To observe how the human brain behaves when urged to draw complex decisions, a few experiments were performed. The primary objective of these experiments was to verify the hypothesis that decision-making information in the human brain is stored in a part of the frontal lobe called the orbitofrontal cortex. The two experiments are described briefly below.

Experiment 1

The participants were given sets of circles at random and were asked to guess the number of circles in the cluster within two minutes. After they guessed the first time, the experimenter disclosed the correct number of circles. The subjects were then given clusters of circles in two different colors (red and yellow) and asked to repeat the guessing activity for each cluster. However, the experimenter never disclosed that they would be given differently colored clusters next.

Observation: The most important observation derived from the experiment was that after the subjects knew the correct count, their guesses revolved around that number, irrespective of whether that count mattered for the next set of circle clusters given. In fact, the count had actually changed between the two colored specimens given to them.
The important factor here is that the participants were never told that color would be a parameter determining the number of circles in each set, and yet it played a huge part in their guesses. Color acted as a latent factor: it sat in the subconscious of the participants rather than being a direct, stated parameter, and it was never on the list of parameters that were supposed to determine the number of circles. Still, it played an important part in changing the overall count, which was significantly higher for the red cluster than for the yellow cluster. Hence, the experiment supported the hypothesis that latent factors are an integral part of intelligent decision-making capabilities in human beings.

Experiment 2

The second experiment was performed to test the hypothesis that the orbitofrontal cortex contains the data that helps the human brain make complex decisions. For this, brain activity was monitored using MRI during the decision-making process. In this experiment, the subjects were given a straight line and a dot. They were then asked to predict the next line from the dot, both in terms of its direction and its length. After completing this process a given number of times, the participants were asked to remember the length and direction of the first line. There was a minor difference between the sets of lines and dots: one group saw a gradual change in line length and direction, and another group saw a drastic change in the middle.

Observation: The results showed that the group with a gradual change in line length and direction preserved the initial information better, while the group with the drastic change was less accurate. The MRI reports showed signs that the classification information was primarily stored in the orbitofrontal cortex. Hence it is considered one of the most important parts of the human decision-making process.

Shallow Learning with Deep Representations

The decision-making capabilities and the effect of the latent factors involved in them form the basis of dormant memory in humans. An experiment on rats was performed to explain this phenomenon. In the experiment, four rats were given electric shocks accompanied by a particular sound for a day or two. On the third day, they reacted to the sound even without being given electric shocks. Ivan Pavlov coined the term classical conditioning for this phenomenon, wherein a relatively permanent change in behavior results from experience or continuous practice. Such instances of conditioning can be deeply damaging, for example in the case of PTSD (post-traumatic stress disorder) patients and other trauma victims. In order to understand how state representations are stored in memory, the reversal mechanism, that is, how to reverse the process, also needs to be understood. For that, three techniques were tested on these rats:

- The rats were not given any shock but were subjected to the sound
- The rats were given shocks accompanied by sound at regular intervals, and sounds without shocks
- The shocks were slowly reduced in number but the sound continued

The best results in reversing the memory were observed with the third technique, which is known as gradual extinction. In this way, a simple reinforcement learning mechanism is shown to be very effective because it helps in creating simple states that can be managed efficiently and trained easily.
Along with this, if we can extract information from brain imaging data derived from the orbitofrontal cortex, these simple representational states can shed a lot of light on how to make complex computational processes simpler and enable us to build smarter AI systems for a better future.


Deep Learning Algorithms: How to classify Irises using multi-layer perceptrons

Sunith Shetty
16 Dec 2017
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Big Data Analytics with Java by Rajat Mehta. Java is the de facto language for major big data environments like Hadoop, MapReduce etc. This book will teach you how to perform analytics on big data with production-friendly Java.[/box] In the post below, we help you learn how to classify flower species from the Iris dataset using multi-layer perceptrons. Code files are available for download towards the end of the post.

Flower species classification using multi-layer perceptrons

This is a simple hello world-style program for performing classification using multi-layer perceptrons. For this, we will be using the famous Iris dataset, which can be downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Iris. This dataset has four feature attributes and a class label, shown as follows:

- Petal Length: petal length in cm
- Petal Width: petal width in cm
- Sepal Length: sepal length in cm
- Sepal Width: sepal width in cm
- Class: the type of iris flower, that is, Iris Setosa, Iris Versicolour, or Iris Virginica

This is a simple dataset with three types of Iris classes, as mentioned in the table. From the perspective of our neural network of perceptrons, we will be using the multi-layer perceptron algorithm bundled inside the Spark ML library and will demonstrate how you can combine it with the Spark-provided pipeline API for easy manipulation of the machine learning workflow. We will also split our dataset into training and testing bundles so as to separately train our model on the training set and finally test its accuracy on the test set. Let's now jump into the code of this simple example.

First, create the Spark configuration object. In our case, we also mention that the master is local as we are running it on our local machine:

SparkConf sc = new SparkConf().setMaster("local[*]");

Next, build the SparkSession with this configuration and provide the name of the application; in our case, it is JavaMultilayerPerceptronClassifierExample:

SparkSession spark = SparkSession
  .builder()
  .config(sc)
  .appName("JavaMultilayerPerceptronClassifierExample")
  .getOrCreate();

Next, provide the location of the iris dataset file:

String path = "data/iris.csv";

Now load this dataset file into a Spark dataset object. As the file is in CSV format, we also specify the format of the file while reading it using the SparkSession object:

Dataset<Row> dataFrame1 = spark.read().format("csv").load(path);

After loading the data from the file into the dataset object, let's now extract this data from the dataset and put it into a Java class, IrisVO. This IrisVO class is a plain POJO and has the attributes to store the data point types, as shown:

public class IrisVO {
  private Double sepalLength;
  private Double petalLength;
  private Double petalWidth;
  private Double sepalWidth;
  private String labelString;

On the dataset object dataFrame1, we invoke the toJavaRDD method to convert it into an RDD object and then invoke the map function on it. The map function is linked to a lambda function, shown after the short POJO sketch below. In the lambda function, we go over each row of the dataset, pull the data items from it, and fill the IrisVO POJO object before finally returning it from the lambda function.
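For reference, here is a minimal sketch of what the rest of the IrisVO POJO might look like. The setter names match the calls made in the mapping lambda that follows; the getters and the Serializable marker are assumptions following standard JavaBean conventions and are not shown in the original excerpt.

// Hedged sketch of a possible completion of the IrisVO POJO.
public class IrisVO implements java.io.Serializable {
  private Double sepalLength;
  private Double sepalWidth;
  private Double petalLength;
  private Double petalWidth;
  private String labelString;

  public Double getSepalLength() { return sepalLength; }
  public void setSepalLength(Double sepalLength) { this.sepalLength = sepalLength; }
  public Double getSepalWidth() { return sepalWidth; }
  public void setSepalWidth(Double sepalWidth) { this.sepalWidth = sepalWidth; }
  public Double getPetalLength() { return petalLength; }
  public void setPetalLength(Double petalLength) { this.petalLength = petalLength; }
  public Double getPetalWidth() { return petalWidth; }
  public void setPetalWidth(Double petalWidth) { this.petalWidth = petalWidth; }
  public String getLabelString() { return labelString; }
  public void setLabelString(String labelString) { this.labelString = labelString; }
}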
This way, we get a dataMap RDD object filled with IrisVO objects:

JavaRDD<IrisVO> dataMap = dataFrame1.toJavaRDD().map( r -> {
  IrisVO irisVO = new IrisVO();
  irisVO.setLabelString(r.getString(5));
  irisVO.setPetalLength(Double.parseDouble(r.getString(3)));
  irisVO.setSepalLength(Double.parseDouble(r.getString(1)));
  irisVO.setPetalWidth(Double.parseDouble(r.getString(4)));
  irisVO.setSepalWidth(Double.parseDouble(r.getString(2)));
  return irisVO;
});

As we are using the latest Spark ML library for applying our machine learning algorithms, we need to convert this RDD back to a dataset. In this case, however, this dataset has a schema for the individual data points, as we mapped them to the IrisVO object attribute types earlier:

Dataset<Row> dataFrame = spark.createDataFrame(dataMap.rdd(), IrisVO.class);

We will now split the dataset into two portions: one for training our multi-layer perceptron model and one for testing its accuracy later. For this, we use the prebuilt randomSplit method available on the dataset object and provide its parameters. We keep 70 percent for training and 30 percent for testing. The last entry is the 'seed' value supplied to the randomSplit method.

Dataset<Row>[] splits = dataFrame.randomSplit(new double[]{0.7, 0.3}, 1234L);

Next, we extract the splits into individual datasets for training and testing:

Dataset<Row> train = splits[0];
Dataset<Row> test = splits[1];

Until now, the code has been pretty much generic across most Spark machine learning implementations. Now we will get into the code that is specific to our multi-layer perceptron model. We will create an int array that contains the count of neurons needed at each layer of our model:

int[] layers = new int[] {4, 5, 4, 3};

Let's now look at what each entry of this int array means:

- Index 0: the number of neurons or perceptrons at the input layer of the network. This is the count of the number of features that are passed to the model.
- Index 1: a hidden layer containing five perceptrons (sigmoid neurons only, ignore the terminology).
- Index 2: another hidden layer containing four sigmoid neurons.
- Index 3: the number of neurons representing the output label classes. In our case, we have three types of Iris flowers, hence three classes.

After creating the layers for the neural network and specifying the number of neurons in each layer, next build a StringIndexer instance. Since our models are mathematical and look for numerical inputs for their computations, we have to convert our string classification labels (that is, Iris Setosa, Iris Versicolour, and Iris Virginica) into numbers. To do this, we use the StringIndexer class that is provided by Apache Spark. In the instance of this class, we specify the column from which to read the label and the column where it will output the numerical representation of that label:

StringIndexer labelIndexer = new StringIndexer().setInputCol("labelString").setOutputCol("label");

Now we build the features array. These are the features that we use when training our model:

String[] featuresArr = {"sepalLength","sepalWidth","petalLength","petalWidth"};

Next, we build a features vector, as this needs to be fed to our model. To put the features in vector form, we use the VectorAssembler class from the Spark ML library.
We provide the features array as input and the output column where the assembled vector will be written:

VectorAssembler va = new VectorAssembler().setInputCols(featuresArr).setOutputCol("features");

Now we build the multi-layer perceptron model that is bundled within the Spark ML library. To this model we supply the array of layers we created earlier. This layer array has the number of (sigmoid) neurons needed in each layer of the multi-layer perceptron network:

MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(25);

The other parameters being passed to this multi-layer perceptron model are:

- Block size: block size for putting input data in matrices for faster computation. The default value is 128.
- Seed: seed for weight initialization if weights are not set.
- Maximum iterations: maximum number of iterations to be performed on the dataset while learning. The default value is 100.

Finally, we hook all the workflow pieces together using the pipeline API. To this pipeline API, we pass the different pieces of the workflow, that is, the label indexer and the vector assembler, and finally provide the model:

Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {labelIndexer, va, trainer});

Once our pipeline object is ready, we fit the pipeline on the training dataset to train our model on the underlying training data:

PipelineModel model = pipeline.fit(train);

Once the model is trained, it is ready to be run on the test data to figure out its predictions. For this, we invoke the transform method on our model and store the result in a Dataset object:

Dataset<Row> result = model.transform(test);

Let's see the first few lines of this result by invoking the show method on it:

result.show();

This prints the first few lines of the result dataset (the screenshot is not reproduced here); the last column depicts the predictions made by our model. After making the predictions, let's now check the accuracy of our model. For this, we will first select the two columns in our result which represent the predicted label and the actual label (recall that the actual label is the output of our StringIndexer):

Dataset<Row> predictionAndLabels = result.select("prediction", "label");

Finally, we will use a standard class called MulticlassClassificationEvaluator, which is provided by Spark for checking the accuracy of models. We create an instance of this class and set the name of the metric, that is, accuracy, for which we want to get the value from our predicted results:

MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy");

Next, using this evaluator instance, we invoke the evaluate method and pass it the dataset that contains the columns for the actual and predicted results (in our case, the predictionAndLabels dataset):

System.out.println("Test set accuracy = " + evaluator.evaluate(predictionAndLabels));

This prints the test set accuracy; expressed as a percentage, it means that our model is about 95% accurate. This is the beauty of neural networks - they can give us very high accuracy when tweaked properly. With this, we come to the end of our small hello world-type program on multi-layer perceptrons. Unfortunately, Spark support for neural networks and deep learning is not extensive; at least not until now.
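Before the summary, one small optional extension not covered in the excerpt: the same evaluator class can report other metrics on the same predictions. A hedged sketch:

// Hypothetical follow-up: reuse MulticlassClassificationEvaluator for the F1 score.
// By default it reads the "prediction" and "label" columns, which the
// predictionAndLabels dataset above already contains.
MulticlassClassificationEvaluator f1Evaluator =
    new MulticlassClassificationEvaluator().setMetricName("f1");
System.out.println("Test set F1 = " + f1Evaluator.evaluate(predictionAndLabels));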
To summarize, we covered a sample case study for the classification of Iris flower species based on the features that were used to train our neural network. If you are keen to know more about real-time analytics using deep learning methodologies such as neural networks and multi-layer perceptrons, you can refer to the book Big Data Analytics with Java. [box type="download" align="" class="" width=""]Download Code files[/box]      


NIPS 2017 Special: A deep dive into Deep Bayesian and Bayesian Deep Learning with Yee Whye Teh

Savia Lobo
15 Dec 2017
8 min read
Yee Whye Teh is a professor at the Department of Statistics of the University of Oxford and also a research scientist at DeepMind. He works on statistical machine learning, focusing on Bayesian nonparametrics, probabilistic learning, and deep learning. This article aims to bring our readers the key points of Yee's keynote speech at NIPS 2017. Yee's keynote ponders the interface between two perspectives on machine learning, Bayesian learning and deep learning, by exploring questions like: How can probabilistic thinking help us understand deep learning methods or lead us to interesting new methods? Conversely, how can deep learning technologies help us develop advanced probabilistic methods? For a more comprehensive and in-depth understanding of this novel approach, be sure to watch the complete keynote address by Yee Whye Teh on the NIPS Facebook page. All images in this article come from Yee's presentation slides and do not belong to us.

The history of machine learning has shown a growth in both model complexity and model flexibility. Theory-led models have started to lose their shine. This is because machine learning is at the forefront of a revolution that could be called the data revolution, driven by data-led models. As opposed to theory-led models, data-led models try not to impose too many assumptions on the processes that have to be modeled; they are instead super-flexible non-parametric models that can capture the complexities, but they require large amounts of data to operate.

On the model flexibility side, various approaches have been explored over the years: kernel methods, Gaussian processes, Bayesian nonparametrics, and now deep learning as well. The community has also developed ever more complex frameworks, both graphical and programmatic, to compose large complex models from simpler building blocks. In the 90s we had graphical models, later we had probabilistic programming systems, followed by deep learning systems like TensorFlow, Theano, and Torch. A recent addition is probabilistic Torch, which brings together ideas from both probabilistic Bayesian learning and deep learning. On one hand we have Bayesian learning, which treats learning as inference in a probabilistic model. On the other hand we have deep learning models, which view learning as the optimization of functions parameterized by neural networks. In recent years there has been an explosion of exciting research at the interface of these two popular approaches, resulting in increasingly complex and exciting models.

What is the Bayesian theory of learning?

Bayesian learning describes an ideal learner as one who interacts with the world in order to know its state, which is given by θ. The learner makes some observations about the world and reasons about them through a model in the Bayesian sense. This model is a joint distribution over both the unknown state of the world θ and the observation about the world x. The model consists of a prior distribution over θ and a conditional distribution of x given θ; combining them gives the reverse conditional distribution, also known as the posterior, which describes the totality of the agent's knowledge about the world after seeing x. This posterior can also be used for predicting future observations and acting accordingly.

Issues associated with Bayesian learning

1. Rigidity:
- Learning can be wrong if the model is wrong
- Not all prior knowledge can be encoded as a joint distribution
- Simple analytic forms are limiting for conditional distributions
2. Scalability: it is intractable to compute the posterior exactly, so approximations have to be made, which introduces trade-offs between efficiency and accuracy. As a result, it is often assumed that Bayesian techniques are not scalable.

To address these issues, the speaker highlights some of his recent projects which showcase scenarios where deep learning ideas are applied to Bayesian models (deep Bayesian learning) or, in reverse, where Bayesian ideas are applied to neural networks (Bayesian deep learning).

Deep Bayesian learning: deep learning assists Bayesian learning

Deep learning can improve Bayesian learning in the following ways:

- Improving modeling flexibility by using neural networks in the construction of Bayesian models
- Improving the inference and scalability of these methods by parameterizing the posterior using neural networks
- Amortizing inference over multiple runs

These can be seen in the following projects showcased by Yee: Concrete VAEs (Variational Autoencoders) and FIVO (Filtered Variational Objectives).

Concrete VAEs

What are VAEs? All the qualities mentioned above, that is, improving modeling flexibility, improving inference and scalability, and amortizing inference over multiple runs by using neural networks, can be seen in a class of deep generative models known as VAEs (Variational Autoencoders).

Fig: Variational Autoencoders

VAEs include latent variables that describe the contents of a scene, that is, objects and pose. The relationship between these latent variables and the pixels has to be highly complex and nonlinear. In short, VAEs use neural networks to parameterize both the generative model and the variational posterior distribution, which allows for much more flexible modeling. The key that makes VAEs work is the reparameterization trick.

Fig: Adding reparameterization to VAEs

The reparameterization trick is crucial for the continuous latent variables in VAEs. But many models naturally include discrete latent variables. Yee suggests applying the reparameterization trick to the discrete latent variables as a workaround. This brings us to the concept of Concrete VAEs: CONtinuous relaxations of disCRETE distributions. The density of this relaxed distribution can also be calculated, and the Concrete distribution provides a reparameterization trick for discrete variables, which helps in calculating the KL divergence that is needed for variational inference.

FIVO: Filtered Variational Objectives

FIVO extends VAEs towards models for sequential and time series data. It is built upon another extension of VAEs known as the Importance Weighted Autoencoder, a generative model with a structure similar to that of the VAE, but which uses a strictly tighter log-likelihood lower bound. The talk walks through the variational lower bound, its rederivation from importance sampling, and why it is better to use multiple samples (the slides are not reproduced here). Using Importance Weighted Autoencoders we can draw multiple samples, with which we get a tighter lower bound, and optimizing this tighter bound should lead to better learning. Let's have a look at the FIVO objectives: we can use any unbiased estimator of the marginal probability p(X); the tightness of the bound is related to the variance of the estimator; and for sequential models we can use particle filters, which produce unbiased estimators of the marginal probability and can also have much lower variance than importance samplers.

Bayesian deep learning: the Bayesian approach to deep learning gives us counterintuitive and surprising ways to make deep learning scalable. In order to explore the potential of Bayesian learning with deep neural networks, Yee introduced a project named the Posterior Server.
The Posterior Server

The Posterior Server is a distributed server for deep learning. It makes use of the Bayesian approach in order to make neural networks highly scalable. This project focuses on distributed learning, where both the data and the computations can be spread across the network. The figure above shows a bunch of workers, each of which communicates with the parameter server, which effectively maintains the authoritative copy of the parameters of the network. At each iteration, each worker obtains the latest copy of the parameters from the server, computes a gradient update based on its data, and sends it back to the server, which then applies it to the authoritative copy. Communication over the network tends to be slower than the computations that can be done locally, hence one might consider taking multiple gradient steps in each iteration before sending the accumulated update back to the parameter server. The problem is that the parameters on the worker quickly get out of sync with the authoritative copy on the parameter server. As a result, this leads to stale updates, which introduce noise into the system, and we often need frequent synchronizations across the network for the algorithm to learn in a stable fashion.

The main idea here, in the Bayesian context, is that we don't just want a single parameter vector, we want a whole distribution over parameters. This relaxes the need for frequent synchronizations across the network and hopefully leads to algorithms that are robust to less frequent communication. Each worker simply constructs its own tractable approximation to its own likelihood function and sends this information to the posterior server, which then combines these approximations together to form the full posterior, or an approximation of it. Further, the approximations that are constructed are based on the statistics of a sampling algorithm that runs locally on that worker. The actual algorithm combines variational algorithms (stochastic gradient EP) with Markov chain Monte Carlo on the workers themselves: the variational part handles the communication across the network, whereas the MCMC part handles the sampling from the local posterior to construct the statistics that the variational part needs. For scalability, a stochastic gradient Langevin algorithm is used, a simple generalization of SGD that includes additional injected noise so as to sample from the posterior.

To experiment with this server, densely connected neural networks with 500 ReLU units were trained on the MNIST dataset. You can get a detailed understanding of these examples in the keynote video.

This interface between Bayesian learning and deep learning is a very exciting frontier. Researchers have brought the management of uncertainty into deep learning, and flexibility and scalability into Bayesian modeling. Yee concludes with two questions for the audience to think about: Does being Bayesian in the space of functions make more sense than being Bayesian in the space of parameters? And how do we deal with uncertainty under model misspecification?

How Google's MapReduce works and why it matters for Big Data projects

Sugandha Lahoti
15 Dec 2017
7 min read
[box type="note" align="" class="" width=""]The article given below is a book extract from Java Data Analysis written by John R. Hubbard. The book will give you the most out of popular Java libraries and tools to perform efficient data analysis.[/box] In this article, we will explore Google's MapReduce framework to analyze big data.

How do you quickly sort a list of a billion elements? Or multiply two matrices, each with a million rows and a million columns? In implementing their PageRank algorithm, Google quickly discovered the need for a systematic framework for processing massive datasets. That could be done only by distributing the data and the processing over many storage units and processors. Implementing a single algorithm, such as PageRank, in that environment is difficult, and maintaining the implementation as the dataset grows is even more challenging.

The solution: the MapReduce framework

The answer lay in separating the software into two levels: a framework that manages the big data access and parallel processing at a lower level, and a couple of user-written methods at an upper level. The independent user who writes the two methods need not be concerned with the details of the big data management at the lower level.

How does it function?

Specifically, the data flows through a sequence of stages:

1. The input stage divides the input into chunks, usually 64 MB or 128 MB.
2. The mapping stage applies a user-defined map() function that generates, from one key-value pair, a larger collection of key-value pairs of a different type.
3. The partition/grouping stage applies hash sharding to those keys to group them.
4. The reduction stage applies a user-defined reduce() function to apply some specific algorithm to the data in the value of each key-value pair.
5. The output stage writes the output from the reduce() method.

The user's choice of map() and reduce() methods determines the outcome of the entire process; hence the name MapReduce. This idea is a variation on the old algorithmic paradigm called divide and conquer. Think of the prototypical mergesort, where an array is sorted by repeatedly dividing it into two halves until the pieces have only one element, and then systematically merging them pairwise back together. MapReduce is actually a meta-algorithm, a framework within which specific algorithms can be implemented through its map() and reduce() methods. Extremely powerful, it has been used to sort a petabyte of data in only a few hours. Recall that a petabyte is 1000^5 = 10^15 bytes, which is a thousand terabytes or a million gigabytes.

Some examples of MapReduce applications

Here are a few examples of big data problems that can be solved with the MapReduce framework:

- Given a repository of text files, find the frequency of each word. This is called the WordCount problem.
- Given a repository of text files, find the number of words of each word length.
- Given two matrices in a sparse matrix format, compute their product.
- Factor a matrix given in sparse matrix format.
- Given a symmetric graph whose nodes represent people and edges represent friendship, compile a list of common friends.
- Given a symmetric graph whose nodes represent people and edges represent friendship, compute the average number of friends by age.
- Given a repository of weather records, find the annual global minima and maxima by year.
- Sort a large list. Note that in most implementations of the MapReduce framework, this problem is trivial, because the framework automatically sorts the output from the map() function.
- Reverse a graph.
- Find a minimal spanning tree (MST) of a given weighted graph.
- Join two large relational database tables.

The WordCount example

In this section, we present the MapReduce solution to the WordCount problem, sometimes called the Hello World example for MapReduce. The diagram in the figure below shows the data flow for the WordCount program. On the left are two of the 80 files that are read into the program.

During the mapping stage, each word, followed by the number 1, is copied into a temporary file, one pair per line. Notice that many words are duplicated many times. For example, if image appears five times among the 80 files (including both files shown), then the string image 1 will appear five times in the temporary file. Each of the input files has about 110 words, so over 8,000 word-number pairs will be written to the temporary file. Note that this figure shows only a very small part of the data involved. The output from the mapping stage includes every word that is input, as many times as it appears. The output from the grouping stage includes every one of those words, but without duplication.

The grouping process reads all the words from the temporary file into a key-value hash table, where the key is the word and the value is a string of 1s, one for each occurrence of that word in the temporary file. Notice that these 1s written to the temporary file are not used. They are included simply because the MapReduce framework in general expects the map() function to generate key-value pairs. The reducing stage transcribes the contents of the hash table to an output file, replacing each string of 1s with the count of them. For example, the key-value pair ("book", "1 1 1 1") is written as book 4 in the output file.

Keep in mind that this is a toy example of the MapReduce process. The input consists of 80 text files containing about 9073 words. So, the temporary file has 9073 lines, with one word per line. Only 2149 of those words are distinct, so the hash table has 2149 entries and the output file has 2149 lines, with one word per line.

The main idea

So, this is the main idea of the MapReduce meta-algorithm: provide a framework for processing massive datasets, a framework that allows the independent programmer to plug in specialized map() and reduce() methods that actually implement the required particular algorithm. If that particular algorithm is to count words, then write the map() method to extract each individual word from a specified file and write the key-value pair (word, 1) to wherever the specified writer will put them, and write the reduce() method to take a key-value pair such as (word, 1 1 1 1) and return the corresponding key-value pair as (word, 4) to wherever its specified writer will put it. These two methods are completely localized: they simply operate on key-value pairs. And they are completely independent of the size of the dataset. A minimal sketch of such a map()/reduce() pair appears at the end of this article.

The diagram below illustrates the general flow of data through an application of the MapReduce framework. The original dataset could be in various forms and locations: a few files in a local directory, a large collection of files distributed over several nodes on the same cluster, a database on a database system (relational or NoSQL), or data sources available on the World Wide Web. The MapReduce controller then carries out these five tasks:

1. Split the data into smaller datasets, each of which can be easily accessed on a single machine.
2. Simultaneously (that is, in parallel), run a copy of the user-supplied map() method, one on each dataset, producing a set of key-value pairs in a temporary file on that local machine.
3. Redistribute the datasets among the machines, so that all instances of each key are in the same dataset. This is typically done by hashing the keys.
4. Simultaneously (in parallel), run a copy of the user-supplied reduce() method, one on each of the temporary files, producing one output file on each machine.
5. Combine the output files into a single result. If the reduce() method also sorts its output, then this last step could also include merging those outputs.

The genius of the MapReduce framework is that it separates the data management (moving, partitioning, grouping, sorting, and so on) from the data crunching (counting, averaging, maximizing, and so on). The former is done with no attention required by the user. The latter is done in parallel, separately on each node, by invoking the two user-supplied methods map() and reduce(). Essentially, the only obligation of the user is to devise the correct implementations of these two methods that will solve the given problem.

As we mentioned earlier, these examples are presented mainly to elucidate how the MapReduce algorithm works. Real-world implementations would, however, use MongoDB or Hadoop frameworks. If you enjoyed this excerpt, check out the book Java Data Analysis to get an understanding of the various data analysis techniques, and how to implement them using Java.
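As promised above, here is a minimal, framework-agnostic Java sketch of the WordCount map()/reduce() pair. The class name, method signatures, and the tiny in-memory driver are illustrative assumptions for this article only; they are neither the book's framework nor Hadoop's API.

import java.util.*;

// Hedged sketch of the WordCount example described above.
public class WordCountSketch {

    // Mapping stage: emit (word, 1) for every word in a chunk of text.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reducing stage: given a word and all of its 1s, emit (word, count).
    static Map.Entry<String, Integer> reduce(String word, List<Integer> ones) {
        int count = 0;
        for (int one : ones) {
            count += one;
        }
        return new AbstractMap.SimpleEntry<>(word, count);
    }

    public static void main(String[] args) {
        // Tiny driver that plays the role of the grouping stage.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : map("the book the cat the book")) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> out = reduce(e.getKey(), e.getValue());
            System.out.println(out.getKey() + " " + out.getValue());  // e.g. "book 2"
        }
    }
}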


Getting started with big data analysis using Google's PageRank algorithm

Sugandha Lahoti
14 Dec 2017
8 min read
[box type="note" align="" class="" width=""]The article given below is a book excerpt from Java Data Analysis written by John R. Hubbard. Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the aim of discovering useful information. Java is one of the most popular languages to perform your data analysis tasks. This book will help you learn the tools and techniques in Java to conduct data analysis without any hassle. [/box] This post aims to help you learn how to analyse big data using Google's PageRank algorithm.

The term big data generally refers to algorithms for the storage, retrieval, and analysis of massive datasets that are too large to be managed by a single file server. Commercially, these algorithms were pioneered by Google; Google's PageRank, considered in this article, is one of them.

The Google PageRank algorithm

Within a few years of the birth of the web in 1990, there were over a dozen search engines that users could use to search for information. Shortly after it was introduced in 1995, AltaVista became the most popular among them. These search engines would categorize web pages according to the topics that the pages themselves specified. But the problem with these early search engines was that unscrupulous web page writers used deceptive techniques to attract traffic to their pages. For example, a local rug-cleaning service might list "pizza" as a topic in their web page header, just to attract people looking to order a pizza for dinner. These and other tricks rendered early search engines nearly useless.

To overcome the problem, various page ranking systems were attempted. The objective was to rank a page based upon its popularity among users who really did want to view its contents. One way to estimate that is to count how many other pages have a link to that page. For example, there might be 100,000 links to https://en.wikipedia.org/wiki/Renaissance, but only 100 to https://en.wikipedia.org/wiki/Ernest_Renan, so the former would be given a much higher rank than the latter. But simply counting the links to a page will not work either. For example, the rug-cleaning service could simply create 100 bogus web pages, each containing a link to the page they want users to view.

In 1996, Larry Page and Sergey Brin, while students at Stanford University, invented their PageRank algorithm. It simulates the web itself, represented by a very large directed graph, in which each web page is represented by a node in the graph, and each page link is represented by a directed edge in the graph. The directed graph shown in the figure below could represent a very small network with the same properties.

This graph has four nodes, representing four web pages, A, B, C, and D. The arrows connecting them represent page links. So, for example, page A has a link to each of the other three pages, but page B has a link only to A. To analyze this tiny network, we first identify its transition matrix, M. This square matrix has 16 entries, mij, for 1 ≤ i ≤ 4 and 1 ≤ j ≤ 4. If we assume that a web crawler always picks a link at random to move from one page to another, then mij equals the probability that it will move to node i from node j (numbering the nodes A, B, C, and D as 1, 2, 3, and 4). So m12 = 1 means that if it's at node B, there's a 100% chance that it will move next to A. Similarly, m13 = m43 = ½ means that if it's at node C, there's a 50% chance of it moving to A and a 50% chance of it moving to D.
Suppose a web crawler picks one of those four pages at random, and then moves to another page, once a minute, picking each link at random. After several hours, what percentage of the time will it have spent at each of the four pages?

Here is a similar question. Suppose there are 1,000 web crawlers who obey that transition matrix as we've just described, and that 250 of them start at each of the four pages. After several hours, how many will be on each of the four pages?

This process is called a Markov chain. It is a mathematical model that has many applications in physics, chemistry, computer science, queueing theory, economics, and even finance. The diagram in the above figure is called the state diagram for the process, and the nodes of the graph are called the states of the process. Once the state diagram is given, the meaning of the nodes (web pages, in this case) becomes irrelevant. Only the structure of the diagram defines the transition matrix M, and from that we can answer the question. A more general Markov chain would also specify transition probabilities between the nodes, instead of assuming that all transition choices are made at random. In that case, those transition probabilities become the non-zero entries of M.

A Markov chain is called irreducible if it is possible to get to any state from any other state. According to the mathematical theory of Markov chains, if the chain is irreducible, then we can compute the answer to the preceding question using the transition matrix. What we want is the steady state solution; that is, a distribution of crawlers that doesn't change. The crawlers themselves will change, but the number at each node will remain the same.

To calculate the steady state solution mathematically, we first have to realize how to apply the transition matrix M. The fact is that if x = (x1, x2, x3, x4) is the distribution of crawlers at one minute, and the next minute the distribution is y = (y1, y2, y3, y4), then y = Mx, using matrix multiplication. So now, if x is the steady state solution for the Markov chain, then Mx = x. This vector equation gives us four scalar equations in four unknowns, one of which is redundant (linearly dependent). But we also know that x1 + x2 + x3 + x4 = 1, since x is a probability vector. So, we're back to four equations in four unknowns, and solving them yields the steady state distribution (the explicit solution is given in the book).

The point of that example is to show that we can compute the steady state solution to a static Markov chain by solving an n × n matrix equation, where n is the number of states. By static here, we mean that the transition probabilities mij do not change. Of course, that does not mean that we can mathematically compute the web. In the first place, n > 30,000,000,000,000 nodes! And in the second place, the web is certainly not static. Nevertheless, this analysis does give some insight about the web; and it clearly influenced the thinking of Larry Page and Sergey Brin when they invented the PageRank algorithm.

The purpose of the PageRank algorithm is to rank the web pages according to some criteria that would resemble their importance, or at least their frequency of access. The original simple (pre-PageRank) idea was to count the number of links to each page and use something proportional to that count for the rank. Following that line of thought, we can imagine that, if x = (x1, x2, ..., xn)^T is the page rank vector for the web (that is, if xj is the relative rank of page j and ∑xj = 1), then Mx = x, at least approximately.
Another way to put that is that repeated applications of M to x should nudge x closer and closer to that (unattainable) steady state. That brings us (finally) to the PageRank formula, in which ε is a very small positive constant, z is the vector of all 1s, and n is the number of nodes (the formula itself appears as a figure in the book; a reconstruction is sketched below). The vector expression on the right-hand side defines the transformation function f which replaces a page rank estimate x with an improved page rank estimate. Repeated applications of this function gradually converge to the unknown steady state.

Note that in the formula, f is a function of more than just x. There are really four inputs: x, M, ε, and n. Of course, x is being updated, so it changes with each iteration. But M, ε, and n change too. M is the transition matrix, n is the number of nodes, and ε is a coefficient that determines how much influence the z/n vector has. For example, the book sets ε to 0.00005 and substitutes that value into the formula.

This is how Google's PageRank algorithm can be utilized for the analysis of very large datasets. To learn how to implement this algorithm and various other machine learning algorithms for big data, data visualization, and more using Java, check out this book Java Data Analysis.
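For reference, the update described above is commonly written as x ← (1 − ε)Mx + εz/n, which matches the roles given to ε, z, and n in the text; treat this as an editorial reconstruction rather than a quotation from the book. Below is a minimal Java sketch of this power iteration on a small dense transition matrix. The matrix values, iteration count, and class name are illustrative assumptions only; in particular, the out-links of node D are not specified in the excerpt, so they are invented here.

// Hedged sketch: power iteration for the update x <- (1 - eps)*M*x + eps*z/n.
// The 4-node transition matrix below is an illustrative example, not the book's.
public class PageRankSketch {
    public static void main(String[] args) {
        double[][] m = {            // column j holds the out-link probabilities of node j
            {0.0, 1.0, 0.5, 0.0},
            {1.0 / 3, 0.0, 0.0, 0.5},
            {1.0 / 3, 0.0, 0.0, 0.5},
            {1.0 / 3, 0.0, 0.5, 0.0}
        };
        int n = m.length;
        double eps = 0.00005;
        double[] x = new double[n];
        java.util.Arrays.fill(x, 1.0 / n);   // start from the uniform distribution

        for (int iter = 0; iter < 100; iter++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    sum += m[i][j] * x[j];     // (M x)_i
                }
                next[i] = (1 - eps) * sum + eps / n;   // eps * z_i / n, with z_i = 1
            }
            x = next;
        }
        for (int i = 0; i < n; i++) {
            System.out.printf("page %d rank = %.4f%n", i + 1, x[i]);
        }
    }
}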


How to build a cold-start friendly content-based recommender using Apache Spark SQL

Sunith Shetty
14 Dec 2017
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. In this book, you will learn how to perform analytics on big data using Java, machine learning and other big data tools. [/box] From the article below, you will learn how to create a content-based recommendation system using the MovieLens dataset.

Getting started with content-based recommendation systems

In content-based recommendations, the recommendation system checks for similarity between items based on their attributes or content and then proposes those items to the end users. For example, if there is a movie and the recommendation system has to show similar movies to the users, it might check for attributes of the movie such as the director name, the actors in the movie, the genre of the movie, and so on; or, if there is a news website and the recommendation system has to show similar news, it might check for the presence of certain words within the news articles to build the similarity criteria. As such, the recommendations are based on actual content, whether in the form of tags, metadata, or content from the item itself (as in the case of news articles).

Dataset

For this article, we are using the very popular MovieLens dataset, which was collected by the GroupLens Research Project at the University of Minnesota. This dataset contains a list of movies rated by their customers. It can be downloaded from https://grouplens.org/datasets/movielens. There are a few files in this dataset:

u.item: This contains the information about the movies in the dataset. The attributes in this file are:
- Movie id: the ID of the movie
- Movie title: the title of the movie
- Video release date: the release date of the movie
- Genre (isAction, isComedy, isHorror, and so on): the genre of the movie, represented by multiple Boolean flags, with 1 indicating that the movie belongs to that genre and 0 indicating that it does not

u.data: This contains the data about the users' ratings for the movies that they have watched. The main attributes in this file are:
- User id: the ID of the user rating the movie
- Item id: the ID of the movie
- Rating: the rating given to the movie by the user

u.user: This contains the demographic information about the users. The main attributes from this file are:
- User id: the ID of the user rating the movie
- Age: the age of the user
- Gender: the gender of the user
- ZipCode: the zip code of the place of the user

As the data is pretty clean and we are only dealing with a few parameters, we are not doing any extensive data exploration in this article for this dataset. However, we urge the users to try and practice data exploration on this dataset, for example by creating bar charts using the ratings given, and finding patterns such as top-rated movies, top-rated action movies, top comedy movies, and so on.

Content-based recommender on the MovieLens dataset

We will build a simple content-based recommender using Apache Spark SQL and the MovieLens dataset. For figuring out the similarity between movies, we will use the Euclidean distance. This Euclidean distance is computed over the genre properties of the movie items, that is, action, comedy, horror, and so on, together with the average rating for each movie.
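For concreteness, the distance described above can be written as follows, where g_i(m) denotes the 0/1 genre flags of movie m and r(m) its average rating; the notation is an editorial assumption, since the excerpt simply sums the squared differences of all compared attributes before taking the square root:

$$ d(m_1, m_2) = \sqrt{\sum_i \big(g_i(m_1) - g_i(m_2)\big)^2 + \big(r(m_1) - r(m_2)\big)^2} $$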
First is the usual boilerplate code to create the SparkSession object. For this, we first build the Spark configuration object and provide the master, which in our case is local since we are running it locally on our computer. Next, using this Spark configuration object, we build the SparkSession object and also provide the name of the application, which here is ContentBasedRecommender:

SparkConf sc = new SparkConf().setMaster("local");
SparkSession spark = SparkSession
     .builder()
     .config(sc)
     .appName("ContentBasedRecommender")
     .getOrCreate();

Once the SparkSession is created, load the rating data from the u.data file. We are using the Spark RDD API for this; feel free to change this code and use the Dataset API if you prefer that. On this Java RDD of the data, we invoke a map function, and within the map function code we extract the data from the dataset and populate it into a Java POJO called RatingVO:

JavaRDD<RatingVO> ratingsRDD = spark.read().textFile("data/movie/u.data").javaRDD()
.map(row -> {
  RatingVO rvo = RatingVO.parseRating(row);
  return rvo;
});

As you can see, the data from each row of the u.data dataset is passed to the RatingVO.parseRating method in the lambda function. This parseRating method is declared in the POJO RatingVO. Let's look at the RatingVO POJO first. For maintaining brevity of the code, we are just showing the first few lines of the POJO here:

public class RatingVO implements Serializable {
  private int userId;
  private int movieId;
  private float rating;
  private long timestamp;
  private int like;

As you can see in the preceding code, we store the movieId, the rating given by the user, and the userId attribute in this POJO. This class RatingVO contains one useful method, parseRating, as shown next. In this method, we parse the actual data row (which is tab separated) and extract the values of rating, movieId, and so on from it. In the method, do note the if...else block: it is here that we check whether the rating given is greater than three. If it is, we take it as a 1, or like, and populate it in the like attribute of RatingVO; else we mark it as 0, or dislike, for the user:

public static RatingVO parseRating(String str) {
     String[] fields = str.split("\t");
     ...
     int userId = Integer.parseInt(fields[0]);
     int movieId = Integer.parseInt(fields[1]);
     float rating = Float.parseFloat(fields[2]);
     if(rating > 3) return new RatingVO(userId, movieId, rating, timestamp, 1);
     return new RatingVO(userId, movieId, rating, timestamp, 0);
}

Once we have the RDD object built with a RatingVO per row, we are ready to build our dataset object. We use the createDataFrame method on the Spark session and provide the RDD containing the data and the corresponding POJO class, that is, RatingVO. We also register this dataset as a temporary view called ratings so that we can fire SQL queries on it:

Dataset<Row> ratings = spark.createDataFrame(ratingsRDD, RatingVO.class);
ratings.createOrReplaceTempView("ratings");
ratings.show();

The show method in the preceding code prints the first few rows of this dataset. Next we will fire a Spark SQL query on this view (ratings), and in this SQL query we will find the average rating in this dataset for each movie.
For this, we group by the movieId in this query and invoke the average function on the rating column as shown here:

Dataset<Row> moviesLikeCntDS = spark.sql("select movieId,avg(rating) likesCount from ratings group by movieId");

The moviesLikeCntDS dataset now contains the results of our group by query. Next we load the data for the movies from the u.item data file. As we did for the ratings, we store this data for the movies in a MovieVO POJO. This MovieVO POJO contains the data for the movies, such as the MovieId and MovieTitle, and it also stores the information about the movie genre, such as action, comedy, animation, and so on. The genre information is stored as 1 or 0. For maintaining the brevity of the code, we are not showing the full code of the lambda function here:

JavaRDD<MovieVO> movieRdd = spark.read().textFile("data/movie/u.item").javaRDD()
.map(row -> {
  String[] strs = row.split("\\|");
  MovieVO mvo = new MovieVO();
  mvo.setMovieId(strs[0]);
  mvo.setMovieTitle(strs[1]);
  ...
  mvo.setAction(strs[6]);
  mvo.setAdventure(strs[7]);
  ...
  return mvo;
});

As you can see in the preceding code, we split the data row (the u.item file is pipe delimited, so the pipe character is escaped in the regular expression), and from the split result, which is an array of strings, we extract the individual values and store them in the MovieVO object. The results of this operation are stored in the movieRdd object, which is a Spark RDD. Next, we convert this RDD into a Spark dataset. To do so, we invoke the Spark createDataFrame function, provide our movieRdd to it, and also supply the MovieVO POJO. After creating the dataset for the movie RDD, we perform an important step here of combining our movie dataset with the moviesLikeCntDS dataset we created earlier (recall that the moviesLikeCntDS dataset contains the movie ID and the average rating for that movie). We also register this new dataset as a temporary view so that we can fire Spark SQL queries on it:

Dataset<Row> movieDS = spark.createDataFrame(movieRdd.rdd(), MovieVO.class).join(moviesLikeCntDS, "movieId");
movieDS.createOrReplaceTempView("movies");

Before we move further in this program, we will print the results of this new dataset by invoking the show method on it:

movieDS.show();

This prints the results (for brevity, we are not showing the full set of columns here).

Now comes the meat of this content recommender program. Here we see the power of Spark SQL: we will now do a self join within a Spark SQL query on the temporary view movies. Here, we will pair every movie with every other movie in the dataset except itself. So if you have, say, three movies in the set as (movie1, movie2, movie3), this would result in the combinations (movie1, movie2), (movie1, movie3), (movie2, movie3), (movie2, movie1), (movie3, movie1), (movie3, movie2). You must have noticed by now that this query produces duplicates, as (movie1, movie2) is the same as (movie2, movie1), so we will have to write separate code to remove those duplicates (one simple way to do that in the join itself is sketched right after this query). But for now the code for fetching these combinations is shown as follows; for maintaining brevity we have not shown the full code:

Dataset<Row> movieDataDS = spark.sql("select m.movieId movieId1,m.movieTitle movieTitle1,m.action action1,m.adventure adventure1, ... " + "m2.movieId movieId2,m2.movieTitle movieTitle2,m2.action action2,m2.adventure adventure2, ..."

If you invoke show on this dataset and print the result, it prints a lot of columns and their first few values.
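As mentioned above, one simple way to avoid generating both orderings of the same pair is to impose an ordering condition in the self-join itself. The sketch below is an editorial illustration, not the book's code; it reuses the movies view and the column aliases from the query above. Since movieId is stored as a string in MovieVO, the comparison is lexicographic, which is still enough to keep just one ordering of each pair, and depending on the Spark version such a non-equi self-join may require enabling cross joins in the configuration.

// Hedged sketch: keep only one ordering of each movie pair (column list shortened).
Dataset<Row> moviePairsDS = spark.sql(
    "select m.movieId movieId1, m.movieTitle movieTitle1, m.action action1, " +
    "m2.movieId movieId2, m2.movieTitle movieTitle2, m2.action action2 " +
    "from movies m, movies m2 where m.movieId < m2.movieId");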
In the output, the columns for movie1 appear on the left-hand side and the corresponding movie2 columns on the right-hand side. On a big dataset, this operation is massive and requires extensive computing resources. Thankfully, on Spark you can distribute this operation to run on multiple compute nodes. For this, you can tweak the spark-submit job with the following parameters to get the most out of all the cluster nodes:

spark-submit --executor-cores <number of cores> --num-executors <number of executors>

The next step is the most important one in this content recommender program. We run over the results of the previous query, and from each row we pull out the data for movie1 and movie2, cross compare the attributes of both movies, and find the Euclidean distance between them. This shows how similar the two movies are; the greater the distance, the less similar the movies. As expected, we convert our dataset to an RDD and invoke a map function, and within the lambda function for that map we go over all the dataset rows, extract the data from each row, and compute the Euclidean distance. For conciseness, only a portion of the code is shown next. Also note that we pull the information for both movies and store the calculated Euclidean distance in a EuclidVO Java POJO object:

JavaRDD<EuclidVO> euclidRdd = movieDataDS.javaRDD().map( row -> {
   EuclidVO evo = new EuclidVO();
   evo.setMovieId1(row.getString(0));
   evo.setMovieTitle1(row.getString(1));
   evo.setMovieTitle2(row.getString(22));
   evo.setMovieId2(row.getString(21));
   int action = Math.abs(Integer.parseInt(row.getString(2)) - Integer.parseInt(row.getString(23)));
   ...
   double likesCnt = Math.abs(row.getDouble(20) - row.getDouble(41));
   double euclid = Math.sqrt(action * action + ... + likesCnt * likesCnt);
   evo.setEuclidDist(euclid);
   return evo;
});

As you can see in the preceding code, we calculate the Euclidean distance and store it in a variable. Next, we convert our RDD to a dataset object and again register it as a temporary view, movieEuclids, and we are now ready to fire queries for our predictions:

Dataset<Row> results = spark.createDataFrame(euclidRdd.rdd(), EuclidVO.class);
results.createOrReplaceTempView("movieEuclids");

Finally, we are ready to make our predictions using this dataset. Let's see our first prediction: let's find the movies that are closest to Toy Story, which has a movieId of 1. We fire a simple query on the view movieEuclids to find them:

spark.sql("select * from movieEuclids where movieId1 = 1 order by euclidDist asc").show(20);

As you can see in the preceding query, we order by Euclidean distance in ascending order, as a smaller distance means more similarity. This prints the first 20 rows of the result. The results are not bad for a content recommender system with little to no historical data: the first few movies returned are, like Toy Story itself, famous animation movies, such as Aladdin, Winnie the Pooh, Pinocchio, and so on. So the little content recommender we built performs reasonably well. You can do many more things to make it even better, for example by trying out more attributes to compare on, or by using a different similarity measure such as the Pearson coefficient or the Jaccard distance. From the article, we learned how content recommenders can be built with zero to no historical data.
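To make the distance calculation concrete outside of Spark, here is a small self-contained sketch of the same idea: the Euclidean distance between two movies represented as 0/1 genre flags plus an average-rating value. The method name, the array representation, and the sample genre vectors are our own illustration, not the book's code:

public class EuclideanDistanceExample {

    // Distance over binary genre flags plus one continuous likesCount attribute.
    static double euclideanDistance(int[] genres1, double likes1,
                                    int[] genres2, double likes2) {
        double sum = 0.0;
        for (int i = 0; i < genres1.length; i++) {
            int diff = Math.abs(genres1[i] - genres2[i]);   // 0 or 1 per genre
            sum += diff * diff;
        }
        double likesDiff = Math.abs(likes1 - likes2);
        sum += likesDiff * likesDiff;
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Toy example: two animation movies share most genre flags...
        int[] toyStory = {0, 0, 1, 1, 1};   // action, adventure, animation, children, comedy
        int[] aladdin  = {0, 0, 1, 1, 1};
        // ...while an action movie differs on several flags.
        int[] dieHard  = {1, 1, 0, 0, 0};

        System.out.println(euclideanDistance(toyStory, 3.8, aladdin, 3.7)); // small distance
        System.out.println(euclideanDistance(toyStory, 3.8, dieHard, 3.9)); // larger distance
    }
}

Because the genre attributes are binary, each differing genre contributes exactly 1 to the squared sum, so the distance is largely driven by how many genres the two movies disagree on.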
These recommenders are based on the attributes present on the item itself, which we use to figure out the similarity with other items and recommend the most similar ones. To know more about data analysis, data visualization, and machine learning techniques, you can refer to the book Big Data Analytics with Java.
article-image-nips-2017-special-machine-learning-genomics-bridging-gap-research-clinical-trial-success-brendan-frey
Sugandha Lahoti
14 Dec 2017
10 min read
Save for later

NIPS 2017 Special: How machine learning for genomics is bridging the gap between research and clinical trial success by Brendan Frey

Brendan Frey is the founder and CEO of Deep Genomics and a professor of engineering and medicine at the University of Toronto. His major work focuses on using machine learning to model genome biology and understand genetic disorders. This article attempts to bring our readers to Brendan's keynote speech at NIPS 2017. It highlights how the human genome can be reprogrammed using machine learning and gives a glimpse into some of the significant work going on in this field. After reading this article, head over to the NIPS Facebook page for the complete keynote. All images in this article come from Brendan's presentation slides and do not belong to us.

65% of people are at risk of acquiring a disease with a genetic basis in their lifetime. 8 million births per year are estimated to have a serious genetic defect. According to the US healthcare system, the average lifetime cost of such a baby is $5M per child. These are just some statistics. If we also add the emotional component to this data, it gives us an alarming picture of the state of the healthcare industry today. According to a recent study, investing in pharma is no longer as lucrative as it used to be in the 90s. Funding for this sector is dwindling, which serves as a barrier to drug discovery, trials, and deployment. All of these, in turn, add to the rising cost of healthcare. Better to stuff your money in a mattress than put it in a pharmaceutical company!

Genomics as a field is rich in data. Experts in genomics strive to determine complete DNA sequences and perform genetic mapping to help understand a disease. However, the main problem confronting genome biology and genomics is the inability to decipher information from the human genome, that is, how to convert the genome into actionable information.

What genes are made of and why sequencing matters

Essentially, each gene consists of a promoter region, which activates the gene. Following the promoter region, there are alternating exons and introns. Introns are almost 10,000 nucleotides long; exons are relatively short, around 100 nucleotides long. In software terms, you can think of exons as print statements. Exons are the part that ends up in proteins; introns get cut out. However, introns contain crucial control logic: there are words embedded in introns that tell the cells how to cut and paste these exons together and make the gene. A DNA sequence is transcribed into RNA, and the RNA is then processed in various ways to be translated into proteins. The picture is much more complicated than this, though. Proteins go back and interact with the DNA, proteins also interact with RNA, and even RNA interacts with proteins. So all these entities are interrelated, and these interrelationships make biology too complex for a researcher, or even a group of researchers, to fully understand and make sense of the data. Another way to look at it is that in recent years our ability to measure biology (fitbits, tabloids, genomes) and our ability to alter biology (DNA editing) have far surpassed our ability to understand biology. In short, in this field we have become very good at collecting data but not as good at interpreting it.

Machine learning brought to genomes

Deep Genomics is a genetic medicine company that uses an AI-driven platform to support geneticists, molecular biologists, and chemists in the development of genetic therapies.
In 2010, Deep Genomics used machine learning to understand how words embedded in introns control the splicing that puts exons (the print statements) into proteins. They also used machine learning to reverse engineer and infer those code words from datasets.

Another Deep Genomics research project deals with protein-DNA binding data. There are datasets that allow you to measure interactions between proteins and DNA and understand how those interactions work. In this research, they took a dataset from Ray et al. 2013 consisting of 240,000 designed sequences and evaluated which proteins each sequence likes to stick to, thus generating a big data matrix of proteins and designed sequences. The machine learning task was to learn to take a sequence and predict whether a given protein will bind to that sequence.

How was this done? They took batches of data containing the designed sequences and fed them into a convolutional neural network. The CNN swept across those sequences to generate an intermediate representation. This representation was then fed into further convolutional and pooling layers, and fully connected layers produced the output. The output was compared to the measurement (the data matrix of proteins and designed sequences described earlier) and backpropagation was used to update the parameters. One of the challenges was figuring out the right metric: they compared the measured binding affinity (how strongly the protein sticks to the sequence) to the output of the neural network and determined the right cost function for producing a neural network that is useful in practice.

Use case

One of the use cases of this neural network is to identify pathological mutations and fix them. The illustration in the slides shows a sequence of the cholesterol gene. The researchers looked, in silico, at every possible mutation in the promoter. For each nucleotide, say one with a value of A, they switched it to G, C, and T, and for each of those possibilities they ran the entire promoter through the neural network and looked at its output. The neural network then predicted which mutations would disrupt protein binding. The height of each letter showed the measured binding affinity, that is, the output of the neural network, and the white box displayed how much the mutation changed the output. Pink or bright red was used for positive mutations, blue for negative mutations, and white for no change. This map was then compared with known results to check its accuracy and to make predictions never seen before in a clinical trial. As shown in the slide, the blues, which are the potential or known harmful mutations, correctly fall in the white spaces, but there are some unknown mutations as well. Machine learning output such as this can help researchers narrow their focus when learning about new diseases, and also when diagnosing and treating existing ones.

Another group of researchers used a neural network to figure out the 3D structure, or the chromatin interaction structure, of the DNA. The data used was in matrix form and showed how strongly two parts of a DNA sequence are likely to interact. The researchers trained a multilayer convolutional network that takes as input the raw DNA sequence together with a signal called chromatin accessibility (which tells how accessible the DNA is). The output of that system predicted the probability of contact, which is crucial for gene expression.
Deep Genomics: Using AI to build a new universe of digital medicines

The founding belief at Deep Genomics is that the future of medicine will rely on artificial intelligence, because biology is too complex for humans to understand. The goal of Deep Genomics is to build an AI platform for detecting and treating genetic disease.

Genome tools

Genome processing tools help in the identification of mutations, for example DeepVariant. At Deep Genomics, the tool used is called Genomic Kit, which is 20 to 800 times faster than other existing tools.

Disease mechanism prediction

This is about figuring out whether a mutation drives a pathological disease mechanism or simply changes, say, hair color.

Therapeutic development

This is about helping patients by providing them with better medicines. These are the basics of any drug development procedure. We start with patient genetic data and clinical mutations. Then we find the disease mechanism and figure out the mechanism of action (the steps to remediate the problem). However, the disease mechanism and the mechanism of action of a potential drug may not be the inverse of one another. The next step is to design a drug. With digital medicines, if we know the mechanism of action we are trying to achieve, and we have ML systems like the ones described earlier, we can simulate the effects of modifying DNA or RNA. Thus we can design the compound we want to test in silico. Next, we do the experimental work in the wet lab to see whether the compound alters biology in the way the ML systems predicted. The next thing is toxicity, or off-target effects: this evaluates whether the compound is going to change some other part of the genome or has some unintended consequences. Next come clinical trials; here, one of the biggest problems facing pharmaceutical companies is patient stratification. Then comes the marketing and distribution of the drug, which is highly costly. This includes marketing strategies to convince people to buy the drug, insurance companies to pay for it, and legal companies to deal with litigation.

Consider how long it took Ionis and Biogen to develop Spinraza, a drug for treating Spinal Muscular Atrophy (SMA). It is the most effective drug for SMA and has already saved hundreds of lives. However, it costs $750,000 per child per year. Why does it cost so much? If we look at the timeline of the development of Spinraza, the initial period of testing was quite long. The goal of Deep Genomics is to use ML to shorten the research period of drugs such as Spinraza from 8 years down to a couple of years. They also aim to use AI to accelerate clinical trials, toxicity studies, and other aspects of drug development. The whole idea is to reduce the amount of time needed to develop a drug. Deep Genomics uses AI to automate and accelerate each of these steps and make them fast and accurate. Apart from AI, however, they also test compounds on human cells in their wet lab to see if they work. They also use a cloud laboratory: they upload a Python script that specifies the experimental protocols, and robots then conduct the experiments. Such labs rapidly scale up the ability to do experiments, test compounds, and solve other problems.

Earning the trust of stakeholders

One of the major issues ML systems face in the genomics industry is earning the trust of the stakeholders.
These stakeholders include the patients, the physicians treating the patients, the insurance companies paying for the treatments, different technology providers, and the hospitals. Machine learning practitioners are also often criticized for producing black boxes that are not open to interpretation. The way to gain this trust is to figure out exactly what these stakeholders need. For this, machine learning systems need to explain the intermediate steps of a prediction. For instance, instead of directly recommending a double mastectomy, the system explains that you have a mutation, the mutation is going to cause splicing to go wrong, leading to a malfunctioning protein, which is likely to lead to breast cancer, and the likelihood is x%.

The road ahead

Researchers at Deep Genomics are currently working primarily on Project Saturn. The idea is to use a machine learning system to scan a vast space of 69 billion molecules, all in silico, and identify about a thousand active compounds. Active compounds allow us to manipulate cell biology: think of them as 1,000 control switches that we can turn and twist to adjust what is going on inside a cell, a toolkit for therapeutic development. They plan to have 3 compounds in clinical trials within the next 3 years.

article-image-how-to-customize-lines-and-markers-in-matplotlib-2-0
Sugandha Lahoti
13 Dec 2017
6 min read
Save for later

How to Customize lines and markers in Matplotlib 2.0

[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim, titled Matplotlib 2.x By Example. The book illustrates methods and applications of various plot types through real world examples.[/box] In this post we demonstrate how you can manipulate Lines and Markers in Matplotlib 2.0. It covers steps to plot, customize, and adjust line graphs and markers. What are Lines and Markers Lines and markers are key components found among various plots. Many times, we may want to customize their appearance to better distinguish different datasets or for better or more consistent styling. Whereas markers are mainly used to show data, such as line plots and scatter plots, lines are involved in various components, such as grids, axes, and box outlines. Like text properties, we can easily apply similar settings for different line or marker objects with the same method. Lines Most lines in Matplotlib are drawn with the lines class, including the ones that display the data and those setting area boundaries. Their style can be adjusted by altering parameters in lines.Line2D. We usually set color, linestyle, and linewidth as keyword arguments. These can be written in shorthand as c, ls, and lw respectively. In the case of simple line graphs, these parameters can be parsed to the plt.plot() function: import numpy as np import matplotlib.pyplot as plt # Prepare a curve of square numbers x = np.linspace(0,200,100) # Prepare 100 evenly spaced numbers from # 0 to 200 y = x**2                                   # Prepare an array of y equals to x squared # Plot a curve of square numbers plt.plot(x,y,label = '$x^2$',c='burlywood',ls=('dashed'),lw=2) plt.legend() plt.show() With the preceding keyword arguments for line color, style, and weight, you get a woody dashed curve: Choosing dash patterns Whether a line will appear solid or with dashes is set by the keyword argument linestyle. There are a few simple patterns that can be set by the linestyle name or the corresponding shorthand. We can also define our own dash pattern: 'solid' or '-': Simple solid line (default) 'dashed' or '--': Dash strokes with equal spacing 'dashdot' or '-.': Alternate dashes and dots 'None', ' ', or '': No lines (offset, on-off-dash-seq): Customized dashes; we will demonstrate in the following advanced example Setting capstyle of dashes The cap of dashes can be rounded by setting the dash_capstyle parameter if we want to create a softer image such as in promotion: import numpy as np import matplotlib.pyplot as plt # Prepare 6 lines x = np.linspace(0,200,100) y1 = x*0.5 y2 = x y3 = x*2 y4 = x*3 y5 = x*4 y6 = x*5 # Plot lines with different dash cap styles plt.plot(x,y1,label = '0.5x', lw=5, ls=':',dash_capstyle='butt') plt.plot(x,y2,label = 'x', lw=5, ls='--',dash_capstyle='butt') plt.plot(x,y3,label = '2x', lw=5, ls=':',dash_capstyle='projecting') plt.plot(x,y4,label = '3x', lw=5, ls='--',dash_capstyle='projecting') plt.plot(x,y5,label = '4x', lw=5, ls=':',dash_capstyle='round') plt.plot(x,y6,label = '5x', lw=5, ls='--',dash_capstyle='round') plt.show() Looking closely, you can see that the top two lines are made up of rounded dashes. The middle two lines with projecting capstyle have closer spaced dashes than the lower two with butt one, given the same default spacing: Markers A marker is another type of important component for illustrating data, for example, in scatter plots, swarm plots, and time series plots. 
Choosing markers

There are two groups of markers: unfilled markers and filled markers. The full set of available markers can be found by calling Line2D.markers, which outputs a dictionary of symbols and their corresponding marker style names. The subset of filled markers, which gives more visual weight, is under Line2D.filled_markers. Here are some of the most typical markers:

'o' : Circle
'x' : Cross
'+' : Plus sign
'P' : Filled plus sign
'D' : Filled diamond
's' : Square
'^' : Triangle

Here is a scatter plot of random numbers to illustrate the various marker types:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Prepare 100 random numbers to plot
x = np.random.rand(100)
y = np.random.rand(100)

# Prepare 100 random numbers within the range of the number of
# available markers as index
# Each random number will serve as the choice of marker of the
# corresponding coordinates
markerindex = np.random.randint(0, len(Line2D.markers), 100)

# Plot all kinds of available markers at random coordinates
# for each type of marker, plot a point at the above generated
# random coordinates with the marker type
for k, m in enumerate(Line2D.markers):
    i = (markerindex == k)
    plt.scatter(x[i], y[i], marker=m)

plt.show()

The different markers suit different densities of data for better distinction of each point.

Adjusting marker sizes

We often want to change the marker sizes so as to make them clearer to read from a slideshow. Sometimes we need to give different marker types different numerical sizes so that they look visually balanced:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Prepare 5 lines
x = np.linspace(0,20,10)
y1 = x
y2 = x*2
y3 = x*3
y4 = x*4
y5 = x*5

# Plot lines with different marker sizes
plt.plot(x,y1,label = 'x', lw=2, marker='s', ms=10)  # square size 10
plt.plot(x,y2,label = '2x', lw=2, marker='^', ms=12) # triangle size 12
plt.plot(x,y3,label = '3x', lw=2, marker='o', ms=10) # circle size 10
plt.plot(x,y4,label = '4x', lw=2, marker='D', ms=8)  # diamond size 8
plt.plot(x,y5,label = '5x', lw=2, marker='P', ms=12) # filled plus sign size 12

# get current axes and store it to ax
ax = plt.gca()
plt.show()

After tuning the marker sizes, the different series look quite balanced. If all markers were set to the same markersize value, the diamonds and squares would look heavier than the rest.

Thus, we learned how to customize lines and markers in a Matplotlib plot for better visualization and styling. To know more about how to create and customize plots in Matplotlib, check out the book Matplotlib 2.x By Example.