Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-google-ai-releases-cirq-and-open-fermion-cirq-to-boost-quantum-computation

19 Jul 2018

3 min read

Google AI releases Cirq and Open Fermion-Cirq to boost Quantum computation

19 Jul 2018

Google AI Quantum team announced two releases at the First International Workshop on Quantum Software and Quantum Machine Learning(QSML) yesterday. Firstly the public alpha release of Cirq, an open source framework for NISQ computers. The second release is OpenFermion-Cirq, an example of a Cirq-based application enabling near-term algorithms. Noisy Intermediate Scale Quantum (NISQ) computers are devices including ~50 - 100 qubits and high fidelity quantum gates enhance the quantum algorithms such that they can understand the power that these machines uphold. However, quantum algorithms for the quantum computers have their limitations such as A poor mapping between the algorithms and the machines Also, some quantum processors have complex geometric constraints These and other nuances inevitably lead to wasted resources and faulty computations. Cirq comes as a great help for researchers here. It is focussed on near-term questions, which help researchers to understand whether NISQ quantum computers are capable of solving computational problems of practical importance. It is licensed under Apache 2 and is free to be either embedded or modified within any commercial or open source package. Cirq highlights With Cirq, researchers can write quantum algorithms for specific quantum processors. It provides a fine-tuned user control over quantum circuits by, specifying gate behavior using native gates, placing these gates appropriately on the device, and scheduling the timing of these gates within the constraints of the quantum hardware. Other features of Cirq include: Allows users to leverage the most out of NISQ architectures with optimized data structures to write and compile the quantum circuits. Supports running of the algorithms locally on a simulator Designed to easily integrate with future quantum hardware or larger simulators via the cloud. OpenFermion-Cirq highlights Google AI Quantum team also released OpenFermion-Cirq, which is an example of a CIrq-based application that enables the near-term algorithms. OpenFermion is a platform for developing quantum algorithms for chemistry problems. OpenFermion-Cirq extends the functionality of OpenFermion by providing routines and tools for using Cirq for compiling and composing circuits for quantum simulation algorithms. An instance of the OpenFermion-Cirq is, it can be used to easily build quantum variational algorithms for simulating properties of molecules and complex materials. While building Cirq, the Google AI Quantum team worked with early testers to gain feedback and insight into algorithm design for NISQ computers. Following are some instances of Cirq work resulting from the early adopters: Zapata Computing: simulation of a quantum autoencoder (example code, video tutorial) QC Ware: QAOA implementation and integration into QC Ware’s AQUA platform (example code, video tutorial) Quantum Benchmark: integration of True-Q software tools for assessing and extending hardware capabilities (video tutorial) Heisenberg Quantum Simulations: simulating the Anderson Model Cambridge Quantum Computing: integration of proprietary quantum compiler t|ket> (video tutorial) NASA: architecture-aware compiler based on temporal-planning for QAOA (slides) and simulator of quantum computers (slides) The team also announced that it is using Cirq to create circuits that run on Google’s Bristlecone processor. Their future plans include making the Bristlecone processor available in cloud with Cirq as the interface for users to write programs for this processor. To know more about both the releases, check out the GitHub repositories of each Cirq and OpenFermion-Cirq. Q# 101: Getting to know the basics of Microsoft’s new quantum computing language Google Bristlecone: A New Quantum processor by Google’s Quantum AI lab Quantum A.I. : An intelligent mix of Quantum+A.I.

0
0
3407

article-image-setting-up-an-ethereum-development-environment-tutorial

Packt Editorial Staff

18 Jul 2018

8 min read

How to set up an Ethereum development environment [Tutorial]

Packt Editorial Staff

18 Jul 2018

8 min read

There are various ways to develop Ethereum blockchain. We will look at the mainstream options in this article which are: Test networks How to setup Ethereum private net This tutorial is extracted from the book Mastering Blockchain - Second Edition written by Imran Bashir. There are multiple ways to develop smart contracts on Ethereum. The usual and sensible approach is to develop and test Ethereum smart contracts either on a local private net or a simulated environment, and then it can be deployed on a public testnet. After all the relevant tests are successful on public testnet, the contracts can then be deployed to the public mainnet. There are however variations in this process, and many developers opt to only develop and test contracts on locally simulated environments. Then deploy on public mainnet or their private production blockchain networks. Developing on a simulated environment and then deploying directly to a public network can lead to faster time to production. As setting up private networks may take longer compared to setting a local development environment with a blockchain simulator. Let's start with connecting to a test network. Ethereum connection on test networks The Ethereum Go client (https://geth.ethereum.org) Geth, can be connected to the test network using the following command: $ geth --testnet A sample output is shown in the following screenshot. The screenshot shows the type of the network chosen and various other pieces of information regarding the blockchain download: The output of the geth command connecting to Ethereum test net A blockchain explorer for testnet is located at https://ropsten.etherscan.io can be used to trace transactions and blocks on the Ethereum test network. There are other test networks available too, such as Frontier, Morden, Ropsten, and Rinkeby. Geth can be issued with a command-line flag to connect to the desired network: --testnet: Ropsten network: pre-configured proof-of-work test network --rinkeby: Rinkeby network: pre-configured proof-of-authority test network --networkid value: Network identifier (integer, 1=Frontier, 2=Morden (disused), 3=Ropsten, 4=Rinkeby) (default: 1) Now let us do some experiments with building a private network and then we will see how a contract can be deployed on this network using the Mist and command-line tools. Setting up a private net Private net allows the creation of an entirely new blockchain. This is different from testnet or mainnet in the sense that it uses its on-genesis block and network ID. In order to create private net, three components are needed: Network ID The Genesis File Data directory to store blockchain data. Even though the data directory is not strictly required to be mentioned, if there is more than one blockchain already active on the system, then the data directory should be specified so that a separate directory is used for the new blockchain. On the mainnet, the Geth Ethereum client is capable of discovering boot nodes by default as they are hardcoded in the Geth client, and connects automatically. But on a private net, Geth needs to be configured by specifying appropriate flags and configuration in order for it to be discoverable by other peers or to be able to discover other peers. We will see how this is achieved shortly. In addition to the previously mentioned three components, it is desirable that you disable node discovery so that other nodes on the internet cannot discover your private network and it is secure. If other networks happen to have the same genesis file and network ID, they may connect to your private net. The chance of having the same network ID and genesis block is very low, but, nevertheless, disabling node discovery is good practice, and is recommended. In the following section, all these parameters are discussed in detail with a practical example. Network ID Network ID can be any positive number except 1 and 3, which are already in use by Ethereum mainnet and testnet (Ropsten), respectively. Network ID 786 has been chosen for the example private network discussed later in this section. The genesis file The genesis file contains the necessary fields required for a custom genesis block. This is the first block in the network and does not point to any previous block. The Ethereum protocol performs checking in order to ensure that no other node on the internet can participate in the consensus mechanism unless they have the same genesis block. Chain ID is usually used as an identification of the network. A custom genesis file that will be used later in the example is shown here: { "nonce": "0x0000000000000042", "timestamp": "0x00", "parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000", "extraData": "0x00", "gasLimit": "0x8000000", "difficulty": "0x0400", "mixhash": "0x0000000000000000000000000000000000000000000000000000000000000000", "coinbase": "0x3333333333333333333333333333333333333333", "alloc": { }, "config": { "chainId": 786, "homesteadBlock": 0, "eip155Block": 0, "eip158Block": 0 } } This file is saved as a text file with the JSON extension; for example, privategenesis.json. Optionally, Ether can be pre-allocated by specifying the beneficiary's addresses and the amount of Wei, but it is usually not necessary as being on the private network, Ether can be mined very quickly. In order to pre-allocate a section can be added to the genesis file, as shown here: "alloc": { "0xcf61d213faa9acadbf0d110e1397caf20445c58f ": { "balance": "100000" }, } Now let's see what each of these parameters mean. nonce: This is a 64-bit hash used to prove that PoW has been sufficiently completed. This works in combination with the mixhash parameter. timestamp: This is the Unix timestamp of the block. This is used to verify the sequence of the blocks and for difficulty adjustment. For example, if blocks are being generated too quickly that difficulty goes higher. parentHash: This is always zero being the genesis (first) block as there is no parent of the first block. extraData: This parameter allows a 32-bit arbitrary value to be saved with the block. gasLimit: This is the limit on the expenditure of gas per block. difficulty: This parameter is used to determine the mining target. It represents the difficulty level of the hash required to prove the PoW. mixhash: This is a 256-bit hash which works in combination with nonce to prove that sufficient amount of computational resources has been spent in order to complete the PoW requirements. coinbase: This is the 160-bit address where the mining reward is sent to as a result of successful mining. alloc: This parameter contains the list of pre-allocated wallets. The long hex digit is the account to which the balance is allocated. config: This section contains various configuration information defining chain ID, and blockchain hard fork block numbers. This parameter is not required to be used in private networks. Data directory This is the directory where the blockchain data for the private Ethereum network will be saved. For example, in the following example, it is ~/etherprivate/. In the Geth client, a nu mber of parameters are specified in order to launch, further fine-tune the configuration, and launch the private network. These flags are listed here. Flags and their meaning The following are the flags used with the Geth client: --nodiscover: This flag ensures that the node is not automatically discoverable if it happens to have the same genesis file and network ID. --maxpeers: This flag is used to specify the number of peers allowed to be connected to the private net. If it is set to 0, then no one will be able to connect, which might be desirable in a few scenarios, such as private testing. --rpc: This is used to enable the RPC interface in Geth. --rpcapi: This flag takes a list of APIs to be allowed as a parameter. For example, eth, web3 will enable the Eth and Web3 interface over RPC. --rpcport: This sets up the TCP RPC port; for example: 9999. --rpccorsdomain: This flag specifies the URL that is allowed to connect to the private Geth node and perform RPC operations. cors in --rpccorsdomain means cross-origin resource sharing. --port: This specifies the TCP port that will be used to listen to the incoming connections from other peers. --identity: This flag is a string that specifies the name of a private node. Static nodes If there is a need to connect to a specific set of peers, then these nodes can be added to a file where the chaindata and keystore files are saved. For example, in the ~/etherprivate/ directory. The filename should be static- nodes.json. This is valuable in a private network because this way the nodes can be discovered on a private network. An example of the JSON file is shown as follows: [ "enode:// 44352ede5b9e792e437c1c0431c1578ce3676a87e1f588434aff1299d30325c233c8d426fc5 7a25380481c8a36fb3be2787375e932fb4885885f6452f6efa77f@xxx.xxx.xxx.xxx:TCP_P ORT" ] Here, xxx is the public IP address and TCP_PORT can be any valid and available TCP port on the system. The long hex string is the node ID. To summarize, we explored Ethereum test networks and how-to setup private Ethereum networks. Learn about cryptography and cryptocurrencies from this book Mastering Blockchain - Second Edition, to build highly secure, decentralized applications and conduct trusted in-app transactions. Everything you need to know about Ethereum Will Ethereum eclipse Bitcoin? The trouble with Smart Contracts

0
0
4730

article-image-5-reasons-government-should-regulate-technology

Richard Gall

17 Jul 2018

6 min read

5 reasons government should regulate technology

Richard Gall

17 Jul 2018

6 min read

Microsoft's Brad Smith took the unprecedented move last week of calling for government to regulate facial recognition technology. In an industry that has resisted government intervention, it was a bold yet humble step. It was a way of saying "we can't deal with this on our own." There will certainly be people who disagree with Brad Smith. For some the entrepreneurial spirit that is central to tech and startup culture will only be stifled by regulation. But let's be realistic about where we are at the moment - the technology industry has never faced such a crisis of confidence and met with substantial public cynicism. Perhaps government regulation is precisely what we need to move forward. Here are 4 reasons why government should regulate technology. Regulation can restore accountability and rebuild trust in tech We've said it a lot in 2018, but there really is a significant trust deficit in technology at the moment. From Cambridge Analytica scandal to AI bias, software has been making headlines in a way it never has before. This only cultivates a culture of cynicism across the public. And with talk of automation and job losses, it paints a dark picture of the future. It's no wonder that TV series like Black Mirror have such a hold over the public imagination. Of course, when used properly, technology should simply help solve problems - whether that's better consumer tech or improved diagnoses in healthcare. The problem arises when we find that there our problem-solving innovations have unintended consequences. By regulating, government can begin to think through some of these unintended consequences. But more importantly, trust can only be rebuilt once there is some degree of accountability within the industry. Think back to Zuckerberg's Congressional hearing earlier this year - while the Facebook chief may have been sweating, the real takeaway was that his power and influence was ultimately untouchable. Whatever mistakes he's made were just part and parcel of moving fast and breaking things. An apology and a humble shrug might normally pass, but with regulation, things begin to get serious. Misusing user data? We've got a law for that. Potentially earning money from people who want to undermine western democracy? We've got a law for that. Read next: Is Facebook planning to spy on you through your mobile’s microphones? Government regulation will make the conversation around the uses and abuses of technology more public Too much conversation about how and why we build technology is happening in the wrong places. Well, not the wrong places, just not enough places. The biggest decisions about technology are largely made by some of the biggest companies on the planet. All the dreams about a new democratized and open world are all but gone, as the innovations around which we build our lives come from a handful of organizations that have both financial and cultural clout. As Brad Smith argues, tech companies like Microsoft, Google, and Amazon are not the place to be having conversations about the ethical implications of certain technologies. He argues that while it's important for private companies to take more responsibility, it's an "inadequate substitute for decision making by the public and its representatives in a democratic republic." He notes that the commercial dynamics are always going to twist conversations. Companies, after all, are answerable to shareholders - only governments are accountable to the public. By regulating, the decisions we make (or don't make) about technology immediately enter into public discourse about the kind of societies we want to live in. Citizens can be better protected by tech regulation... At present, technology often advances in spite of, not because of, people. For all the talk of human-centered design, putting the customer first, every company that builds software is interested in one thing: making money. AI in particular can be dangerous for citizens For example, according to a ProPublica investigation, AI has been used to predict future crimes in the justice system. That's frightening in itself, of course, but it's particularly terrifying when you consider that criminality was falsely predicted at twice the times for black people as white people. Even in the context of social media filters, in which machine learning serves content based on a user's behavior and profile presents dangers to citizens. It gives rise to fake news and dubious political campaigning, making citizens more vulnerable to extreme - and false - ideas. By properly regulating this technology we should immediately have more transparency over how these systems work. This transparency would not only lead to more accountability in how they are built, it also ensures that changes can be made when necessary. Read next: A quick look at E.U.’s pending antitrust case against Google’s Android ...Software engineers need protection too One group haven't really been talked about when it comes to government regulation - the people actually building the software. This a big problem. If we're talking about the ethics of AI, software engineers building software are left in a vulnerable position. This is because the lines of accountability are blurred. Without a government framework that supports ethical software decision making, engineers are left in limbo. With more support for software engineers from government, they can be more confident in challenging decisions from their employers. We need to have a debate about who's responsible for the ethics of code that's written into applications today - is it the engineer? The product manager? Or the organization itself? That isn't going to be easy to answer, but some government regulation or guidance would be a good place to begin. Regulation can bridge the gap between entrepreneurs, engineers and lawmakers Times change. Years ago, technology was deployed by lawmakers as a means of control, production or exploration. That's why the military was involved with many of the innovations of the mid-twentieth century. Today, the gap couldn't be bigger. Lawmakers barely understand encryption, let alone how algorithms work. But there is also naivety in the business world too. With a little more political nous and even critical thinking, perhaps Mark Zuckerberg could have predicted the Cambridge Analytica scandal. Maybe Elon Musk would be a little more humble in the face of a coordinated rescue mission. There's clearly a problem - on the one hand, some people don't know what's already possible. For others, it's impossible to consider that something that is possible could have unintended consequences. By regulating technology, everyone will have to get to know one another. Government will need to delve deeper into the field, and entrepreneurs and engineers will need to learn more about how regulation may affect them. To some extent, this will have to be the first thing we do - develop a shared language. It might also be the hardest thing to do, too.

0
0
10133

article-image-openai-gym-environments-wrappers-and-monitors-tutorial

Packt Editorial Staff

17 Jul 2018

9 min read

Extending OpenAI Gym environments with Wrappers and Monitors [Tutorial]

Packt Editorial Staff

17 Jul 2018

9 min read

In this article we are going to discuss two OpenAI Gym functionalities; Wrappers and Monitors. These functionalities are present in OpenAI to make your life easier and your codes cleaner. It provides you these convenient frameworks to extend the functionality of your existing environment in a modular way and get familiar with an agent's activity. So, let's take a quick overview of these classes. This article is an extract taken from the book, Deep Reinforcement Learning Hands-On, Second Edition written by, Maxim Lapan. What are Wrappers? Very frequently, you will want to extend the environment's functionality in some generic way. For example, an environment gives you some observations, but you want to accumulate them in some buffer and provide to the agent the N last observations, which is a common scenario for dynamic computer games, when one single frame is just not enough to get full information about the game state. Another example is when you want to be able to crop or preprocess an image's pixels to make it more convenient for the agent to digest, or if you want to normalize reward scores somehow. There are many such situations which have the same structure: you'd like to “wrap” the existing environment and add some extra logic doing something. Gym provides you with a convenient framework for these situations, called a Wrapper class. How does a wrapper work? The class structure is shown on the following diagram. The Wrapper class inherits the Env class. Its constructor accepts the only argument: the instance of the Env class to be “wrapped”. To add extra functionality, you need to redefine the methods you want to extend like step() or reset(). The only requirement is to call the original method of the superclass. Figure 1: The hierarchy of Wrapper classes in Gym. To handle more specific requirements, like a Wrapper which wants to process only observations from the environment, or only actions, there are subclasses of Wrapper which allow filtering of only a specific portion of information. They are: ObservationWrapper: You need to redefine its observation(obs) method. Argument obs is an observation from the wrapped environment, and this method should return the observation which will be given to the agent. RewardWrapper: Exposes the method reward(rew), which could modify the reward value given to the agent. ActionWrapper: You need to override the method action(act) which could tweak the action passed to the wrapped environment to the agent. Now let’s implement some wrappers To make it slightly more practical, let's imagine a situation where we want to intervene in the stream of actions sent by the agent and, with a probability of 10%, replace the current action with random one. By issuing the random actions, we make our agent explore the environment and from time to time drift away from the beaten track of its policy. This is an easy thing to do using the ActionWrapper class. import gym from typing import TypeVar import random Action = TypeVar('Action') class RandomActionWrapper(gym.ActionWrapper): def __init__(self, env, epsilon=0.1): super(RandomActionWrapper, self).__init__(env) self.epsilon = epsilon Here we initialize our wrapper by calling a parent's __init__ method and saving epsilon (a probability of a random action). def action(self, action): if random.random() < self.epsilon: print("Random!") return self.env.action_space.sample() return action This is a method that we need to override from a parent's class to tweak the agent's actions. Every time we roll the die, with the probability of epsilon, we sample a random action from the action space and return it instead of the action the agent has sent to us. Please note, by using action_space and wrapper abstractions, we were able to write abstract code which will work with any environment from the Gym. Additionally, we print the message every time we replace the action, just to check that our wrapper is working. In production code, of course, this won't be necessary. if __name__ == "__main__": env = RandomActionWrapper(gym.make("CartPole-v0")) Now it's time to apply our wrapper. We create a normal CartPole environment and pass it to our wrapper constructor. From here on we use our wrapper as a normal Env instance, instead of the original CartPole. As the Wrapper class inherits the Env class and exposes the same interface, we can nest our wrappers in any combination we want. This is a powerful, elegant and generic solution: obs = env.reset() total_reward = 0.0 while True: obs, reward, done, _ = env.step(0) total_reward += reward if done: break print("Reward got: %.2f" % total_reward) Here is almost the same code, except that every time we issue the same action: 0. Our agent is dull and always does the same thing. By running the code, you should see that the wrapper is indeed working: rl_book_samples/ch02$ python 03_random_actionwrapper.py WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype. Random! Random! Random! Random! Reward got: 12.00 If you want, you can play with the epsilon parameter on the wrapper's creation and check that randomness improves the agent's score on average. We should move on and look at another interesting gem hidden inside Gym: Monitor. What is a Monitor? Another class you should be aware of is Monitor. It is implemented like Wrapper and can write information about your agent's performance in a file with optional video recording of your agent in action. Some time ago, it was possible to upload the result of Monitor class' recording to the https://gym.openai.com website and see your agent's position in comparison to other people's results (see thee following screenshot), but, unfortunately, at the end of August 2017, OpenAI decided to shut down this upload functionality and froze all the results. There are several activities to implement an alternative to the original website, but they are not ready yet. I hope this situation will be resolved soon, but at the time of writing it's not possible to check your result against those of others. Just to give you an idea of how the Gym web interface looked, here is the CartPole environment leaderboard: Figure 2: OpenAI Gym web interface with CartPole submissions Every submission in the web interface had details about training dynamics. For example, below is the author's solution for one of Doom's mini-games: Figure 3: Submission dynamics on the DoomDefendLine environment. Despite this, Monitor is still useful, as you can take a look at your agent's life inside the environment. How to add Monitor to your agent So, here is how we add Monitor to our random CartPole agent, which is the only difference (the whole code is in Chapter02/04_cartpole_random_monitor.py). if __name__ == "__main__": env = gym.make("CartPole-v0") env = gym.wrappers.Monitor(env, "recording") The second argument we're passing to Monitor is the name of the directory it will write the results to. This directory shouldn't exist, otherwise your program will fail with an exception (to overcome this, you could either remove the existing directory or pass the force=True argument to Monitor class' constructor). The Monitor class requires the FFmpeg utility to be present on the system, which is used to convert captured observations into an output video file. This utility must be available, otherwise Monitor will raise an exception. The easiest way to install FFmpeg is by using your system's package manager, which is OS distribution-specific. To start this example, one of three extra prerequisites should be met: The code should be run in an X11 session with the OpenGL extension (GLX) The code should be started in an Xvfb virtual display You can use X11 forwarding in ssh connection The cause of this is video recording, which is done by taking screenshots of the window drawn by the environment. Some of the environment uses OpenGL to draw its picture, so the graphical mode with OpenGL needs to be present. This could be a problem for a virtual machine in the cloud, which physically doesn't have a monitor and graphical interface running. To overcome this, there is a special “virtual” graphical display, called Xvfb (X11 virtual framebuffer), which basically starts a virtual graphical display on the server and forces the program to draw inside it. That would be enough to make Monitor happily create the desired videos. To start your program in the Xvbf environment, you need to have it installed on your machine (it usually requires installing the package xvfb) and run the special script xvfb-run: $ xvfb-run -s "-screen 0 640x480x24" python 04_cartpole_random_monitor.py [2017-09-22 12:22:23,446] Making new env: CartPole-v0 [2017-09-22 12:22:23,451] Creating monitor directory recording [2017-09-22 12:22:23,570] Starting new video recorder writing to recording/openaigym.video.0.31179.video000000.mp4 Episode done in 14 steps, total reward 14.00 [2017-09-22 12:22:26,290] Finished writing results. You can upload them to the scoreboard via gym.upload('recording') As you may see from the log above, video has been written successfully, so you can peek inside one of your agent's sections by playing it. Another way to record your agent's actions is using ssh X11 forwarding, which uses ssh ability to tunnel X11 communications between the X11 client (Python code which wants to display some graphical information) and X11 server (software which knows how to display this information and has access to your physical display). In X11 architecture, the client and the server are separated and can work on different machines. To use this approach, you need the following: X11 server running on your local machine. Linux comes with X11 server as a standard component (all desktop environments are using X11). On a Windows machine you can set up third-party X11 implementations like open source VcXsrv (available in https://sourceforge.net/projects/vcxsrv/). The ability to log into your remote machine via ssh, passing –X command line option: ssh –X servername. This enables X11 tunneling and allows all processes started in this session to use your local display for graphics output. Then you can start a program which uses Monitor class and it will display the agent's actions, capturing the images into a video file. To summarize, we discussed the two extra functionalities in an OpenAI Gym; Wrappers and Monitors. To solve complex real world problems in Deep Learning, grab this practical guide Deep Reinforcement Learning Hands-On, Second Edition today. How Reinforcement Learning works How to implement Reinforcement Learning with TensorFlow Top 5 tools for reinforcement learning

0
0
29839

article-image-the-software-behind-silicon-valley-emmy-nominated-not-hotdog-app

Sugandha Lahoti

16 Jul 2018

4 min read

The software behind Silicon Valley’s Emmy-nominated 'Not Hotdog' app

Sugandha Lahoti

16 Jul 2018

4 min read

This is a great news for all Silicon Valley Fans. The amazing Not Hotdog A.I. app shown on season 4’s 4th episode, has been nominated for a Primetime Emmy Award. The Emmys has placed Silicon Valley and the app in the category “Outstanding Creative Achievement In Interactive Media Within a Scripted Program” among other popular shows. Other nominations include 13 Reasons Why for “Talk To The Reasons”, a website that lets you chat with the characters. Rick and Morty, for “Virtual Rick-ality”, a virtual reality game. Mr. Robot, for "Ecoin", a fictional Global Digital Currency. And Westworld for “Chaos Takes Control Interactive Experience”, an online experience for promoting the show’s second season. Within a day of its launch, the ‘Not Hotdog’ application was trending on the App Store and on Twitter, grabbing the #1 spot on both Hacker News & Product Hunt, and won a Webby for Best Use of Machine Learning. The app uses state-of-the-art deep learning, with a mix of React Native, Tensorflow & Keras. It has averaged 99.9% crash-free users with a 4.5+/5 rating on the app stores. The ‘Not Hotdog’ app does what the name suggests. It identifies hotdogs — and not hot dogs. It is available for both Android and iOS devices whose description reads “What would you say if I told you there is an app on the market that tell you if you have a hotdog or not a hotdog. It is very good and I do not want to work on it any more. You can hire someone else.” How the Not Hotdog app is built The creator Tim Anglade uses sophisticated neural architecture for the Silicon Valley A.I. app that runs directly on your phone and trained it with Tensorflow, Keras & Nvidia GPUs. Of course, the use case is not very useful, but the overall app is a substantial example of deep learning and edge computing in pop culture. The app provides better privacy as images never leave a user’s device. Consequently, users are provided with a faster experience and offline availability as processing doesn’t go to the cloud. Using a no cloud-based AI approach means that the company can run the app at zero cost, providing significant savings, even under a load of millions of users. What is amazing about the app is that it was built by a single creator with limited resources ( a single laptop and GPU, using hand-curated data). This talks lengths of how much can be achieved even with a limited amount of time and resources, by non-technical companies, individual developers, and hobbyists alike. The initial prototype of the app was built using Google Cloud Platform’s Vision API, and React Native. React Native is a good choice as it supports many devices. The Google Cloud’s Vision API, however, was quickly abandoned. Instead, what was brought into the picture was Edge Computing. It enabled training the neural network directly on the laptop, to be exported and embedded directly into the mobile app, making the neural network execution phase run directly inside the user’s phone. How TensorFlow powers the Not Hotdog app After React Native, the second part of their tech stack was TensorFlow. They used the TensorFlow’s Transfer Learning script, to retrain the Inception architecture which helps in dealing with a more specific image problem. Transfer Learning helped them get better results much faster, and with less data compared to training from scratch. Inception turned out too big to be retrained. So, at the suggestion of Jeremy P. Howard, they explored and settled down on SqueezeNet. It provided explicit positioning as a solution for embedded deep learning, and the availability of a pre-trained Keras model on GitHub. The final architecture was largely based on Google’s MobileNets paper, which provided their neural architecture with Inception-like accuracy on simple problems, with only almost 4M parameters. YouTube has a $25 million plan to counter fake news and misinformation Microsoft’s Brad Smith calls for facial recognition technology to be regulated Too weird for Wall Street: Broadcom’s value drops after purchasing CA Technologies

0
0
4304

article-image-tensorflow-1-9-is-now-generally-available

Savia Lobo

11 Jul 2018

3 min read

Tensorflow 1.9 is now generally available

Savia Lobo

11 Jul 2018

3 min read

After the back-to-back release of Tensorflow 1.9 release candidates, rc-0, rc-1, and rc-2, the final version TensorFlow 1.9 is out and generally available. Key highlights of this version include support for gradient boosted trees estimators, new keras layers to speed up GRU and LSTM implementations and tfe.Network deprecation. It also includes improved functions for supporting data loading, text processing and pre-made estimators. Tensorflow 1.9 major features and improvements As mentioned in Tensorflow 1.9 rc-2, new Keras-based get started page and programmers guide page in the tf.Keras have been updated. The tf.Keras has been updated to Keras 2.1.6 API. One should try the newly added tf.keras.layers.CuDNNGRU, used for a faster GRU implementation and tf.keras.layers.CuDNNLSTM layers, which allows faster LSTM implementation. Both these layers are backed by cuDNN( NVIDIA CUDA Deep Neural Network library (cuDNN)). Gradient boosted trees estimators, a non-parametric statistical learning technique for classification and regression, are now supported by core feature columns and losses. Also, the python interface for the TFLite Optimizing Converter has been expanded, and the command line interface (AKA: toco, tflite_convert) is once again included in the standard pip installation. The distributions.Bijector API in the TF version 1.9 also supports broadcasting for Bijectors with the new API changes. Tensorflow 1.9 also includes improved data-loading and text processing with tf.decode_compressed, tf.string_strip, and Tf.strings.regex_full_match. It also has an added experimental support for new pre-made estimators like tf.contrib.estimator.BaselineEstimator, tf.contrib.estimator.RNNClassifier, tf.contrib.estimator.RNNEstimator. This version includes two breaking changes. Firstly for opening up empty variable scopes one can replace variable_scope('', ...) by variable_scope(tf.get_variable_scope(), ...), which is used to get the current scope of the variable. And the second breakthrough change is, headers used for building custom ops have been moved to a different file path. From site-packages/external to site-packages/tensorflow/include/external. Some bug fixes and other changes include: The tfe.Network has been deprecated Layered variable names have changed in the following conditions: Using tf.keras.layers with custom variable scopes. Using tf.layers in a subclassed tf.keras.Model class. Added the ability to pause recording operations for gradient computation via tf.GradientTape.stop_recording in the Eager execution and updated its documentation and introductory notebooks. Fixed an issue in which the TensorBoard Debugger Plugin, which could not handle total source file size exceeding gRPC message size limit (4 MB). Added GCS Configuration Ops and complex128 support to FFT, FFT2D, FFT3D, IFFT, IFFT2D, and IFFT3D. Conv3D, Conv3DBackpropInput, Conv3DBackpropFilter now supports arbitrary. Prevents tf.gradients() from backpropagating through integer tensors. LinearOperator[1D,2D,3D]Circulant added to tensorflow.linalg. To know more about the other changes, visit TensorFlow 1.9 release notes on GitHub. Create a TensorFlow LSTM that writes stories [Tutorial] Build and train an RNN chatbot using TensorFlow [Tutorial] Use TensorFlow and NLP to detect duplicate Quora questions [Tutorial]

0
0
4247

article-image-tensorflow-lstm-that-writes-stories-tutorial

Packt Editorial Staff

09 Jul 2018

18 min read

Create a TensorFlow LSTM that writes stories [Tutorial]

Packt Editorial Staff

09 Jul 2018

18 min read

0
2
5763

article-image-reinforcement-learning-mdp-markov-decision-process-tutorial

Fatema Patrawala

09 Jul 2018

11 min read

Implement Reinforcement learning using Markov Decision Process [Tutorial]

Fatema Patrawala

09 Jul 2018

11 min read

0
0
7160

article-image-optimize-hbase-for-the-cloud-tutorial

Natasha Mathur

06 Jul 2018

8 min read

How to optimize Hbase for the Cloud [Tutorial]

Natasha Mathur

06 Jul 2018

8 min read

Hadoop/Hbase was designed to crunch a huge amount of data in a batch mode and provide meaningful results to this data. This article is an excerpt taken from the book ‘HBase High Performance Cookbook’ written by Ruchir Choudhry. This book provides a solid understanding of the HBase basics. However, as the technology evolved over the years, the original architecture was fine tuned to move from the world of big-iron to the choice of cloud Infrastructure: It provides optimum pricing for the provisioning of new hardware, storage, and monitoring the infrastructure. One-click setup of additional nodes and storage. Elastic load-balancing to different clusters within the Hbase ecosystem. Ability to resize the cluster on-demand. Share capacity with different time-zones, for example, doing batch jobs in different data centers to and real-time analytics near to the customer. Easy integration with other Cloud-based services. HBase on Amazon EMR provides the ability to back up your HBase data directly to Amazon Simple Storage Service (Amazon S3). You can also restore from a previously created backup when launching an HBase cluster. Configuring Hbase for the Cloud Before we start, let's take a quick look at the supported versions and the prerequisites you need to move ahead. The list of supported versions is as below: Hbase Version - 0.94.18 & 0.94 AMI Versions - 3.1.0 and later & 3.0-3.04 AWS CLI configuration parameters - -ami-version 3.1 -ami-version 3.2 -ami-version 3.3 --applications Name=Hbase --ami-version 3.0 --application name=Hbase --ami-version 2.2 or later --applications Name=HBase Hbase Version details - Bug fixes Now let's look at the prerequisites: At least two instances (Optional): The cluster's master node runs the HBase master server and Zookeeper, and slave nodes run the HBase region servers. For optimum performance and production systems, HBase clusters should run on at least two EC2 instances, but you can run HBase on a single node for evaluation Purposes. Long-running clusters: HBase only runs on long-running clusters. By default, the CLI and Amazon EMR console create long-running clusters. An Amazon EC2 key pair set (Recommended): To use the Secure Shell (SSH) network protocol to connect with the master node and run HBase shell commands, you must use an Amazon EC2 key pair when you create the cluster. The correct AMI and Hadoop versions: HBase clusters are currently supported only on Hadoop 20.205 or later. The AWS CLI: This is needed to interact with Hbase using the command-line options. Use of Ganglia tool: For monitoring, it's advisable to use the Ganglia tool; this provides all information related to performance and can be installed as a client lib when we create the cluster. The logs for Hbase: They are available on the master node; it's a standard practice in a production environment to copy these logs to the Amazon S3 cluster. How to do it Open a browser and copy the following URL: (https://console.aws.amazon.com/elasticmapreduce/); if you don't have an Amazon AWS account, then you have to create it. Then choose Create cluster as shown in the following: Provide the cluster name; you must select Launch mode as cluster Let's proceed to the software configuration section. There are two options: Amazon template or MapR template. We are going to use Amazon template. It will load the default applications, which includes Hbase. Security is key when you are using ssh to the login to the cluster. Let's create a security key, by selecting NETWORK & SECURITY on the left section of the panel (as shown in the following). We have created as Hbase03: Once you create this security key, it will ask for a download of a .pem file , which is known as hbase03.pem. Copy this file to the user location and change the access to: chmod 400 <hbase03.pem> This will ensure the write level of access is there on the file and is not accessible Two-way. Now select this pair from the drop-down box in the EC2 Key pair; this will allow you to register the instance to the key while provisioning the instance. You can do this later too, but I had some challenges in doing this so it is always better to provision the instance with the property Now, you are ready to provision the EMR cluster. Go ahead and provision the cluster. It will take around 10 to 20 mins to have a cluster fully accessible and in a running Condition. Verify it by observing the console: How it works When you select the cluster name, it maps to your associate account internally and keeps the mapping the cluster is alive (or net destroyed). When you select an installation using the setting it loads all the respective JAR files, which allows it to perform in a fully distributed environment. You can select the EMR or MapR stable release, which allows us to load the compatible library, and hence focus on the solution rather than troubleshooting integration issues within the Hadoop/Hbase farms. Internally, all the slaves connects to the master, and hence, we considered an extra-large VM. Connecting to an Hbase cluster using the command line How to do it You can alternatively SSH to the node and see the details as follows: Once you have connected to the cluster, you can perform all the tasks which you can perform on local clusters. The preceding screenshot gives the details of the components we selected while installing the cluster. Let's connect to the Hbase shell to make sure all the components are connecting internally and we are able to create a sample table. How it works The communication between your machine and the Hbase cluster works by passing a key every time a command is executed; this allows the communication to be private. The shell becomes the remote shell that connects to the Hbase master via a private connection. All the base shell commands such as put, create, and scan get all the known Hbase commands. Backing up and restoring Hbase Amazon Elastic MapReduce provides multiple ways to back up and restore Hbase data to S3 cloud. It also allows us to do an incremental backup; during the backup process Hbase continues to execute the write commands, helping us to keep working while the backup process continues. There is a risk of having an inconsistency in the data. If consistency is of prime importance, then the write needs to be stopped during the initial backup process, synchronized across nodes. This can be achieved by passing the–consistent parameter when requesting a backup. When you back up HBase data, you should specify a different backup directory for each Cluster. An easy way to do this is to use the cluster identifier as part of the path specified for the backup directory. For example, s3://mybucket/backups/j-3AEXXXXXX16F2. This ensures that any future incremental backups reference the correct HBase cluster. How to do it When you are ready to delete old backup files that are no longer needed, we recommend that you first do a full backup of your HBase data. This ensures that all data is preserved and provides a baseline for future incremental backups. Once the full backup is done, you can navigate to the backup location and manually delete the old backup files: While creating a cluster, add an additional step scheduling regular backups, as shown in the following. You have to specify the location of the backup to which a backup file with the data will be kept, based on the backup frequency selected. For highly valuable data, you can have a backup on an hourly basis. For less sensitive data, it can be planned daily: It's a good practice to backup to a separate location in Amazon S3 to ensure that incremental backups are calculated correctly. It's important to specify the exact time from when the backup will be started, the time zone specified is UTC for our cluster. We can proceed with creating the cluster as planned; it will create a backup of the data to the location specified. You have to provide the exact location of the backup file and restore it. The version that is backed up needs to be specified and saved, which will allow the data to be restored. How it works During the backup process, Hbase continues to execute write commands; this ensures the cluster remains available throughout the backup process. Internally, the operation is done in parallel, thus there is a chance of it being inconsistent. If the use case requires consistency, then we have to pause the write to Hbase. This can be achieved by passing the consistent parameter while requesting a backup. This internally queues the writes and executes them as soon as the synchronization complete. We learned about configuration of Hbase for the cloud, connected Hbase cluster using the command line, and performed backup & restore of Hbase. If you found this post useful, do check out the book ‘HBase High Perforamnce Cookbook’ to learn other concepts such as terminating an HBase Cluster, accessing HBase data with hive, viewing HBase log files, etc. Understanding the HBase Ecosystem Configuring HBase 5 Mistake Developers make when working with HBase

0
0
2759

article-image-configuring-and-deploying-hbase-tutorial

Natasha Mathur

02 Jul 2018

21 min read

Configuring and deploying HBase [Tutorial]

Natasha Mathur

02 Jul 2018

21 min read

0
0
2956

article-image-administration-rights-for-power-bi-users

Pravin Dhandre

02 Jul 2018

8 min read

Administration rights for Power BI users

Pravin Dhandre

02 Jul 2018

8 min read

In this tutorial, you will understand and learn administration rights/rules for Power BI users. This includes setting and monitoring rules like; who in the organization can utilize which feature, how Power BI Premium capacity is allocated and by whom, and other settings such as embed codes and custom visuals. This article is an excerpt from a book written by Brett Powell titled Mastering Microsoft Power BI. The admin portal is accessible to Office 365 Global Administrators and users mapped to the Power BI service administrator role. To open the admin portal, log in to the Power BI service and select the Admin portal item from the Settings (Gear icon) menu in the top right, as shown in the following screenshot: All Power BI users, including Power BI free users, are able to access the Admin portal. However, users who are not admins can only view the Capacity settings page. The Power BI service administrators and Office 365 global administrators have view and edit access to the following seven pages: Administrators of Power BI most commonly utilize the Tenant settings and Capacity settings as described in the Tenant Settings and Power BI Premium Capacities sections later in this tutorial. However, the admin portal can also be used to manage any approved custom visuals for the organization. Usage metrics The Usage metrics page of the Admin portal provides admins with a Power BI dashboard of several top metrics, such as the most consumed dashboards and the most consumed dashboards by workspace. However, the dashboard cannot be modified and the tiles of the dashboard are not linked to any underlying reports or separate dashboards to support further analysis. Given these limitations, alternative monitoring solutions are recommended, such as the Office 365 audit logs and usage metric datasets specific to Power BI apps. Details of both monitoring options are included in the app usage metrics and Power BI audit log activities sections later in this chapter. Users and Audit logs The Users and Audit logs pages only provide links to the Office 365 admin center. In the admin center, Power BI users can be added, removed and managed. If audit logging is enabled for the organization via the Create audit logs for internal activity and auditing and compliance tenant setting, this audit log data can be retrieved from the Office 365 Security & Compliance Center or via PowerShell. This setting is noted in the following section regarding the Tenant settings tab of the Power BI admin portal. An Office 365 license is not required to utilize the Office 365 admin center for Power BI license assignments or to retrieve Power BI audit log activity. Tenant settings The Tenant settings page of the Admin portal allows administrators to enable or disable various features of the Power BI web service. Likewise, the administrator could allow only a certain security group to embed Power BI content in SaaS applications such as SharePoint Online. The following diagram identifies the 18 tenant settings currently available in the admin portal and the scope available to administrators for configuring each setting: From a data security perspective, the first seven settings within the Export and Sharing and Content packs and apps groups are most important. For example, many organizations choose to disable the Publish to web feature for the entire organization. Additionally, only certain security groups may be allowed to export data or to print hard copies of reports and dashboards. As shown in the Scope column of the previous table and the following example, granular security group configurations are available to minimize risk and manage the overall deployment. Currently, only one tenant setting is available for custom visuals and this setting (Custom visuals settings) can be enabled or disabled for the entire organization only. For organizations that wish to restrict or prohibit custom visuals for security reasons, this setting can be used to eliminate the ability to add, view, share, or interact with custom visuals. More granular controls to this setting are expected later in 2018, such as the ability to define users or security groups of users who are allowed to use custom visuals. In the following screenshot from the Tenant settings page of the Admin portal, only the users within the BI Admin security group who are not also members of the BI Team security group are allowed to publish apps to the entire organization: For example, a report author who also helps administer the On-premises data gateway via the BI Admin security group would be denied the ability to publish apps to the organization given membership in the BI Team security group. Many of the tenant setting configurations will be more simple than this example, particularly for smaller organizations or at the beginning of Power BI deployments. However, as adoption grows and the team responsible for Power BI changes, it's important that the security groups created to help administer these settings are kept up to date. Embed Codes Embed Codes are created and stored in the Power BI service when the Publish to web feature is utilized. As described in the Publish to web section of the previous chapter, this feature allows a Power BI report to be embedded in any website or shared via URL on the public internet. Users with edit rights to the workspace of the published to web content are able to manage the embed codes themselves from within the workspace. However, the admin portal provides visibility and access to embed codes across all workspaces, as shown in the following screenshot: Via the Actions commands on the far right of the Embed Codes page, a Power BI Admin can view the report in a browser (diagonal arrow) or remove the embed code. The Embed Codes page can be helpful to periodically monitor the usage of the Publish to web feature and for scenarios in which data was included in a publish to web report that shouldn't have been, and thus needs to be removed. As shown in the Power BI Tenant settings table referenced in the previous section, this feature can be enabled or disabled for the entire organization or for specific users within security groups. Organizational Custom visuals The Custom Visuals page allows admins to upload and manage custom visuals (.pbiviz files) that have been approved for use within the organization. For example, an organization may have proprietary custom visuals developed internally, which it wishes to expose to business users. Alternatively, the organization may wish to define a set of approved custom visuals, such as only the custom visuals that have been certified by Microsoft. In the following screenshot, the Chiclet Slicer custom visual is added as an organizational custom visual from the Organizational visuals page of the Power BI admin portal: The Organizational visuals page provides a link (Add a custom visual) to launch the form and identifies all uploaded visuals, as well as their last update. Once a visual has been uploaded, it can be deleted but not updated or modified. Therefore, when a new version of an organizational visual becomes available, this visual can be added to the list of organizational visuals with a descriptive title (Chiclet Slicer v2.0). Deleting an organizational custom visual will cause any reports that use this visual to stop rendering. The following screenshot reflects the uploaded Chiclet Slicer custom visual on the Organization visuals page: Once the custom visual has been uploaded as an organizational custom visual, it will be accessible to users in Power BI Desktop. In the following screenshot from Power BI Desktop, the user has opened the MARKETPLACE of custom visuals and selected MY ORGANIZATION: In this screenshot, rather than searching through the MARKETPLACE, the user can go directly to visuals defined by the organization. The marketplace of custom visuals can be launched via either the Visualizations pane or the From Marketplace icon on the Home tab of the ribbon. Organizational custom visuals are not supported for reports or dashboards shared with external users. Additionally, organizational custom visuals used in reports that utilize the publish to web feature will not render outside the Power BI tenant. Moreover, Organizational custom visuals are currently a preview feature. Therefore, users must enable the My organization custom visuals feature via the Preview features tab of the Options window in Power BI Desktop. With this, we got you acquainted with features and processes applicable in administering Power BI for an organization. This includes the configuration of tenant settings in the Power BI admin portal, analyzing the usage of Power BI assets, and monitoring overall user activity via the Office 365 audit logs. If you found this tutorial useful, do check out the book Mastering Microsoft Power BI to develop visually rich, immersive, and interactive Power BI reports and dashboards. Unlocking the secrets of Microsoft Power BI A tale of two tools: Tableau and Power BI Building a Microsoft Power BI Data Model

0
0
6025

article-image-build-and-train-rnn-chatbot-using-tensorflow

Sunith Shetty

28 Jun 2018

21 min read

Build and train an RNN chatbot using TensorFlow [Tutorial]

Sunith Shetty

28 Jun 2018

21 min read

Chatbots are increasingly used as a way to provide assistance to users. Many companies, including banks, mobile/landline companies and large e-sellers now use chatbots for customer assistance and for helping users in pre and post sales queries. They are a great tool for companies which don't need to provide additional customer service capacity for trivial questions: they really look like a win-win situation! In today’s tutorial, we will understand how to train an automatic chatbot that will be able to answer simple and generic questions, and how to create an endpoint over HTTP for providing the answers via an API. This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects. There are mainly two types of chatbot: the first is a simple one, which tries to understand the topic, always providing the same answer for all questions about the same topic. For example, on a train website, the questions Where can I find the timetable of the City_A to City_B service? and What's the next train departing from City_A? will likely get the same answer, that could read Hi! The timetable on our network is available on this page: <link>. This types of chatbots use classification algorithms to understand the topic (in the example, both questions are about the timetable topic). Given the topic, they always provide the same answer. Usually, they have a list of N topics and N answers; also, if the probability of the classified topic is low (the question is too vague, or it's on a topic not included in the list), they usually ask the user to be more specific and repeat the question, eventually pointing out other ways to do the question (send an email or call the customer service number, for example). The second type of chatbots is more advanced, smarter, but also more complex. For those, the answers are built using an RNN, in the same way, that machine translation is performed. Those chatbots are able to provide more personalized answers, and they may provide a more specific reply. In fact, they don't just guess the topic, but with an RNN engine, they're able to understand more about the user's questions and provide the best possible answer: in fact, it's very unlikely you'll get the same answers with two different questions using these types of chatbots. The input corpus Unfortunately, we haven't found any consumer-oriented dataset that is open source and freely available on the Internet. Therefore, we will train the chatbot with a more generic dataset, not really focused on customer service. Specifically, we will use the Cornell Movie Dialogs Corpus, from the Cornell University. The corpus contains the collection of conversations extracted from raw movie scripts, therefore the chatbot will be able to answer more to fictional questions than real ones. The Cornell corpus contains more than 200,000 conversational exchanges between 10+ thousands of movie characters, extracted from 617 movies. The dataset is available here: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. We would like to thank the authors for having released the corpus: that makes experimentation, reproducibility and knowledge sharing easier. The dataset comes as a .zip archive file. After decompressing it, you'll find several files in it: README.txt contains the description of the dataset, the format of the corpora files, the details on the collection procedure and the author's contact. Chameleons.pdf is the original paper for which the corpus has been released. Although the goal of the paper is strictly not around chatbots, it studies the language used in dialogues, and it's a good source of information to understanding more movie_conversations.txt contains all the dialogues structure. For each conversation, it includes the ID of the two characters involved in the discussion, the ID of the movie and the list of sentences IDs (or utterances, to be more precise) in chronological order. For example, the first line of the file is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'] That means that user u0 had a conversation with user u2 in the movie m0 and the conversation had 4 utterances: 'L194', 'L195', 'L196' and 'L197' movie_lines.txt contains the actual text of each utterance ID and the person who produced it. For example, the utterance L195 is listed here as: L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you. So, the text of the utterance L195 is Well, I thought we'd start with pronunciation, if that's okay with you. And it was pronounced by the character u2 whose name is CAMERON in the movie m0. movie_titles_metadata.txt contains information about the movies, including the title, year, IMDB rating, the number of votes in IMDB and the genres. For example, the movie m0 here is described as: m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance'] So, the title of the movie whose ID is m0 is 10 things i hate about you, it's from 1999, it's a comedy with romance and it received almost 63 thousand votes on IMDB with an average score of 6.9 (over 10.0) movie_characters_metadata.txt contains information about the movie characters, including the name the title of the movie where he/she appears, the gender (if known) and the position in the credits (if known). For example, the character “u2” appears in this file with this description: u2 +++$+++ CAMERON +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ m +++$+++ 3 The character u2 is named CAMERON, it appears in the movie m0 whose title is 10 things i hate about you, his gender is male and he's the third person appearing in the credits. raw_script_urls.txt contains the source URL where the dialogues of each movie can be retrieved. For example, for the movie m0 that's it: m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html As you will have noticed, most files use the token +++$+++ to separate the fields. Beyond that, the format looks pretty straightforward to parse. Please take particular care while parsing the files: their format is not UTF-8 but ISO-8859-1. Creating the training dataset Let's now create the training set for the chatbot. We'd need all the conversations between the characters in the correct order: fortunately, the corpora contains more than what we actually need. For creating the dataset, we will start by downloading the zip archive, if it's not already on disk. We'll then decompress the archive in a temporary folder (if you're using Windows, that should be C:Temp), and we will read just the movie_lines.txt and the movie_conversations.txt files, the ones we really need to create a dataset of consecutive utterances. Let's now go step by step, creating multiple functions, one for each step, in the file corpora_downloader.py. The first function we need is to retrieve the file from the Internet, if not available on disk. def download_and_decompress(url, storage_path, storage_dir): import os.path directory = storage_path + "/" + storage_dir zip_file = directory + ".zip" a_file = directory + "/cornell movie-dialogs corpus/README.txt" if not os.path.isfile(a_file): import urllib.request import zipfile urllib.request.urlretrieve(url, zip_file) with zipfile.ZipFile(zip_file, "r") as zfh: zfh.extractall(directory) return This function does exactly that: it checks whether the “README.txt” file is available locally; if not, it downloads the file (thanks for the urlretrieve function in the urllib.request module) and it decompresses the zip (using the zipfile module). The next step is to read the conversation file and extract the list of utterance IDS. As a reminder, its format is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'], therefore what we're looking for is the fourth element of the list after we split it on the token +++$+++ . Also, we'd need to clean up the square brackets and the apostrophes to have a clean list of IDs. For doing that, we shall import the re module, and the function will look like this. import re def read_conversations(storage_path, storage_dir): filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_conversations.txt" with open(filename, "r", encoding="ISO-8859-1") as fh: conversations_chunks = [line.split(" +++$+++ ") for line in fh] return [re.sub('[[]']', '', el[3].strip()).split(", ") for el in conversations_chunks] As previously said, remember to read the file with the right encoding, otherwise, you'll get an error. The output of this function is a list of lists, each of them containing the sequence of utterance IDS in a conversation between characters. Next step is to read and parse the movie_lines.txt file, to extract the actual utterances texts. As a reminder, the file looks like this line: L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you. Here, what we're looking for are the first and the last chunks. def read_lines(storage_path, storage_dir): filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_lines.txt" with open(filename, "r", encoding="ISO-8859-1") as fh: lines_chunks = [line.split(" +++$+++ ") for line in fh] return {line[0]: line[-1].strip() for line in lines_chunks} The very last bit is about tokenization and alignment. We'd like to have a set whose observations have two sequential utterances. In this way, we will train the chatbot, given the first utterance, to provide the next one. Hopefully, this will lead to a smart chatbot, able to reply to multiple questions. Here's the function: def get_tokenized_sequencial_sentences(list_of_lines, line_text): for line in list_of_lines: for i in range(len(line) - 1): yield (line_text[line[i]].split(" "), line_text[line[i+1]].split(" ")) Its output is a generator containing a tuple of the two utterances (the one on the right follows temporally the one on the left). Also, utterances are tokenized on the space character. Finally, we can wrap up everything into a function, which downloads the file and unzip it (if not cached), parse the conversations and the lines, and format the dataset as a generator. As a default, we will store the files in the /tmp directory: def retrieve_cornell_corpora(storage_path="/tmp", storage_dir="cornell_movie_dialogs_corpus"): download_and_decompress("http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip", storage_path, storage_dir) conversations = read_conversations(storage_path, storage_dir) lines = read_lines(storage_path, storage_dir) return tuple(zip(*list(get_tokenized_sequencial_sentences(conversations, lines)))) At this point, our training set looks very similar to the training set used in the translation project. We can, therefore, use some pieces of code we've developed in the machine learning translation article. For example, the corpora_tools.py file can be used here without any change (also, it requires the data_utils.py). Given that file, we can dig more into the corpora, with a script to check the chatbot input. To inspect the corpora, we can use the corpora_tools.py, and the file we've previously created. Let's retrieve the Cornell Movie Dialog Corpus, format the corpora and print an example and its length: from corpora_tools import * from corpora_downloader import retrieve_cornell_corpora sen_l1, sen_l2 = retrieve_cornell_corpora() print("# Two consecutive sentences in a conversation") print("Q:", sen_l1[0]) print("A:", sen_l2[0]) print("# Corpora length (i.e. number of sentences)") print(len(sen_l1)) assert len(sen_l1) == len(sen_l2) This code prints an example of two tokenized consecutive utterances, and the number of examples in the dataset, that is more than 220,000: # Two consecutive sentences in a conversation Q: ['Can', 'we', 'make', 'this', 'quick?', '', 'Roxanne', 'Korrine', 'and', 'Andrew', 'Barrett', 'are', 'having', 'an', 'incredibly', 'horrendous', 'public', 'break-', 'up', 'on', 'the', 'quad.', '', 'Again.'] A: ['Well,', 'I', 'thought', "we'd", 'start', 'with', 'pronunciation,', 'if', "that's", 'okay', 'with', 'you.'] # Corpora length (i.e. number of sentences) 221616 Let's now clean the punctuation in the sentences, lowercase them and limits their size to 20 words maximum (that is examples where at least one of the sentences is longer than 20 words are discarded). This is needed to standardize the tokens: clean_sen_l1 = [clean_sentence(s) for s in sen_l1] clean_sen_l2 = [clean_sentence(s) for s in sen_l2] filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2) print("# Filtered Corpora length (i.e. number of sentences)") print(len(filt_clean_sen_l1)) assert len(filt_clean_sen_l1) == len(filt_clean_sen_l2) This leads us to almost 140,000 examples: # Filtered Corpora length (i.e. number of sentences) 140261 Then, let's create the dictionaries for the two sets of sentences. Practically, they should look the same (since the same sentence appears once on the left side, and once in the right side) except there might be some changes introduced by the first and last sentences of a conversation (they appear only once). To make the best out of our corpora, let's build two dictionaries of words and then encode all the words in the corpora with their dictionary indexes: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=15000, storage_path="/tmp/l1_dict.p") dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=15000, storage_path="/tmp/l2_dict.p") idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) print("# Same sentences as before, with their dictionary ID") print("Q:", list(zip(filt_clean_sen_l1[0], idx_sentences_l1[0]))) print("A:", list(zip(filt_clean_sen_l2[0], idx_sentences_l2[0]))) That prints the following output. We also notice that a dictionary of 15 thousand entries doesn't contain all the words and more than 16 thousand (less popular) of them don't fit into it: [sentences_to_indexes] Did not find 16823 words [sentences_to_indexes] Did not find 16649 words # Same sentences as before, with their dictionary ID Q: [('well', 68), (',', 8), ('i', 9), ('thought', 141), ('we', 23), ("'", 5), ('d', 83), ('start', 370), ('with', 46), ('pronunciation', 3), (',', 8), ('if', 78), ('that', 18), ("'", 5), ('s', 12), ('okay', 92), ('with', 46), ('you', 7), ('.', 4)] A: [('not', 31), ('the', 10), ('hacking', 7309), ('and', 23), ('gagging', 8761), ('and', 23), ('spitting', 6354), ('part', 437), ('.', 4), ('please', 145), ('.', 4)] As the final step, let's add paddings and markings to the sentences: data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) print("# Prepared minibatch with paddings and extra stuff") print("Q:", data_set[0][0]) print("A:", data_set[0][1]) print("# The sentence pass from X to Y tokens") print("Q:", len(idx_sentences_l1[0]), "->", len(data_set[0][0])) print("A:", len(idx_sentences_l2[0]), "->", len(data_set[0][1])) And that, as expected, prints: # Prepared minibatch with paddings and extra stuff Q: [0, 68, 8, 9, 141, 23, 5, 83, 370, 46, 3, 8, 78, 18, 5, 12, 92, 46, 7, 4] A: [1, 31, 10, 7309, 23, 8761, 23, 6354, 437, 4, 145, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0] # The sentence pass from X to Y tokens Q: 19 -> 20 A: 11 -> 22 Training the chatbot After we're done with the corpora, it's now time to work on the model. This project requires again a sequence to sequence model, therefore we can use an RNN. Even more, we can reuse part of the code from the previous project: we'd just need to change how the dataset is built, and the parameters of the model. We can then copy the training script, and modify the build_dataset function, to use the Cornell dataset. Mind that the dataset used in this article is bigger than the one used in the machine learning translation article, therefore you may need to limit the corpora to a few dozen thousand lines. On a 4 years old laptop with 8GB RAM, we had to select only the first 30 thousand lines, otherwise, the program ran out of memory and kept swapping. As a side effect of having fewer examples, even the dictionaries are smaller, resulting in less than 10 thousands words each. def build_dataset(use_stored_dictionary=False): sen_l1, sen_l2 = retrieve_cornell_corpora() clean_sen_l1 = [clean_sentence(s) for s in sen_l1][:30000] ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP clean_sen_l2 = [clean_sentence(s) for s in sen_l2][:30000] ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2, max_len=10) if not use_stored_dictionary: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=10000, storage_path=path_l1_dict) dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=10000, storage_path=path_l2_dict) else: dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2_length = len(dict_l2) idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) max_length_l1 = extract_max_length(idx_sentences_l1) max_length_l2 = extract_max_length(idx_sentences_l2) data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) return (filt_clean_sen_l1, filt_clean_sen_l2), data_set, (max_length_l1, max_length_l2), (dict_l1_length, dict_l2_length) By inserting this function into the train_translator.py file and rename the file as train_chatbot.py, we can run the training of the chatbot. After a few iterations, you can stop the program and you'll see something similar to this output: [sentences_to_indexes] Did not find 0 words [sentences_to_indexes] Did not find 0 words global step 100 learning rate 1.0 step-time 7.708967611789704 perplexity 444.90090078460474 eval: perplexity 57.442316329639176 global step 200 learning rate 0.990234375 step-time 7.700247814655302 perplexity 48.8545568311572 eval: perplexity 42.190180314697045 global step 300 learning rate 0.98046875 step-time 7.69800933599472 perplexity 41.620538109894945 eval: perplexity 31.291903031786116 ... ... ... global step 2400 learning rate 0.79833984375 step-time 7.686293318271639 perplexity 3.7086356605442767 eval: perplexity 2.8348589631663046 global step 2500 learning rate 0.79052734375 step-time 7.689657487869262 perplexity 3.211876894960698 eval: perplexity 2.973809378544393 global step 2600 learning rate 0.78271484375 step-time 7.690396382808681 perplexity 2.878854805600354 eval: perplexity 2.563583924617356 Again, if you change the settings, you may end up with a different perplexity. To obtain these results, we set the RNN size to 256 and 2 layers, the batch size of 128 samples, and the learning rate to 1.0. At this point, the chatbot is ready to be tested. Although you can test the chatbot with the same code as in the test_translator.py, here we would like to do a more elaborate solution, which allows exposing the chatbot as a service with APIs. Chatbox API First of all, we need a web framework to expose the API. In this project, we've chosen Bottle, a lightweight simple framework very easy to use. To install the package, run pip install bottle from the command line. To gather further information and dig into the code, take a look at the project webpage, https://bottlepy.org. Let's now create a function to parse an arbitrary sentence provided by the user as an argument. All the following code should live in the test_chatbot_aas.py file. Let's start with some imports and the function to clean, tokenize and prepare the sentence using the dictionary: import pickle import sys import numpy as np import tensorflow as tf import data_utils from corpora_tools import clean_sentence, sentences_to_indexes, prepare_sentences from train_chatbot import get_seq2seq_model, path_l1_dict, path_l2_dict model_dir = "/home/abc/chat/chatbot_model" def prepare_sentence(sentence, dict_l1, max_length): sents = [sentence.split(" ")] clean_sen_l1 = [clean_sentence(s) for s in sents] idx_sentences_l1 = sentences_to_indexes(clean_sen_l1, dict_l1) data_set = prepare_sentences(idx_sentences_l1, [[]], max_length, max_length) sentences = (clean_sen_l1, [[]]) return sentences, data_set The function prepare_sentence does the following: Tokenizes the input sentence Cleans it (lowercase and punctuation cleanup) Converts tokens to dictionary IDs Add markers and paddings to reach the default length Next, we will need a function to convert the predicted sequence of numbers to an actual sentence composed of words. This is done by the function decode, which runs the prediction given the input sentence and with softmax predicts the most likely output. Finally, it returns the sentence without paddings and markers: def decode(data_set): with tf.Session() as sess: model = get_seq2seq_model(sess, True, dict_lengths, max_sentence_lengths, model_dir) model.batch_size = 1 bucket = 0 encoder_inputs, decoder_inputs, target_weights = model.get_batch( {bucket: [(data_set[0][0], [])]}, bucket) _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket, True) outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits] if data_utils.EOS_ID in outputs: outputs = outputs[1:outputs.index(data_utils.EOS_ID)] tf.reset_default_graph() return " ".join([tf.compat.as_str(inv_dict_l2[output]) for output in outputs]) Finally, the main function, that is, the function to run in the script: if __name__ == "__main__": dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l2_length = len(dict_l2) inv_dict_l2 = {v: k for k, v in dict_l2.items()} max_lengths = 10 dict_lengths = (dict_l1_length, dict_l2_length) max_sentence_lengths = (max_lengths, max_lengths) from bottle import route, run, request @route('/api') def api(): in_sentence = request.query.sentence _, data_set = prepare_sentence(in_sentence, dict_l1, max_lengths) resp = [{"in": in_sentence, "out": decode(data_set)}] return dict(data=resp) run(host='127.0.0.1', port=8080, reloader=True, debug=True) Initially, it loads the dictionary and prepares the inverse dictionary. Then, it uses the Bottle API to create an HTTP GET endpoint (under the /api URL). The route decorator sets and enriches the function to run when the endpoint is contacted via HTTP GET. In this case, the api() function is run, which first reads the sentence passed as HTTP parameter, then calls the prepare_sentence function, described above, and finally runs the decoding step. What's returned is a dictionary containing both the input sentence provided by the user and the reply of the chatbot. Finally, the webserver is turned on, on the localhost at port 8080. Isn't very easy to have a chatbot as a service with Bottle? It's now time to run it and check the outputs. To run it, run from the command line: $> python3 –u test_chatbot_aas.py Then, let's start querying the chatbot with some generic questions, to do so we can use CURL, a simple command line; also all the browsers are ok, just remember that the URL should be encoded, for example, the space character should be replaced with its encoding, that is, %20. Curl makes things easier, having a simple way to encode the URL request. Here are a couple of examples: $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?" {"data": [{"out": "i ' m here with you .", "in": "where are you?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you here?" {"data": [{"out": "yes .", "in": "are you here?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you a chatbot?" {"data": [{"out": "you ' for the stuff to be right .", "in": "are you a chatbot?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=what is your name ?" {"data": [{"out": "we don ' t know .", "in": "what is your name ?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?" {"data": [{"out": "that ' s okay .", "in": "how are you?"}]} If the system doesn't work with your browser, try encoding the URL, for example: $> curl -X GET http://127.0.0.1:8080/api?sentence=how%20are%20you? {"data": [{"out": "that ' s okay .", "in": "how are you?"}]}. Replies are quite funny; always remember that we trained the chatbox on movies, therefore the type of replies follow that style. To turn off the webserver, use Ctrl + C. To summarize, we've learned to implement a chatbot, which is able to respond to questions through an HTTP endpoint and a GET API. To know more how to design deep learning systems for a variety of real-world scenarios using TensorFlow, do checkout this book TensorFlow Deep Learning Projects. Facebook’s Wit.ai: Why we need yet another chatbot development framework? How to build a chatbot with Microsoft Bot framework Top 4 chatbot development frameworks for developers

0
1
20477

article-image-interact-with-hbase-using-hbase-shell-tutorial

Amey Varangaonkar

25 Jun 2018

9 min read

How to interact with HBase using HBase shell [Tutorial]

Amey Varangaonkar

25 Jun 2018

9 min read

0
0
4874

article-image-use-tensorflow-and-nlp-to-detect-duplicate-quora-questions-tutorial

Sunith Shetty

21 Jun 2018

32 min read

Use TensorFlow and NLP to detect duplicate Quora questions [Tutorial]

Sunith Shetty

21 Jun 2018

32 min read

0
1
15964

article-image-splunk-how-to-work-with-multiple-indexes-tutorial

Pravin Dhandre

20 Jun 2018

12 min read

Splunk: How to work with multiple indexes [Tutorial]

Pravin Dhandre

20 Jun 2018

12 min read

An index in Splunk is a storage pool for events, capped by size and time. By default, all events will go to the index specified by defaultDatabase, which is called main but lives in a directory called defaultdb. In this tutorial, we put focus to index structures, need of multiple indexes, how to size an index and how to manage multiple indexes in a Splunk environment. This article is an excerpt from a book written by James D. Miller titled Implementing Splunk 7 - Third Edition. Directory structure of an index Each index occupies a set of directories on the disk. By default, these directories live in $SPLUNK_DB, which, by default, is located in $SPLUNK_HOME/var/lib/splunk. Look at the following stanza for the main index: [main] homePath = $SPLUNK_DB/defaultdb/db coldPath = $SPLUNK_DB/defaultdb/colddb thawedPath = $SPLUNK_DB/defaultdb/thaweddb maxHotIdleSecs = 86400 maxHotBuckets = 10 maxDataSize = auto_high_volume If our Splunk installation lives at /opt/splunk, the index main is rooted at the path /opt/splunk/var/lib/splunk/defaultdb. To change your storage location, either modify the value of SPLUNK_DB in $SPLUNK_HOME/etc/splunk-launch.conf or set absolute paths in indexes.conf. splunk-launch.conf cannot be controlled from an app, which means it is easy to forget when adding indexers. For this reason, and for legibility, I would recommend using absolute paths in indexes.conf. The homePath directories contain index-level metadata, hot buckets, and warm buckets. coldPath contains cold buckets, which are simply warm buckets that have aged out. See the upcoming sections The lifecycle of a bucket and Sizing an index for details. When to create more indexes There are several reasons for creating additional indexes. If your needs do not meet one of these requirements, there is no need to create more indexes. In fact, multiple indexes may actually hurt performance if a single query needs to open multiple indexes. Testing data If you do not have a test environment, you can use test indexes for staging new data. This then allows you to easily recover from mistakes by dropping the test index. Since Splunk will run on a desktop, it is probably best to test new configurations locally, if possible. Differing longevity It may be the case that you need more history for some source types than others. The classic example here is security logs, as compared to web access logs. You may need to keep security logs for a year or more, but need the web access logs for only a couple of weeks. If these two source types are left in the same index, security events will be stored in the same buckets as web access logs and will age out together. To split these events up, you need to perform the following steps: Create a new index called security, for instance Define different settings for the security index Update inputs.conf to use the new index for security source types For one year, you might make an indexes.conf setting such as this: [security] homePath = $SPLUNK_DB/security/db coldPath = $SPLUNK_DB/security/colddb thawedPath = $SPLUNK_DB/security/thaweddb #one year in seconds frozenTimePeriodInSecs = 31536000 For extra protection, you should also set maxTotalDataSizeMB, and possibly coldToFrozenDir. If you have multiple indexes that should age together, or if you will split homePath and coldPath across devices, you should use volumes. See the upcoming section, Using volumes to manage multiple indexes, for more information. Then, in inputs.conf, you simply need to add an index to the appropriate stanza as follows: [monitor:///path/to/security/logs/logins.log] sourcetype=logins index=security Differing permissions If some data should only be seen by a specific set of users, the most effective way to limit access is to place this data in a different index, and then limit access to that index by using a role. The steps to accomplish this are essentially as follows: Define the new index. Configure inputs.conf or transforms.conf to send these events to the new index. Ensure that the user role does not have access to the new index. Create a new role that has access to the new index. Add specific users to this new role. If you are using LDAP authentication, you will need to map the role to an LDAP group and add users to that LDAP group. To route very specific events to this new index, assuming you created an index called sensitive, you can create a transform as follows: [contains_password] REGEX = (?i)password[=:] DEST_KEY = _MetaData:Index FORMAT = sensitive You would then wire this transform to a particular sourcetype or source index in props.conf. Using more indexes to increase performance Placing different source types in different indexes can help increase performance if those source types are not queried together. The disks will spend less time seeking when accessing the source type in question. If you have access to multiple storage devices, placing indexes on different devices can help increase the performance even more by taking advantage of different hardware for different queries. Likewise, placing homePath and coldPath on different devices can help performance. However, if you regularly run queries that use multiple source types, splitting those source types across indexes may actually hurt performance. For example, let's imagine you have two source types called web_access and web_error. We have the following line in web_access: 2012-10-19 12:53:20 code=500 session=abcdefg url=/path/to/app And we have the following line in web_error: 2012-10-19 12:53:20 session=abcdefg class=LoginClass If we want to combine these results, we could run a query like the following: (sourcetype=web_access code=500) OR sourcetype=web_error | transaction maxspan=2s session | top url class If web_access and web_error are stored in different indexes, this query will need to access twice as many buckets and will essentially take twice as long. The life cycle of a bucket An index is made up of buckets, which go through a specific life cycle. Each bucket contains events from a particular period of time. The stages of this lifecycle are hot, warm, cold, frozen, and thawed. The only practical difference between hot and other buckets is that a hot bucket is being written to, and has not necessarily been optimized. These stages live in different places on the disk and are controlled by different settings in indexes.conf: homePath contains as many hot buckets as the integer value of maxHotBuckets, and as many warm buckets as the integer value of maxWarmDBCount. When a hot bucket rolls, it becomes a warm bucket. When there are too many warm buckets, the oldest warm bucket becomes a cold bucket. Do not set maxHotBuckets too low. If your data is not parsing perfectly, dates that parse incorrectly will produce buckets with very large time spans. As more buckets are created, these buckets will overlap, which means all buckets will have to be queried every time, and performance will suffer dramatically. A value of five or more is safe. coldPath contains cold buckets, which are warm buckets that have rolled out of homePath once there are more warm buckets than the value of maxWarmDBCount. If coldPath is on the same device, only a move is required; otherwise, a copy is required. Once the values of frozenTimePeriodInSecs, maxTotalDataSizeMB, or maxVolumeDataSizeMB are reached, the oldest bucket will be frozen. By default, frozen means deleted. You can change this behavior by specifying either of the following: coldToFrozenDir: This lets you specify a location to move the buckets once they have aged out. The index files will be deleted, and only the compressed raw data will be kept. This essentially cuts the disk usage by half. This location is unmanaged, so it is up to you to watch your disk usage. coldToFrozenScript: This lets you specify a script to perform some action when the bucket is frozen. The script is handed the path to the bucket that is about to be frozen. thawedPath can contain buckets that have been restored. These buckets are not managed by Splunk and are not included in all time searches. To search these buckets, their time range must be included explicitly in your search. I have never actually used this directory. Search https://splunk.com for restore archived to learn the procedures. Sizing an index To estimate how much disk space is needed for an index, use the following formula: (gigabytes per day) * .5 * (days of retention desired) Likewise, to determine how many days you can store an index, the formula is essentially: (device size in gigabytes) / ( (gigabytes per day) * .5 ) The .5 represents a conservative compression ratio. The log data itself is usually compressed to 10 percent of its original size. The index files necessary to speed up search brings the size of a bucket closer to 50 percent of the original size, though it is usually smaller than this. If you plan to split your buckets across devices, the math gets more complicated unless you use volumes. Without using volumes, the math is as follows: homePath = (maxWarmDBCount + maxHotBuckets) * maxDataSize coldPath = maxTotalDataSizeMB - homePath For example, say we are given these settings: [myindex] homePath = /splunkdata_home/myindex/db coldPath = /splunkdata_cold/myindex/colddb thawedPath = /splunkdata_cold/myindex/thaweddb maxWarmDBCount = 50 maxHotBuckets = 6 maxDataSize = auto_high_volume #10GB on 64-bit systems maxTotalDataSizeMB = 2000000 Filling in the preceding formula, we get these values: homePath = (50 warm + 6 hot) * 10240 MB = 573440 MB coldPath = 2000000 MB - homePath = 1426560 MB If we use volumes, this gets simpler and we can simply set the volume sizes to our available space and let Splunk do the math. Using volumes to manage multiple indexes Volumes combine pools of storage across different indexes so that they age out together. Let's make up a scenario where we have five indexes and three storage devices. The indexes are as follows: Name Data per day Retention required Storage needed web 50 GB no requirement ? security 1 GB 2 years 730 GB * 50 percent app 10 GB no requirement ? chat 2 GB 2 years 1,460 GB * 50 percent web_summary 1 GB 1 years 365 GB * 50 percent Now let's say we have three storage devices to work with, mentioned in the following table: Name Size small_fast 500 GB big_fast 1,000 GB big_slow 5,000 GB We can create volumes based on the retention time needed. Security and chat share the same retention requirements, so we can place them in the same volumes. We want our hot buckets on our fast devices, so let's start there with the following configuration: [volume:two_year_home] #security and chat home storage path = /small_fast/two_year_home maxVolumeDataSizeMB = 300000 [volume:one_year_home] #web_summary home storage path = /small_fast/one_year_home maxVolumeDataSizeMB = 150000 For the rest of the space needed by these indexes, we will create companion volume definitions on big_slow, as follows: [volume:two_year_cold] #security and chat cold storage path = /big_slow/two_year_cold maxVolumeDataSizeMB = 850000 #([security]+[chat])*1024 - 300000 [volume:one_year_cold] #web_summary cold storage path = /big_slow/one_year_cold maxVolumeDataSizeMB = 230000 #[web_summary]*1024 - 150000 Now for our remaining indexes, whose timeframe is not important, we will use big_fast and the remainder of big_slow, like so: [volume:large_home] #web and app home storage path = /big_fast/large_home maxVolumeDataSizeMB = 900000 #leaving 10% for pad [volume:large_cold] #web and app cold storage path = /big_slow/large_cold maxVolumeDataSizeMB = 3700000 #(big_slow - two_year_cold - one_year_cold)*.9 Given that the sum of large_home and large_cold is 4,600,000 MB, and a combined daily volume of web and app is 60,000 MB approximately, we should retain approximately 153 days of web and app logs with 50 percent compression. In reality, the number of days retained will probably be larger. With our volumes defined, we now have to reference them in our index definitions: [web] homePath = volume:large_home/web coldPath = volume:large_cold/web thawedPath = /big_slow/thawed/web [security] homePath = volume:two_year_home/security coldPath = volume:two_year_cold/security thawedPath = /big_slow/thawed/security coldToFrozenDir = /big_slow/frozen/security [app] homePath = volume:large_home/app coldPath = volume:large_cold/app thawedPath = /big_slow/thawed/app [chat] homePath = volume:two_year_home/chat coldPath = volume:two_year_cold/chat thawedPath = /big_slow/thawed/chat coldToFrozenDir = /big_slow/frozen/chat [web_summary] homePath = volume:one_year_home/web_summary coldPath = volume:one_year_cold/web_summary thawedPath = /big_slow/thawed/web_summary thawedPath cannot be defined using a volume and must be specified for Splunk to start. For extra protection, we specified coldToFrozenDir for the indexes' security and chat. The buckets for these indexes will be copied to this directory before deletion, but it is up to us to make sure that the disk does not fill up. If we allow the disk to fill up, Splunk will stop indexing until space is made available. This is just one approach to using volumes. You could overlap in any way that makes sense to you, as long as you understand that the oldest bucket in a volume will be frozen first, no matter what index put the bucket in that volume. With this, we learned to operate multiple indexes and how we can get effective business intelligence out of the data without hurting system performance. If you found this tutorial useful, do check out the book Implementing Splunk 7 - Third Edition and start creating advanced Splunk dashboards. Splunk leverages AI in its monitoring tools Splunk’s Input Methods and Data Feeds Splunk Industrial Asset Intelligence (Splunk IAI) targets Industrial IoT marketplace

0
0
9217

How-To Tutorials - Data

Google AI releases Cirq and Open Fermion-Cirq to boost Quantum computation

How to set up an Ethereum development environment [Tutorial]

5 reasons government should regulate technology

Extending OpenAI Gym environments with Wrappers and Monitors [Tutorial]

The software behind Silicon Valley’s Emmy-nominated 'Not Hotdog' app

Tensorflow 1.9 is now generally available

Create a TensorFlow LSTM that writes stories [Tutorial]

Implement Reinforcement learning using Markov Decision Process [Tutorial]

How to optimize Hbase for the Cloud [Tutorial]

Configuring and deploying HBase [Tutorial]

Trending Topics

Administration rights for Power BI users

Build and train an RNN chatbot using TensorFlow [Tutorial]

How to interact with HBase using HBase shell [Tutorial]

Use TensorFlow and NLP to detect duplicate Quora questions [Tutorial]

Splunk: How to work with multiple indexes [Tutorial]