
How-To Tutorials - Data

1204 Articles

Handling backup and recovery in PostgreSQL 10 [Tutorial]

Savia Lobo
18 Jun 2018
11 min read
Performing backups should be a regular task, and every administrator is supposed to keep an eye on this vital job. Fortunately, PostgreSQL provides an easy means to create backups. In this tutorial, you will learn how to back up data by performing simple dumps and how to recover it using PostgreSQL 10. This article is an excerpt taken from 'Mastering PostgreSQL 10', written by Hans-Jürgen Schönig. The book highlights the features newly introduced in PostgreSQL 10 and shows you how to build better PostgreSQL applications.

Performing simple dumps

If you are running a PostgreSQL setup, there are two major methods to perform backups:

- Logical dumps (extracting an SQL script that represents your data)
- Transaction log shipping

The idea behind transaction log shipping is to archive binary changes made to the database. Most people claim that transaction log shipping is the only real way to do backups. However, in my opinion, this is not necessarily true. Many people rely on pg_dump to simply extract a textual representation of the data. pg_dump is also the oldest method of creating a backup and has been around since the very early days of the project (transaction log shipping was added much later). Every PostgreSQL administrator will become familiar with pg_dump sooner or later, so it is important to know how it really works and what it does.

Running pg_dump

The first thing we want to do is create a simple textual dump:

    [hs@linuxpc ~]$ pg_dump test > /tmp/dump.sql

This is the most simplistic backup you can imagine. pg_dump logs into the local database instance, connects to the database test, and starts to extract all the data, which is sent to stdout and redirected to the file. The beauty is that standard output gives you all the flexibility of a Unix system. You can easily compress the data using a pipe or do whatever you want.

In some cases, you might want to run pg_dump as a different user. All PostgreSQL client programs support a consistent set of command-line parameters to pass user information. If you just want to set the user, use the -U flag:

    [hs@linuxpc ~]$ pg_dump -U whatever_powerful_user test > /tmp/dump.sql

The following set of parameters can be found in all PostgreSQL client programs:

    Connection options:
      -d, --dbname=DBNAME      database to dump
      -h, --host=HOSTNAME      database server host or socket directory
      -p, --port=PORT          database server port number
      -U, --username=NAME      connect as specified database user
      -w, --no-password        never prompt for password
      -W, --password           force password prompt (should happen automatically)
      --role=ROLENAME          do SET ROLE before dump

Just pass the information you want to pg_dump, and if you have enough permissions, PostgreSQL will fetch the data. The important thing here is to see how the program really works. Basically, pg_dump connects to the database and opens a large repeatable read transaction that simply reads all the data. Remember, repeatable read ensures that PostgreSQL creates a consistent snapshot of the data, which does not change throughout the transaction. In other words, a dump is always consistent: no foreign keys will be violated. The output is a snapshot of the data as it was when the dump started. Consistency is a key factor here. It also implies that changes made to the data while the dump is running won't make it into the backup anymore. A dump simply reads everything, so there are no separate permissions needed to be able to dump something. As long as you can read it, you can back it up.
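Since the dump is plain text on stdout, compression is just a pipe away. A minimal sketch, assuming gzip is available and the database test from above; the restored database name test_restored is hypothetical:

    # Compress the dump on the fly
    pg_dump test | gzip > /tmp/dump.sql.gz

    # Replay it later into a freshly created database
    createdb test_restored
    gunzip -c /tmp/dump.sql.gz | psql test_restored

Any filter that reads stdin works the same way, which is exactly the flexibility of standard output referred to above.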
Also, note that the backup is in a textual format by default. This means that you can safely extract data from, say, Solaris and move it to some other CPU architecture. In the case of binary copies, that is clearly not possible, as the on-disk format depends on your CPU architecture.

Passing passwords and connection information

If you take a close look at the connection parameters shown in the previous section, you will notice that there is no way to pass a password to pg_dump. You can enforce a password prompt, but you cannot pass the parameter to pg_dump using a command-line option. The reason is simple: the password might show up in the process table and be visible to other people, so this is not supported. The question now is: if pg_hba.conf on the server enforces a password, how can the client program provide it? There are various means of doing so:

- Making use of environment variables
- Making use of .pgpass
- Using service files

In this section, you will learn about all three methods.

Using environment variables

One way to pass all kinds of parameters is to use environment variables. If the information is not explicitly passed to pg_dump, it will look for the missing information in predefined environment variables. A list of all potential settings can be found at https://www.postgresql.org/docs/10/static/libpq-envars.html. The following overview shows some environment variables commonly needed for backups:

- PGHOST: tells the system which host to connect to
- PGPORT: defines the TCP port to be used
- PGUSER: tells a client program about the desired user
- PGPASSWORD: contains the password to be used
- PGDATABASE: the name of the database to connect to

The advantage of these environment variables is that the password won't show up in the process table. However, there is more. Consider the following example:

    psql -U ... -h ... -p ... -d ...

Suppose you are a system administrator: do you really want to type a long line like that a couple of times every day? If you are working with the very same host again and again, just set those environment variables and connect with plain SQL:

    [hs@linuxpc ~]$ export PGHOST=localhost
    [hs@linuxpc ~]$ export PGUSER=hs
    [hs@linuxpc ~]$ export PGPASSWORD=abc
    [hs@linuxpc ~]$ export PGPORT=5432
    [hs@linuxpc ~]$ export PGDATABASE=test
    [hs@linuxpc ~]$ psql
    psql (10.1)
    Type "help" for help.

As you can see, there are no command-line parameters anymore. Just type psql and you are in. All applications based on the standard PostgreSQL C-language client library (libpq) understand these environment variables, so you can use them not only for psql and pg_dump, but for many other applications as well.

Making use of .pgpass

A very common way to store login information is in a .pgpass file. The idea is simple: put a file called .pgpass into your home directory and put your login information there. The format is simple:

    hostname:port:database:username:password

An example would be:

    192.168.0.45:5432:mydb:xy:abc

PostgreSQL offers some nice additional functionality: most fields can contain *. Here is an example:

    *:*:*:xy:abc

This means that on every host, on every port, for every database, the user called xy will use abc as the password. To make PostgreSQL use the .pgpass file, make sure that the right file permissions are in place:

    chmod 0600 ~/.pgpass

.pgpass can also be used on a Windows system. In this case, the file can be found at %APPDATA%\postgresql\pgpass.conf.
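Putting the pieces together, a nightly dump can be driven entirely by environment variables plus .pgpass. A minimal sketch, assuming a .pgpass entry already covers the target host; the paths and database name are hypothetical:

    #!/bin/sh
    # Nightly logical backup; the password is read from ~/.pgpass
    export PGHOST=192.168.0.45
    export PGPORT=5432
    export PGUSER=xy
    export PGDATABASE=mydb

    pg_dump | gzip > /backup/mydb-$(date +%F).sql.gz

Because no password appears on the command line or in the environment, nothing sensitive leaks into the process table.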
Using service files

However, the .pgpass file is not the only option: you can also make use of service files. Here is how it works. If you want to connect to the very same servers over and over again, you can create a .pg_service.conf file, which will hold all the connection information you need. Here is an example of a .pg_service.conf file:

    iMac:~ hs$ cat .pg_service.conf
    # a sample service
    [hansservice]
    host=localhost
    port=5432
    dbname=test
    user=hs
    password=abc

    [paulservice]
    host=192.168.0.45
    port=5432
    dbname=xyz
    user=paul
    password=cde

To connect to one of the services, just set the environment and connect:

    iMac:~ hs$ export PGSERVICE=hansservice

A connection can now be established without passing parameters to psql:

    iMac:~ hs$ psql
    psql (10.1)
    Type "help" for help.

    test=#

Alternatively, you can use:

    psql service=hansservice

Extracting subsets of data

Up to now, you have seen how to dump an entire database. However, this may not be what you want. In many cases, you might just want to extract a subset of tables or schemas. pg_dump can do that and provides a number of switches:

- -a: dumps only the data and skips the data structure
- -s: dumps only the data structure and skips the data
- -n: dumps only a certain schema
- -N: dumps everything but excludes certain schemas
- -t: dumps only certain tables
- -T: dumps everything but certain tables (this can make sense if you want to exclude logging tables and so on)

Partial dumps can be very useful for speeding things up considerably.

Handling various formats

So far, you have seen that pg_dump can be used to create text files. The problem is that a text file can only be replayed completely. If you have saved an entire database, you can only replay the entire thing. In many cases, this is not what you want. Therefore, PostgreSQL supports additional formats that also offer more functionality. At this point, four formats are supported:

    -F, --format=c|d|t|p    output file format (custom, directory, tar, plain text (default))

You have already seen plain, which is just normal text. On top of that, you can use the custom format. The idea behind the custom format is to have a compressed dump, including a table of contents. Here are two ways to create a custom format dump:

    [hs@linuxpc ~]$ pg_dump -Fc test > /tmp/dump.fc
    [hs@linuxpc ~]$ pg_dump -Fc test -f /tmp/dump.fc

In addition to the table of contents, the compressed dump has one more advantage: it is a lot smaller. The rule of thumb is that a custom format dump is around 90% smaller than the database instance you are about to back up. Of course, this depends heavily on the number of indexes and other factors, but for many database applications, this rough estimate will hold true. Once you have created the backup, you can inspect the backup file:

    [hs@linuxpc ~]$ pg_restore --list /tmp/dump.fc
    ;
    ; Archive created at 2017-11-04 15:44:56 CET
    ;     dbname: test
    ;     TOC Entries: 18
    ;     Compression: -1
    ;     Dump Version: 1.12-0
    ;     Format: CUSTOM
    ;     Integer: 4 bytes
    ;     Offset: 8 bytes
    ;     Dumped from database version: 10.1
    ;     Dumped by pg_dump version: 10.1
    ;
    ; Selected TOC Entries:
    ;
    3103; 1262 16384 DATABASE - test hs
    3; 2615 2200 SCHEMA - public hs
    3104; 0 0 COMMENT - SCHEMA public hs
    1; 3079 13350 EXTENSION - plpgsql
    3105; 0 0 COMMENT - EXTENSION plpgsql
    187; 1259 16391 TABLE public t_test hs
    ...

pg_restore --list returns the table of contents of the backup.
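The table of contents is what makes partial replay possible. A minimal sketch, assuming the dump created above; the target database name test_restored is hypothetical:

    # Restore only the t_test table from the custom format dump
    pg_restore -t t_test -d test_restored /tmp/dump.fc

    # Or restore just the schema definitions, skipping the data
    pg_restore -s -d test_restored /tmp/dump.fc

This is precisely what a plain text dump cannot do: with -Fc, you pick entries from the table of contents instead of replaying the whole file.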
Using the custom format is a good idea, as the backup will shrink in size. However, there is more: the -Fd option will create a backup in the directory format. Instead of a single file, you will now get a directory containing a couple of files:

    [hs@linuxpc ~]$ mkdir /tmp/backup
    [hs@linuxpc ~]$ pg_dump -Fd test -f /tmp/backup/
    [hs@linuxpc ~]$ cd /tmp/backup/
    [hs@linuxpc backup]$ ls -lh
    total 86M
    -rw-rw-r--. 1 hs hs  85M Jan  4 15:54 3095.dat.gz
    -rw-rw-r--. 1 hs hs  107 Jan  4 15:54 3096.dat.gz
    -rw-rw-r--. 1 hs hs 740K Jan  4 15:54 3097.dat.gz
    -rw-rw-r--. 1 hs hs   39 Jan  4 15:54 3098.dat.gz
    -rw-rw-r--. 1 hs hs 4.3K Jan  4 15:54 toc.dat

One advantage of the directory format is that you can use more than one core to perform the backup. In the case of the plain or custom format, only one process is used by pg_dump. The directory format changes that rule. The following example shows how you can tell pg_dump to use four cores (jobs):

    [hs@linuxpc backup]$ rm -rf *
    [hs@linuxpc backup]$ pg_dump -Fd test -f /tmp/backup/ -j 4

Note that the more objects you have in your database, the greater the potential speedup.
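The same parallelism is available on the way back in. A minimal sketch, assuming the directory dump above; the target database test_restored is hypothetical:

    # Replay the directory format dump using four parallel jobs
    createdb test_restored
    pg_restore -j 4 -d test_restored /tmp/backup/

As with the dump itself, the benefit grows with the number of independent objects that can be processed concurrently.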
To summarize, you learned about creating backups in general. If you've enjoyed reading this post, do check out 'Mastering PostgreSQL 10' to learn how to replay backups and handle global data in PostgreSQL 10. You will also learn how to use PostgreSQL's onboard tools to replicate instances.

Read next:

- PostgreSQL 11 Beta 1 is out!
- How to perform data partitioning in PostgreSQL 10
- How to implement Dynamic SQL in PostgreSQL 10


What are data professionals planning to learn this year? Python, deep learning, yes. But also...

Amey Varangaonkar
14 Jun 2018
4 min read
One thing that every data professional absolutely dreads is the day their skills are no longer relevant in the market. In an ever-changing tech landscape, one must constantly be on the lookout for the most relevant, industrially accepted tools and frameworks. This applies everywhere, from application and web developers to cybersecurity professionals. Data professionals are no exception, as new ways to extract actionable insights from raw data are being found almost every day.

Gone are the days when data pros stuck to a single language and framework to work with their data. Frameworks are more flexible now, with multiple dependencies across various tools and languages. Not just that: new domains are being identified where these frameworks can be applied, and how they can be applied varies massively as well. A whole new arena of possibilities has opened up, and with it, new sets of skills and toolkits for working in these domains have been unlocked.

What's the next big thing for data professionals?

We recently polled thousands of data professionals as part of our Skill-Up program and got some very interesting insights into what they think the future of data science looks like. We asked them what they were planning to learn in the next 12 months. The following word cloud is the result of their responses, weighted by the frequency of the tools they chose:

[Word cloud: What data professionals are planning on learning in the next 12 months]

Unsurprisingly, Python comes out on top as the language many data pros want to learn in the coming months. With its general-purpose nature and innumerable applications across various use cases, Python's skyrocketing popularity is the reason everybody wants to learn it.

Machine learning and AI are finding significant applications in the web development domain today. They are revolutionizing the customer's digital experience through conversational UIs, or chatbots. Not just that: smart machine learning algorithms are being used to personalize websites and their UX. With all these reasons, who wouldn't want to learn JavaScript as an important tool in their data science toolkit? Add to that the trending web dev framework Angular, and you have all the tools to build smart, responsive front-end web applications.

We also saw data professionals taking an active interest in the mobile and cloud domains. They aim to learn Kotlin and combine its power with data science tools for developing smarter and more intelligent Android apps. When it comes to the cloud, Microsoft's Azure platform has introduced many built-in machine learning capabilities, as well as a workbench for data scientists to develop effective, enterprise-grade models. Data professionals also prefer Docker containers to run their applications seamlessly, so interest in learning it is high.

Has machine learning with JavaScript caught your interest? Don't worry, we've got you covered: check out Hands-on Machine Learning with JavaScript for practical, hands-on coverage of the essential machine learning concepts using the leading web development language.

With crypto's popularity through the roof (sadly, we can't say the same about Bitcoin's price), data pros see blockchain as a valuable skill. Building secure, decentralized apps is on the agenda for many, perhaps. Cloud, big data, and artificial intelligence are some of the other domains that data pros find interesting and feel are worth skilling up in.
Work-related skills that data pros want to learn

We also asked the data professionals what skills they wanted to learn in the near future that could help them do their daily jobs more effectively. The following word cloud of their responses paints a pretty clear picture:

[Word cloud: Valuable skills data professionals want to learn for their everyday work]

As machine learning and AI go mainstream, so do their applications in mainstream domains, often resulting in complex problems. Well, there's deep learning, and specifically neural networks, to tackle these problems, and these are exactly the skills data pros want to master in order to excel at their work.

Data pros want to learn machine learning in Python. Do you? Here's a useful resource to get you started: check out Python Machine Learning, Second Edition today!

So, there it is! What are the tools, languages, or frameworks that you are planning to learn in the coming months? Do you agree with the results of the poll? Do let us know.

Read next:

- What are web developers' favorite front-end tools? Packt's Skill Up report reveals all
- Data cleaning is the worst part of data analysis, say data scientists
- 15 Useful Python Libraries to make your Data Science tasks Easier


Alarming ways governments are using surveillance tech to watch you

Neil Aitken
14 Jun 2018
12 min read
Mapquest, part of the Verizon company, is the second largest provider of mapping services in the world, after Google Maps. It provides advanced cartography services to companies like Snap and Papa John's Pizza. The company is about to release an app that users can install on their smartphones. The new application will record and transmit video images of what's happening in front of your vehicle as you travel. Data can be sent from any phone with a camera, using the most common of tools: a simple mobile data plan, for example. In exchange, you'll get live traffic updates, among other things.

Mapquest will use the video image data it gathers to provide more accurate and up-to-date maps to its partners. The real world is changing all the time: roads get added, and cities re-route traffic from time to time. The new AI-based technology Mapquest employs could well improve the reliability of driverless cars, which have to engage with this ever-changing landscape in a safe manner. No one disagrees with safety improvements.

Mapquest's solution is impressive technology. The fact that the company can use AI to interpret the images it sees and upload the information it receives to update maps is incredible. And, in this regard, the company is just one of the myriad daily news stories which excite and astound us. These stories do, however, often have another side to them which is rarely acknowledged. In the wrong hands, Mapquest's solution could create a surveillance database which tracked people in real time.

Surveillance technology involves the use of data and information products to capture details about individuals. The act of surveillance is usually undertaken with a view to achieving a goal. The principle is simple: the more 'they' know about you, the easier it will be to influence you towards their ends. Surveillance information can be used to find you, apprehend you or, potentially, to change your mind without you even realising that you had been watched. Mapquest's innovation is just a single example of surveillance technology in government hands, which has expanded in capability far beyond what most people realise.

Read also: What does the US government know about you? The truth beyond the Facebook scandal

Facebook's share price fell 14% in early 2018 as a result of public outcry related to the Cambridge Analytica announcements the company made. The idea that a private company had allowed detailed information about individuals to be provided to a third party without their consent appeared to genuinely shock and appall people. Technology tools like Mapquest's tracking capabilities and Facebook's profiling techniques are being taken up and used by police forces and corporate entities around the world.

The reality of current private and public surveillance capabilities is that facilities exist, and are in use, to collect and analyse data on most people in the developed world. The known limits of these services may surprise even those who are on the cutting edge of technology. There are so many examples from all over the world listed below that they will genuinely make you want to consider going off grid!

Innovative, ingenious overlords: US companies have a flair for surveillance

The US is the centre for information-based technology companies. Much of what they develop is exported as well as used domestically.
The police are using human genome matching to track down criminals and can find 'any family in the country'

There have been two recent examples of police arresting a suspect after using human genome databases to investigate crimes. A growing number of private individuals have now used publicly available services such as 23andMe to sequence their genome (DNA), either to investigate their family tree further or to determine a potential genetic predisposition to a disease. In one example, the Golden State Killer, an ex-cop, was arrested 32 years after the last reported rape in a series of 45 (in addition to 12 murders) which occurred between 1976 and 1986. To track him down, police approached sites like 23andMe with DNA found at crime scenes, established a family match, and then progressed the investigation using conventional means.

More than 12 million Americans have now used a genetic sequencing service, and it is believed that investigators could find a family match for the DNA of anyone who has committed a crime in America. In simple terms, whether you want it or not, law enforcement effectively has the DNA of every individual in the country available to it.

Domain Awareness Centers (DAC) bring The Truman Show to life

The 400,000 residents of Oakland, California discovered in 2012 that they had been the subject of an undisclosed mass surveillance project by the local police force for many years. Feeds from CCTV cameras installed in Oakland's suburbs were augmented with weather information feeds, social media feeds, and extracted email conversations, as well as a variety of other sources. The scheme began at Oakland's port with federal funding, as part of a national response to the events of September 11, 2001, but was extended to cover the near half million residents of the city. Hundreds of additional video cameras were installed, along with gunshot recognition microphones and some of the other surveillance technologies described in this article. The police force conducting the surveillance had no policy on what information was recorded or for how long it was kept.

Internet-connected toys spy on children

The FBI has warned Americans that children's toys connected to the internet 'could put the privacy and safety of children at risk.' The children's toy Hello Barbie was specifically admonished for poor privacy controls as part of the FBI's press release. Internet-connected toys could be used to record video of children at any point in the day or, conceivably, to relay a human voice, making it appear to the child that the toy was talking to them.

Oracle suggests Google's Android operating system routinely tracks users' position, even when maps are turned off

In Australia, two American companies have been involved in a disagreement about the potential monitoring of Android phones. Oracle accused Google of monitoring users' location (including altitude), even when mapping software is turned off on the device. The tracking is performed in the background of the phone. In Australia alone, Oracle suggested that Google's monitoring could involve around 1 GB of additional mobile data every month, costing users nearly half a billion dollars a year, collectively.

Amazon facial recognition in real time helps US law enforcement services

Amazon is providing facial recognition services, which take feeds from public video cameras, to a number of US police forces.
Amazon can match images taken in real time to a database containing 'millions of faces.' Are there any state or federal rules in place to govern police facial recognition? Wired reported that there are 'more or less none.' Amazon's scheme is a trial taking place in Florida, and there are at least two other companies offering similar schemes to law enforcement services in the US.

Big glass microphone can help agencies keep an ear on the ground

Project 'Big Glass Microphone' uses the vibrations that the movement of cars (among other things) causes in buried fiber optic telecommunications links. A successful test of the technology has been undertaken on the fiber optic cables which run underground on the Stanford University campus, to record vehicle movements. Fiber optic links now make up the backbone of much data transport infrastructure: the way your phone and computer connect to the internet. Big Glass Microphone, as it stands, is the first step towards 'invisible' monitoring of people and their assets.

It appears the FBI now has the ability to crack/access any phone

Those in the know suggest that Apple's iPhone is the most secure smart device against government surveillance. In 2016, this was put to the test. The Justice Department came into possession of an iPhone allegedly belonging to one of the San Bernardino shooters and ultimately sued Apple in an attempt to force the company to grant access to it as part of the investigation. The case was ultimately dropped, leading some to speculate that NAND mirroring techniques were used to gain access to the phone without Apple's assistance, implying that even the most secure phones can now be accessed by the authorities.

Cornell University's lie-detecting algorithm

Groundbreaking work by Cornell University will provide 'at a distance' access to information that previously required close personal access to an accused subject. Cornell's solution interprets feeds from a number of video cameras trained on subjects and analyses the results to judge their heart rate. The researchers believe the system can be used to determine whether someone is lying from behind a screen.

The University of Southern California can anticipate social unrest with social media feeds

Researchers at the University of Southern California have developed an AI tool to study social media posts and determine whether those writing them are likely to cause social unrest. The software claims to have identified an association between both the volume of tweets written and the content of those tweets, and protests turning physical. The researchers can now offer advice to law enforcement on the likelihood of a protest turning violent, so that forces can be properly prepared.

The UK, an epicenter of AI progress, is not far behind in tracking people

The UK has a similarly impressive array of tools at its disposal to watch the people that representatives of the country feel need watching. Given the close cooperation between the UK and US governments, it is likely that many of these UK facilities are shared with the US and other NATO partners.

Project Stingray: fake cell phone/mobile phone 'towers' to intercept communications

Stingray is a brand name for an IMSI (the unique identifier on a SIM card) tracker. These devices 'spoof' real towers, presenting themselves as the closest mobile phone tower, which 'fools' phones into connecting to them. The technology has been used to spy on criminals in the UK, but it is not just the UK government which uses Stingray or its equivalents.
The Washington Post reported in June 2018 that a number of domestically compiled intelligence reports suggest that foreign governments acting on US soil, including China and Russia, have been eavesdropping on the White House using the same technology.

UK-developed spyware is being used by authoritarian regimes

Gamma International is a company based in Hampshire, UK, which provided the (notably authoritarian) Egyptian government with a facility to install what was effectively spyware, delivered with a virus, onto computers in the country. Once installed, the software permitted the government to monitor private digital interactions without needing to engage the phone company or ISP offering those services. Any internet-based activity could be tracked, assisting in tracking down individuals who might have negative feelings about the Egyptian government.

Individual arrested when his fingerprint was taken from a WhatsApp picture of his hand

A drug dealer was pictured holding an assortment of pills in the UK two months ago. The image of his hand was used to extract an image of his fingerprint. From that, forensic scientists used by UK police confirmed that officers had arrested the correct person and associated him with the drugs.

AI solutions to speed up evidence processing, including scanning laptops and phones

UK police forces are trying out AI software to speed up the processing of evidence from digital devices. A dozen departments around the UK are using software called Cellebrite, which employs AI algorithms to search through data found on devices, including phones and laptops. Cellebrite can recognize images that contain child abuse, accept feeds from multiple devices to see when multiple owners were in the same physical location at the same time, and read text from screenshots. Officers can even feed it photos of suspects to see if a picture of them shows up on someone's hard drive.

China takes the surveillance biscuit and may show us a glimpse of the future

There are 600 million mobile phone users in China, each producing a great deal of information. China has a notorious record of human rights abuses, and the ruling Communist Party takes a controlling interest (a board seat) in many of the country's largest technology companies, to ensure the work done is in the interest of the party as well as profitable for the corporation. As a result, China is on the front foot when it comes to both AI and surveillance technology. China's surveillance tools could be a harbinger of the future in the Western world.

Chinese cities will be run by a private company

Alibaba, China's equivalent of Amazon, already has control over the traffic lights in one Chinese city, Hangzhou. Alibaba is far from shy about its ambitions. It has 120,000 developers working on the problem and intends to commercialise and sell the data it gathers about citizens. The AI-based product it is using is called City Brain. In the future, all Chinese cities could well be run by AI from the Alibaba corporation: the idea is to use this trial as a template for every city, and the technology is likely to be deployed in Kuala Lumpur next. In the areas under City Brain's control, traffic speeds have already increased by 15%. However, some of those observing the situation have expressed concerns, not just about the (lack of) oversight of City Brain's current capabilities, but also about the potential for future abuse.

What to make of this incredible list of surveillance capabilities

Facilities like Mapquest's new mapping service are beguiling.
They’re clever ideas which create a better works. Similar technology, however, behind the scenes, is being adopted by law enforcement bodies in an ever growing list of countries. Even for someone who understands cutting edge technology, the sum of those facilities may be surprising. Literally any aspect of your behaviour, from the way you walk, to your face, your heatmap and, of course, the contents of your phone and laptops can now be monitored. Law enforcement can access and review information feeds with Artificial Intelligence software, to process and summarise findings quickly. In some cases, this is being done without the need for a warrant. Concerningly, these advances seem to be coming without policy or, in many cases any form of oversight. We must change how we think about AI, urge AI founding fathers  


5 ways to create a connection to the Qlik Engine [Tip]

Amey Varangaonkar
13 Jun 2018
8 min read
With mashups or web apps, the Qlik Engine sits outside of your project and is not accessible or loaded by default. The first step, before doing anything else, is to create a connection to the Qlik Engine, after which you can open a session and perform further actions on an app, such as:

- Opening a document/app
- Making selections
- Retrieving visualizations and apps

To use the Qlik Engine API, you open a WebSocket to the engine. There may be a difference in the way you do this, depending on whether you are working with Qlik Sense Enterprise or Qlik Sense Desktop. In this article, we will elaborate on how you can achieve a connection to the Qlik Engine and the benefits of doing so. The following excerpt has been taken from the book Mastering Qlik Sense, authored by Martin Mahler and Juan Ignacio Vitantonio.

Creating a connection

To create a connection using WebSockets, you first need to establish a new WebSocket communication line. To open a WebSocket to the engine, use one of the following URIs:

- Qlik Sense Enterprise: wss://server.domain.com:4747/app/ or wss://server.domain.com[/virtual proxy]/app/
- Qlik Sense Desktop: ws://localhost:4848/app

Creating a connection using WebSockets

In the case of Qlik Sense Desktop, all you need to do is define a WebSocket variable, including its connection string, in the following way:

    var ws = new WebSocket("ws://localhost:4848/app/");

Once the connection is opened (the ws.onopen event fires), you can send requests to the engine using ws.send(). The following example retrieves the number of available documents in my Qlik Sense Desktop environment and appends them to an HTML list:

    <html>
    <body>
      <ul id='docList'>
      </ul>
    </body>
    </html>
    <script>
    var ws = new WebSocket("ws://localhost:4848/app/");
    var request = {
      "handle": -1,
      "method": "GetDocList",
      "params": {},
      "outKey": -1,
      "id": 2
    };

    ws.onopen = function(event) {
      ws.send(JSON.stringify(request));
      // Receive the response
      ws.onmessage = function(event) {
        var response = JSON.parse(event.data);
        if (response.method != 'OnConnected') {
          var docList = response.result.qDocList;
          var list = '';
          docList.forEach(function(doc) {
            list += '<li>' + doc.qDocName + '</li>';
          });
          document.getElementById('docList').innerHTML = list;
        }
      };
    };
    </script>

The preceding example will produce a list of your documents in the browser if you have Qlik Sense Desktop running in the background. All Engine methods and calls can be tested in a user-friendly way by exploring the Qlik Engine in the Dev Hub.

A single WebSocket connection can be associated with only one engine session (consisting of the app context plus the user). If you need to work with multiple apps, you must open a separate WebSocket for each one. If you wish to create a WebSocket connection directly to an app, you can extend the configuration URL to include the application name or, in the case of Qlik Sense Enterprise, the GUID. You can then use methods from the app class, and any other classes, as you continue to work with objects within the app:

    var ws = new WebSocket("ws://localhost:4848/app/MasteringQlikSense.qvf");
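Once a socket is open, every interaction follows the same JSON-RPC pattern as the GetDocList request above. A minimal sketch, assuming Qlik Sense Desktop and the app file name from the example; OpenDoc is a documented global Engine API method, and the id value here is illustrative:

    var ws = new WebSocket("ws://localhost:4848/app/");

    ws.onopen = function () {
      // Ask the global object (handle -1) to open a specific app
      ws.send(JSON.stringify({
        "handle": -1,
        "method": "OpenDoc",
        "params": ["MasteringQlikSense.qvf"],
        "id": 3
      }));
    };

    ws.onmessage = function (event) {
      // The reply to id 3 carries a qHandle for the opened app, which
      // subsequent requests use as their "handle" value
      console.log(JSON.parse(event.data));
    };

The returned handle is what ties follow-up calls (selections, object retrieval, and so on) to the opened app.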
Creating a connection to the Qlik Server Engine

Connecting to the engine in a Qlik Sense Enterprise environment is a little different, as you will need to take care of authentication first. Authentication is handled in different ways, depending on how you have set up your server configuration, with the most common options being:

- Ticketing
- Certificates
- Header authentication

Authentication also depends on where the code that is interacting with the Qlik Engine is running:

- If you are running the code from a trusted computer, you can use certificates, which first need to be exported via the QMC
- If the code is running in a web browser, or certificates are not available, then you must authenticate via the virtual proxy of the server

Creating a connection using certificates

Certificates can be considered a seal of trust which allows you to communicate with the Qlik Engine directly, with full permissions. As such, only backend solutions should ever have access to certificates, and you should guard how you distribute them carefully. To connect using certificates, you first need to export them via the QMC, which is a relatively easy thing to do. Once they are exported, you need to copy them to the folder where your project is located and reference them as in the following code (this runs in a backend, Node.js-style environment, where the fs and path modules and WebSocket options are available):

    var certPath = path.join('C:', 'ProgramData', 'Qlik', 'Sense', 'Repository',
                             'Exported Certificates', '.Local Certificates');
    var certificates = {
      cert: fs.readFileSync(path.resolve(certPath, 'client.pem')),
      key: fs.readFileSync(path.resolve(certPath, 'client_key.pem')),
      root: fs.readFileSync(path.resolve(certPath, 'root.pem'))
    };

    // Open a WebSocket using the engine port (rather than going through the proxy)
    var ws = new WebSocket('wss://server.domain.com:4747/app/', {
      ca: certificates.root,
      cert: certificates.cert,
      key: certificates.key,
      headers: {
        'X-Qlik-User': 'UserDirectory=internal; UserId=sa_engine'
      }
    });

    ws.onopen = function (event) {
      // Call your methods
    };

Creating a connection using the Mashup API

Now, while connecting to the engine is a fundamental step to start interacting with Qlik, connecting via WebSockets is very low-level. For more advanced use cases, the Mashup API is one way to get up to speed with a more developer-friendly abstraction layer. The Mashup API utilizes the qlik interface as an external interface to Qlik Sense, used for mashups and for including Qlik Sense objects in external web pages. To load the qlik module, you first need to ensure RequireJS is available in your main project file. You will then have to specify the URL of your Qlik Sense environment, as well as the prefix of the virtual proxy, if there is one:
":" + config.port : "" ) + config.prefix + "resources" }); require(["js/qlik"], function (qlik) { qlik.setOnError( function (error) { console.log(error); }); //Open an App var app = qlik.openApp('MasteringQlikSense.qvf', config); </script> Once you have created the connection to an app, you can start leveraging the full API by conveniently creating HyperCubes, connecting to fields, passing selections, retrieving objects, and much more. The Mashup API is intended for browser-based projects where authentication is handled in the same way as if you were going to open Qlik Sense. If you wish to use the Mashup API, or some parts of it, with a backend solution, you need to take care of authentication first. Creating a connection using enigma.js Enigma is Qlik's open-source promise wrapper for the engine. You can use enigma directly when you're in the Mashup API, or you can load it as a separate module. When you are writing code from within the Mashup API, you can retrieve the correct schema directly from the list of available modules which are loaded together with qlik.js via 'autogenerated/qix/engine-api'.   The following example will connect to a Demo App using enigma.js: define(function () { return function () { require(['qlik','enigma','autogenerated/qix/engine-api'], function (qlik, enigma, schema) { //The base config with all details filled in var config = { schema: schema, appId: "My Demo App.qvf", session:{ host:"localhost", port: 4848, prefix: "", unsecure: true, }, } //Now that we have a config, use that to connect to the //QIX service. enigma.getService("qix" , config).then(function(qlik){ qlik.global.openApp(config.appId) //Open App qlik.global.openApp(config.appId).then(function(app){ //Create SessionObject for FieldList app.createSessionObject( { qFieldListDef: { qShowSystem: false, qShowHidden: false, qShowSrcTables: true, qShowSemantic: true, qShowDerivedFields: true }, qInfo: { qId: "FieldList", qType: "FieldList" } } ).then( function(list) { return list.getLayout(); } ).then( function(listLayout) { return listLayout.qFieldList.qItems; } ).then( function(fieldItems) { console.log(fieldItems) } ); }) } })}}) It's essential to also load the correct schema whenever you load enigma.js. The schema is a collection of the available API methods that can be utilized in each version of Qlik Sense. This means your schema needs to be in sync with your QS version. Thus, we see it is fairly easy to create a stable connection with the Qlik Engine API. If you liked the above excerpt, make sure you check out the book Mastering Qlik Sense to learn more tips and tricks on working with different kinds of data using Qlik Sense and extract useful business insights. How Qlik Sense is driving self-service Business Intelligence Overview of a Qlik Sense® Application’s Life Cycle What we learned from Qlik Qonnections 2018


How to prevent errors while using utilities for loading data in Teradata

Pravin Dhandre
11 Jun 2018
9 min read
In today's tutorial, we will help you overcome the errors that arise while loading, deleting, or updating large volumes of data using Teradata utilities. This article is an excerpt from the Teradata Cookbook, co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati. The book provides recipes to simplify the daily tasks performed by database administrators (DBAs), along with efficient data warehousing solutions for the Teradata database system.

Resolving FastLoad error 2652

When data is being loaded via FastLoad, a table lock is placed on the target table. This means that the table is unavailable for any other operation. A lock on a table is only released when FastLoad encounters the END LOADING command, which terminates phase 2, the so-called application phase. FastLoad may get terminated in phase 1 for any of the following reasons:

- The load script results in failure (error code 8 or 12)
- The load script is aborted by an admin or some other session
- FastLoad fails due to a bad record or file
- The END LOADING statement was forgotten in the script

If so, it keeps a lock on the table, which needs to be released manually. In this recipe, we will see the steps to release FastLoad locks.

Getting ready

Identify the table on which FastLoad has ended prematurely and which is in a locked state. You need valid credentials for the Teradata database. Execute the dummy FastLoad script from the same user, or a user which has write access to the locked table. A user requires the following privileges in order to execute FastLoad:

- SELECT and INSERT (plus CREATE and DROP, or DELETE) access to the target or loading table
- CREATE and DROP TABLE on the error tables
- SELECT, INSERT, UPDATE, and DELETE for the user PUBLIC on the restart log table (SYSADMIN.FASTLOG); there will be a row in the FASTLOG table for each FastLoad job that has not completed in the system

How to do it...

1. Open a text editor and create the following script:

    .LOGON 127.0.0.1/dbc,dbc;        /* valid system name and credentials for your system */
    DATABASE Database_Name;          /* database containing the locked table */
    BEGIN LOADING locked_table      /* table which is getting the 2652 error */
      ERRORFILES errortable_name, uv_tablename;  /* same error table names as in the failed script */
    END LOADING;                     /* ends phase 2 and releases the lock */
    .LOGOFF;

2. Save it as dummy_fl.txt.
3. Open the Windows Command Prompt and execute it using the fastload command:

    fastload < dummy_fl.txt

4. This dummy script, with no INSERT statement, should release the lock on the target table.
5. Execute a SELECT on the locked table to see if the lock has been released.

How it works...

As FastLoad is designed to work only on empty tables, it is necessary that the loading of the table finishes in one go. If the load script errors out prematurely in phase 2, without encountering the END LOADING command, it leaves a lock on the loading table. FastLoad locks can't be released via the HUT utility, as there is no technical lock on the table. To execute FastLoad, the following are required:

- Log table: FastLoad records its progress in the fastlog table
- Empty table: FastLoad needs the table to be empty before inserting rows into it
- Two error tables: FastLoad requires two error tables to be created; you just need to name them, and no DDL is required. The first error table records any translation or constraint violation errors, whereas the second error table captures errors related to duplicate values for Unique Primary Indexes (UPI). After FastLoad completes, you can analyze these error tables to see why records were rejected, as shown in the sketch below.
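The error tables are ordinary tables you can query. A minimal sketch, assuming the error table names used in the script above; ErrorCode, ErrorFieldName, and DataParcel are the standard columns of the first FastLoad error table:

    /* Summarize why rows were rejected during the acquisition phase */
    SELECT ErrorCode, ErrorFieldName, COUNT(*) AS rejected_rows
    FROM errortable_name
    GROUP BY 1, 2
    ORDER BY 3 DESC;

A similar SELECT against uv_tablename shows the rows rejected for duplicate UPI values.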
There's more...

If this does not fix the issue, you need to drop the target table and the error tables associated with it. Before dropping any tables, check with the administrator to abort any FastLoad sessions associated with the table.

Resolving MLOAD error 2571

MLOAD works in five phases, unlike FastLoad, which works in only two, and it can fail in either phase three or phase four. The five phases are:

1. Preliminary: basic setup. Syntax checking, establishing sessions with the Teradata database, creation of error tables (two error tables per target table), and creation of work tables and the log table are done in this phase.
2. DML transaction phase: the request is parsed through the PE and a step plan is generated. Steps and DML are then sent to the AMPs and stored in the appropriate work tables for each target table. Input data will be stored in these work tables and applied to the target tables later on.
3. Acquisition phase: unsorted data is sent to the AMPs in blocks of 64 KB. Rows are hashed by PI and sent to the appropriate AMPs. The utility places locks on the target tables in preparation for the application phase.
4. Application phase: changes are applied to the target tables and NUSI subtables. The lock on the table is held during this phase.
5. Cleanup phase: if the error code of all steps is 0, MLOAD completes successfully and releases all locks on the specified tables. In that case, all empty error tables, work tables, and the log table are dropped.

Getting ready

Identify the table which is affected by error 2571. Make sure no host utility is running on this table and that the load job is in a failed state for this table.

How to do it...

1. Check Viewpoint for any active utility job for this table. If you find an active job, let it complete.
2. If there is a reason you need to release the lock anyway, first abort all of the host utility's sessions from Viewpoint. Ask your administrator to do it.
3. Execute the following command:

    RELEASE MLOAD <databasename.tablename>;

4. If you get a "not able to release MLOAD lock" error, execute the following command:

    /* Release a lock held in the application phase */
    RELEASE MLOAD <databasename.tablename> IN APPLY;

5. Once the locks are released, you need to drop all the associated error tables, the log table, and the work tables (see the sketch after this section).
6. Re-execute MLOAD after correcting the error.

How it works...

The MLOAD utility places locks in the table headers to alert other utilities that a MultiLoad session is in progress for the table:

- Acquisition lock: DML allows all; DDL allows DROP only
- Application lock: DML allows SELECT with ACCESS only; DDL allows DROP only

There's more...

If the RELEASE statement still gives an error and does not release the lock on the table, you need to use SELECT with an ACCESS lock to copy the contents of the locked table to a new one, and then drop the locked table. If you receive error 7446, "MLoad table %ID cannot be released because NUSI exists", you need to drop all the NUSIs on the table and use ALTER TABLE to set the table to no fallback to accomplish the task.
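For step 5, the utility tables can be dropped with ordinary DDL. A minimal sketch, assuming the conventional ET_/UV_/WT_ prefixes for the support tables; the log table name is whatever your job's .LOGTABLE statement specified, and your job may have named all of these differently:

    /* Drop the MultiLoad support tables after releasing the lock */
    DROP TABLE databasename.ET_tablename;   /* acquisition-phase error table */
    DROP TABLE databasename.UV_tablename;   /* application-phase error table */
    DROP TABLE databasename.WT_tablename;   /* work table */
    DROP TABLE databasename.mload_logtable; /* restart log table from .LOGTABLE */

Only after these are gone can the corrected MLOAD job be restarted cleanly.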
Resolving failure 7547

This error is associated with the UPDATE statement, which could be plain SQL or part of an MLOAD job. Often, while updating a set of rows in a table, the update fails with Failure 7547: Target row updated by multiple source rows. This happens when you update the target with multiple rows from the source, meaning duplicate values are present in the source table.

Getting ready

Let's create sample volatile tables and insert values into them. After that, we will execute an UPDATE that fails with error 7547. Create the target table with the following DDL and insert values into it (the excerpt omitted the column data types, so plausible ones are assumed here):

    /* TARGET TABLE */
    CREATE VOLATILE TABLE accounts (
      CUST_ID INT,
      CUST_NAME VARCHAR(25),
      Sal INT
    ) PRIMARY INDEX (CUST_ID)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO accounts VALUES (1,'will',2000);
    INSERT INTO accounts VALUES (2,'bekky',2800);
    INSERT INTO accounts VALUES (3,'himesh',4000);

Create the source table with the following DDL and insert values into it:

    /* SOURCE TABLE */
    CREATE VOLATILE TABLE Hr_payhike (
      CUST_ID INT,
      CUST_NAME VARCHAR(25),
      Sal_hike INT
    ) PRIMARY INDEX (CUST_ID)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO Hr_payhike VALUES (1,'will',2030);
    INSERT INTO Hr_payhike VALUES (1,'bekky',3800);
    INSERT INTO Hr_payhike VALUES (3,'himesh',7000);

Execute the MLOAD script. The following snippet is the update part, which will fail:

    /* Snippet from the MLOAD update */
    UPDATE ACC
    FROM accounts ACC, Hr_payhike SUPD
    SET Sal = SUPD.Sal_hike
    WHERE ACC.CUST_ID = SUPD.CUST_ID;

    Failure: Target row updated by multiple source rows

How to do it...

1. Check for duplicate values in the source table:

    /* Check for duplicate values in the source table */
    SELECT CUST_ID, COUNT(*)
    FROM Hr_payhike
    GROUP BY 1
    ORDER BY 2 DESC;

2. The output shows that CUST_ID = 1 has two rows, which is what causes the error. While updating the target table, the optimizer cannot tell which source row should drive the update. Whose salary should be applied, Will's or Bekky's?
3. To resolve the error, execute the following update query, which picks exactly one source row per CUST_ID:

    /* Update part of MLOAD, deduplicated */
    UPDATE ACC
    FROM accounts ACC,
    ( SELECT CUST_ID, CUST_NAME, Sal_hike
      FROM Hr_payhike
      QUALIFY ROW_NUMBER() OVER (PARTITION BY CUST_ID
                                 ORDER BY CUST_NAME, Sal_hike DESC) = 1
    ) SUPD
    SET Sal = SUPD.Sal_hike
    WHERE ACC.CUST_ID = SUPD.CUST_ID;

Now the update will run without error.

How it works...

The failure happens whenever you update the target with multiple rows from the source. If you defined a primary index column for your target, and those columns appear in the update condition, this error can occur. To resolve it another way, you can delete the duplicates from the source table itself and execute the original update without any modification (a sketch follows below). If the source data can't be changed, then you need to change the update statement as shown above.
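For the delete-at-source alternative, one row per CUST_ID can be kept directly in the source table. A minimal sketch, assuming the same Hr_payhike table and an arbitrary keep-rule matching the QUALIFY used above; note it assumes the duplicate rows differ in at least one column:

    /* Remove duplicate source rows, keeping one row per CUST_ID */
    DELETE FROM Hr_payhike
    WHERE (CUST_ID, CUST_NAME, Sal_hike) IN (
      SELECT CUST_ID, CUST_NAME, Sal_hike
      FROM Hr_payhike
      QUALIFY ROW_NUMBER() OVER (PARTITION BY CUST_ID
                                 ORDER BY CUST_NAME, Sal_hike DESC) > 1
    );

After this, the original unmodified UPDATE succeeds, because each target row now matches at most one source row.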
To summarize, we have successfully learned how to overcome, or prevent, errors while using utilities to load data into the database. You can also check out the Teradata Cookbook for more than 100 recipes on enterprise data warehousing solutions.

Read next:

- 2018 is the year of graph databases. Here's why.
- 6 reasons to choose MySQL 8 for designing database solutions
- Amazon Neptune, AWS' cloud graph database, is now generally available


3 ways to use Indexes in Teradata to improve database performance

Pravin Dhandre
11 Jun 2018
15 min read
In this tutorial, we will create solutions that use indexes to improve the query performance of the Teradata database management system. This article is an excerpt from a book co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati, titled Teradata Cookbook. The book will teach you to tackle problems related to efficient querying, stored procedure searching, and navigation techniques in a Teradata database.

Creating a partitioned primary index to improve performance

A PPI (partitioned primary index) is a type of index that enables users to set up databases that gain performance benefits from data locality, while retaining the benefits of scalability inherent in the hash architecture of the Teradata database. This is achieved by hashing rows to different virtual AMPs, as is done with a normal PI, but also by creating local partitions within each virtual AMP. We will see how a PPI improves the performance of a query.

Getting ready

You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL. This will be a non-partitioned table:

    /* NON-PPI TABLE DDL */
    CREATE VOLATILE TABLE EMP_SAL_NONPPI (
      id INT,
      Sal INT,
      dob DATE,
      o_total INT
    ) PRIMARY INDEX (id)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO EMP_SAL_NONPPI VALUES (1001,2500,'2017-09-01',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1002,5500,'2017-09-10',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1003,500,'2017-09-02',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1004,54500,'2017-09-05',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1005,900,'2017-09-23',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1006,8900,'2017-08-03',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1007,8200,'2017-08-21',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1008,6200,'2017-08-06',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1009,2300,'2017-08-12',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1010,9200,'2017-08-15',890);

Let's check the explain plan of the following query, which selects data based on the dob column:

    /* SELECT on the non-PPI table */
    SELECT * FROM EMP_SAL_NONPPI WHERE dob <= 2017-08-01;

As seen in the following explain plan, an all-rows scan can be costly in terms of CPU and I/O if the table has millions of rows:

    Explain SELECT * from EMP_SAL_NONPPI where dob <= 2017-08-01;
    1) First, we do an all-AMPs RETRIEVE step from DBC.EMP_SAL_NONPPI
       by way of an all-rows scan with a condition of
       ("DBC.EMP_SAL_NONPPI.dob <= DATE '1900-12-31'") into Spool 1
       (group_amps), which is built locally on the AMPs. The size of
       Spool 1 is estimated with no confidence to be 4 rows (148 bytes).
       The estimated time for this step is 0.04 seconds.
    2) Finally, we send out an END TRANSACTION step to all AMPs involved
       in processing the request.
    -> The contents of Spool 1 are sent back to the user as the result
       of statement 1. The total estimated time is 0.04 seconds.

Let's see how we can enable partition retrieval for the same query.
How to do it...

1. Connect to the Teradata database using SQLA or Studio.
2. Create the following table with the same data; this time we define a PPI on the dob column:

    /* Partitioned table */
    CREATE VOLATILE TABLE EMP_SAL_PPI (
      id INT,
      Sal INT,
      dob DATE,
      o_total INT
    ) PRIMARY INDEX (id)
    PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01'
                              AND DATE '2017-12-01'
                              EACH INTERVAL '1' DAY)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO EMP_SAL_PPI VALUES (1001,2500,'2017-09-01',890);
    INSERT INTO EMP_SAL_PPI VALUES (1002,5500,'2017-09-10',890);
    INSERT INTO EMP_SAL_PPI VALUES (1003,500,'2017-09-02',890);
    INSERT INTO EMP_SAL_PPI VALUES (1004,54500,'2017-09-05',890);
    INSERT INTO EMP_SAL_PPI VALUES (1005,900,'2017-09-23',890);
    INSERT INTO EMP_SAL_PPI VALUES (1006,8900,'2017-08-03',890);
    INSERT INTO EMP_SAL_PPI VALUES (1007,8200,'2017-08-21',890);
    INSERT INTO EMP_SAL_PPI VALUES (1008,6200,'2017-08-06',890);
    INSERT INTO EMP_SAL_PPI VALUES (1009,2300,'2017-08-12',890);
    INSERT INTO EMP_SAL_PPI VALUES (1010,9200,'2017-08-15',890);

3. Let's execute the same query on the new partitioned table:

    /* SELECT on the PPI table */
    SELECT * FROM EMP_SAL_PPI WHERE dob <= 2017-08-01;

The explain plan shows that the data is accessed using only a single partition:

    /* EXPLAIN PLAN */
    1) First, we do an all-AMPs RETRIEVE step from a single partition of
       SYSDBA.EMP_SAL_PPI with a condition of
       ("SYSDBA.EMP_SAL_PPI.dob = DATE '2017-08-01'") with a residual
       condition of ("SYSDBA.EMP_SAL_PPI.dob = DATE '2017-08-01'") into
       Spool 1 (group_amps), which is built locally on the AMPs. The
       size of Spool 1 is estimated with no confidence to be 1 row
       (37 bytes). The estimated time for this step is 0.04 seconds.
    -> The contents of Spool 1 are sent back to the user as the result
       of statement 1. The total estimated time is 0.04 seconds.

How it works...

A partitioned PI helps improve the performance of a query by avoiding a full table scan. A PPI works the same as a primary index for data distribution, but additionally creates partitions according to the ranges or cases specified on the table. There are four types of PPI that can be created, shown below (a partition-inspection sketch follows the list):

Case partitioning:

    /* CASE partitioning */
    CREATE TABLE SALES_CASEPPI (
      ORDER_ID INTEGER,
      CUST_ID INTEGER,
      ORDER_DT DATE
    ) PRIMARY INDEX (ORDER_ID)
    PARTITION BY CASE_N (ORDER_ID < 101, ORDER_ID < 201,
                         ORDER_ID < 501, NO CASE, UNKNOWN);

Range-based partitioning:

    /* RANGE partitioned table */
    CREATE VOLATILE TABLE EMP_SAL_PPI (
      id INT,
      Sal INT,
      dob DATE,
      o_total INT
    ) PRIMARY INDEX (id)
    PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01'
                              AND DATE '2017-12-01'
                              EACH INTERVAL '1' DAY)
    ON COMMIT PRESERVE ROWS;

Multi-level partitioning:

    CREATE TABLE SALES_MLPPI_TABLE (
      ORDER_ID INTEGER NOT NULL,
      CUST_ID INTEGER,
      ORDER_DT DATE
    ) PRIMARY INDEX (ORDER_ID)
    PARTITION BY (RANGE_N (ORDER_DT BETWEEN DATE '2017-08-01'
                               AND DATE '2017-12-31'
                               EACH INTERVAL '1' DAY),
                  CASE_N (ORDER_ID < 1001, ORDER_ID < 2001,
                          ORDER_ID < 3001, NO CASE, UNKNOWN));

Character-based partitioning:

    /* CHAR partitioning */
    CREATE TABLE SALES_CHAR_PPI (
      ORDR_ID INTEGER,
      EMP_NAME VARCHAR(30)
    ) PRIMARY INDEX (ORDR_ID)
    PARTITION BY CASE_N (EMP_NAME LIKE 'A%', EMP_NAME LIKE 'B%',
                         EMP_NAME LIKE 'C%', EMP_NAME LIKE 'D%',
                         EMP_NAME LIKE 'E%', EMP_NAME LIKE 'F%',
                         NO CASE, UNKNOWN);

A PPI not only helps improve query performance, but also helps with table maintenance.
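To see the partitioning at work, you can count rows per partition using Teradata's built-in PARTITION column. A minimal sketch against the EMP_SAL_PPI table created above:

    /* Rows per partition; only non-empty partitions are returned */
    SELECT PARTITION AS partition_no, COUNT(*) AS row_count
    FROM EMP_SAL_PPI
    GROUP BY 1
    ORDER BY 1;

A query constrained on dob touches only the partitions listed here, which is exactly what the single-partition RETRIEVE step in the explain plan reflects.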
There are, however, certain performance considerations to keep in mind when creating a PPI on a table:

Access via the primary index can become slower: if the partitioning column is not part of the primary index, a query that supplies only the PI value in its WHERE clause may have to probe every partition.
The partitioning column must be chosen carefully in order to gain maximum benefit.
Drop unneeded secondary indexes or value-ordered join indexes.

Creating a join index to improve performance

A join index is a data structure that contains data from one or more tables, with or without aggregation. In this recipe, we will see how join indexes help in improving the performance of queries.

Getting ready

You need to connect to the Teradata database using SQLA or Studio. Let's create a table and insert data into it using the following code:

CREATE TABLE td_cookbook.EMP_SAL
(
 id INT,
 DEPT varchar(25),
 emp_Fname varchar(25),
 emp_Lname varchar(25),
 emp_Mname varchar(25),
 status INT
)primary index(id);

INSERT into td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2);
INSERT into td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1);
INSERT into td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2);
INSERT into td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1);

Further, we will create a single-table join index whose primary index differs from that of the base table.

How to do it...

The following are the steps to create a join index to improve performance:

Connect to the Teradata database using SQLA or Studio.
Check the explain plan for the following query:

/*SELECT on base table*/
EXPLAIN SELECT id,dept,emp_Fname,emp_Lname,status from td_cookbook.EMP_SAL where id=4;

1) First, we do a single-AMP RETRIEVE step from td_cookbook.EMP_SAL by way of the primary index "td_cookbook.EMP_SAL.id = 4" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds.

Because the query has a WHERE clause on id, the system retrieves from the EMP_SAL table using the primary index of the base table, which is id. However, if a user queries the table on the column emp_Fname, an all-rows scan will occur, which degrades the performance of the query.

Now, we will create a JOIN INDEX using emp_Fname as the primary index:

/*Join Index*/
CREATE JOIN INDEX td_cookbook.EMP_JI AS
SELECT id,emp_Fname,emp_Lname,status,emp_Mname,dept
FROM td_cookbook.EMP_SAL
PRIMARY INDEX(emp_Fname);

Let's collect statistics on the join index:

/*Collect stats on JI*/
COLLECT STATS td_cookbook.EMP_JI COLUMN emp_Fname;

Now, we will check the explain plan of a query whose WHERE clause uses the column emp_Fname:

Explain sel id,dept,emp_Fname,emp_Lname,status from td_cookbook.EMP_SAL where emp_Fname='Anikta';

1) First, we do a single-AMP RETRIEVE step from td_cookbook.EMP_JI by way of the primary index "td_cookbook.EMP_JI.emp_Fname = 'Anikta'" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds.
-> The contents of Spool 1 are sent back to the user as the result of statement 1.
The total estimated time is 0.02 seconds.

In the EXPLAIN output, you can see that the optimizer uses the join index instead of the base table whenever the query filters on the emp_Fname column.

How it works...

Query performance improves any time a join index can be used instead of the base tables. A join index is most useful when its columns can satisfy, or cover, most or all of the requirements in a query. For example, the optimizer may consider using a covering index instead of performing a merge join. When all the queried columns can be satisfied by a join index, the query is said to be covered. Covering indexes improve the speed of join queries. The extent of improvement can be dramatic, especially for queries involving complex, large-table, and multiple-table joins, and depends on how often an index is appropriate to a query.

There are a few more join indexes that can be used in Teradata:

Aggregate-table join index: A type of join index which pre-joins and summarizes aggregated tables without requiring any physical summary tables. It refreshes automatically whenever the base table changes. Only COUNT and SUM are permitted, and DISTINCT is not permitted:

/*AGG JOIN INDEX*/
CREATE JOIN INDEX Agg_Join_Index AS
SELECT Cust_ID, Order_ID,
SUM(Sales_north) -- Aggregate column
FROM sales_table
GROUP BY 1,2
Primary Index(Cust_ID)

Use FLOAT as the data type for COUNT and SUM to avoid overflow.

Sparse join index: When a WHERE clause is applied in a JOIN INDEX, it is known as a sparse join index. By limiting the number of rows retrieved in a join, it reduces the size of the join index. It is also useful for UPDATE statements where the index is highly selective:

/*SP JOIN INDEX*/
CREATE JOIN INDEX Sparse_Join_Index AS
SELECT Cust_ID, Order_ID,
SUM(Sales_north) -- Aggregate column
FROM sales_table
where Order_id = 1 -- WHERE CLAUSE
GROUP BY 1,2
Primary Index(Cust_ID)

Creating a hash index to improve performance

Hash indexes, like join indexes (especially single-table join indexes), are designed to improve query performance, and in addition they enable you to avoid accessing the base table. The syntax for the hash index is as follows:

/*Hash index syntax*/
CREATE HASH INDEX <hash-index-name>
[, <fallback-option>]
(<column-name-list1>)
ON <base-table>
[BY (<partition-column-name-list2>)]
[ORDER BY <index-sort-spec>] ;

Getting ready

You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL:

/*Create table with data*/
CREATE TABLE td_cookbook.EMP_SAL
(
 id INT,
 DEPT varchar(25),
 emp_Fname varchar(25),
 emp_Lname varchar(25),
 emp_Mname varchar(25),
 status INT
)primary index(id);

INSERT into td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2);
INSERT into td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1);
INSERT into td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2);
INSERT into td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1);

How to do it...

You need to connect to the Teradata database using SQLA or Studio. Let's check the explain plan of the following query:

/*EXPLAIN of SELECT*/
EXPLAIN SELECT id,emp_Fname from td_cookbook.EMP_SAL;

1) First, we lock td_cookbook.EMP_SAL for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock td_cookbook.EMP_SAL for read.
3) We do an all-AMPs RETRIEVE step from td_cookbook.EMP_SAL by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

Now let's create a hash index on the EMP_SAL table:

/*Hash index*/
CREATE HASH INDEX td_cookbook.EMP_HASH_inx
(id, DEPT) ON td_cookbook.EMP_SAL
BY (id)
ORDER BY HASH (id);

Let's now check the explain plan of the select query after the hash index creation:

/*Select after hash index*/
EXPLAIN SELECT id, dept FROM td_cookbook.EMP_SAL;

1) First, we lock td_cookbook.EMP_HASH_INX for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock td_cookbook.EMP_HASH_INX for read.
3) We do an all-AMPs RETRIEVE step from td_cookbook.EMP_HASH_INX by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

How it works...

Points to consider about this hash index definition:

Each hash index row contains the id and DEPT columns. Strictly speaking, specifying id is unnecessary, since it is the primary index of the base table and would therefore be included automatically.
The BY clause indicates that the rows of this index will be distributed by the hash value of id.
The ORDER BY clause indicates that the index rows will be ordered on each AMP in sequence by the hash value of id.
The column specified in the BY clause must be one of the columns that make up the hash index, and the BY clause must be accompanied by the ORDER BY clause.
Unlike join indexes, hash indexes can only be defined on a single table.

We explored how to create different types of indexes to get maximum performance out of your database queries. If you found this article useful, do check out the book Teradata Cookbook and gain confidence in running a wide variety of data analytics and developing applications for the Teradata environment.

Why MongoDB is the most popular NoSQL database today
Why Oracle is losing the Database Race
Using the Firebase Real-Time Database

Implementing feedforward networks with TensorFlow

Aarthi Kumaraswamy
07 Jun 2018
12 min read
Deep feedforward networks, also called feedforward neural networks, are sometimes referred to as Multilayer Perceptrons (MLPs). The goal of a feedforward network is to approximate some function f∗. For example, for a classifier, y=f∗(x) maps an input x to a label y. A feedforward network defines a mapping from input to label, y=f(x;θ), and learns the value of the parameter θ that results in the best function approximation.

This tutorial is an excerpt from the book, Neural Network Programming with Tensorflow by Manpreet Singh Ghotra, and Rajdeep Dua. With this book, learn how to implement more advanced neural networks like CNNs, RNNs, GANs, deep belief networks and others in Tensorflow.

How do feedforward networks work?

Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications. Feedforward neural networks are called networks because they are composed of many different functions. These functions are composed in a directed acyclic graph, and the model is associated with this graph, which describes how the functions are combined. For example, three functions f(1), f(2), and f(3) can be connected to form f(x) =f(3)(f(2)(f(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model; it is from this terminology that the name deep learning arises. The final layer of a feedforward network is called the output layer.

Diagram showing various functions activated on input x to form a neural network

These networks are called neural because they are inspired by neuroscience. Each hidden layer is a vector, and the dimensionality of these hidden layers determines the width of the model.

Implementing feedforward networks with TensorFlow

Feedforward networks can be easily implemented using TensorFlow by defining placeholders for hidden layers, computing the activation values, and using them to calculate predictions. Let's take an example of classification with a feedforward network:

X = tf.placeholder("float", shape=[None, x_size])
y = tf.placeholder("float", shape=[None, y_size])
# initialize_weights is a small helper; see the sketch at the end of this article
weights_1 = initialize_weights((x_size, hidden_size), stddev)
weights_2 = initialize_weights((hidden_size, y_size), stddev)
sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))
y_pred = tf.matmul(sigmoid, weights_2)  # use a new name so the placeholder y is not overwritten

Once the predicted value tensor has been defined, we calculate the cost function:

cost = tf.reduce_mean(tf.nn.OPERATION_NAME(labels=<actual value>, logits=<predicted value>))
updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)

Here, OPERATION_NAME could be one of the following:

tf.nn.sigmoid_cross_entropy_with_logits: Calculates sigmoid cross entropy on incoming logits and labels:

sigmoid_cross_entropy_with_logits(
 _sentinel=None,
 labels=None,
 logits=None,
 name=None
)

_sentinel: Used to prevent positional parameters. Internal, do not use.
labels: A tensor of the same type and shape as logits.
logits: A tensor of type float32 or float64.

The formula implemented is (with x = logits, z = labels): max(x, 0) - x * z + log(1 + exp(-abs(x))).

tf.nn.softmax: Performs softmax activation on the incoming tensor. This only normalizes to make sure all the probabilities in a tensor row add up to one. It cannot be used directly as a classification loss:
softmax = exp(logits) / reduce_sum(exp(logits), dim)

logits: A non-empty tensor. Must be one of the following types: half, float32, or float64.
dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension.
name: A name for the operation (optional).

tf.nn.log_softmax: Calculates the log of the softmax function and helps in normalizing underfitting. This function is also just a normalization function:

log_softmax(
 logits,
 dim=-1,
 name=None
)

logits: A non-empty tensor. Must be one of the following types: half, float32, or float64.
dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension.
name: A name for the operation (optional).

tf.nn.softmax_cross_entropy_with_logits:

softmax_cross_entropy_with_logits(
 _sentinel=None,
 labels=None,
 logits=None,
 dim=-1,
 name=None
)

_sentinel: Used to prevent positional parameters. For internal use only.
labels: Each row labels[i] must be a valid probability distribution.
logits: Unscaled log probabilities.
dim: The class dimension. Defaults to -1, which is the last dimension.
name: A name for the operation (optional).

The preceding function computes softmax cross entropy between logits and labels. While the classes are mutually exclusive, their probabilities need not be; all that is required is that each row of labels is a valid probability distribution. For exclusive labels (where one and only one class is true at a time), use sparse_softmax_cross_entropy_with_logits.

tf.nn.sparse_softmax_cross_entropy_with_logits:

sparse_softmax_cross_entropy_with_logits(
 _sentinel=None,
 labels=None,
 logits=None,
 name=None
)

labels: Tensor of shape [d_0, d_1, ..., d_(r-1)] (where r is the rank of labels and result) and dtype int32 or int64. Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this operation is run on the CPU, and return NaN for the corresponding loss and gradient rows on the GPU.
logits: Unscaled log probabilities of shape [d_0, d_1, ..., d_(r-1), num_classes] and dtype float32 or float64.

The preceding function computes sparse softmax cross entropy between logits and labels. The probability of a given label is considered exclusive: soft classes are not allowed, and the labels vector must provide a single specific index for the true class for each row of logits.

tf.nn.weighted_cross_entropy_with_logits:

weighted_cross_entropy_with_logits(
 targets,
 logits,
 pos_weight,
 name=None
)

targets: A tensor of the same type and shape as logits.
logits: A tensor of type float32 or float64.
pos_weight: A coefficient to use on the positive examples.

This is similar to sigmoid_cross_entropy_with_logits(), except that pos_weight allows a trade-off between recall and precision by up- or down-weighting the cost of a positive error relative to a negative error.

Analyzing the Iris dataset with a TensorFlow feedforward network

Let's look at a feedforward example using the Iris dataset. You can download the dataset from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/iris.csv and the target labels from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/target.csv.

In the Iris dataset, we will use 150 rows of data made up of 50 samples from each of three Iris species: Iris setosa, Iris virginica, and Iris versicolor.

Petal geometry compared from three iris species: Iris Setosa, Iris Virginica, and Iris Versicolor.
In the dataset, each row contains data for each flower sample: sepal length, sepal width, petal length, petal width, and flower species. Flower species are stored as integers, with 0 denoting Iris setosa, 1 denoting Iris versicolor, and 2 denoting Iris virginica.

First, we will create a run() function that takes three parameters: hidden layer size h_size, standard deviation for weights stddev, and step size of stochastic gradient descent sgd_step:

def run(h_size, stddev, sgd_step):

Input data loading is done using the genfromtxt function in numpy. The Iris data loaded has a shape of L: 150 and W: 4. Data is loaded in the all_X variable. Target labels are loaded from target.csv in all_Y with the shape of L: 150, W: 3:

# the following imports and seed are needed by this snippet but are not shown
# in the excerpt; the exact seed value used in the book may differ
import numpy as np
from sklearn.model_selection import train_test_split
RANDOMSEED = 42

def load_iris_data():
    from numpy import genfromtxt
    data = genfromtxt('iris.csv', delimiter=',')
    target = genfromtxt('target.csv', delimiter=',').astype(int)
    # Prepend the column of 1s for bias
    L, W = data.shape
    all_X = np.ones((L, W + 1))
    all_X[:, 1:] = data
    num_labels = len(np.unique(target))
    all_y = np.eye(num_labels)[target]
    return train_test_split(all_X, all_y, test_size=0.33, random_state=RANDOMSEED)

Once the data is loaded, we initialize the weight matrices based on x_size, y_size, and h_size, with the standard deviation passed to the run() method:

x_size = 5
y_size = 3
h_size = 128 (or any other number chosen for the neurons in the hidden layer)

# Size of Layers
x_size = train_x.shape[1] # Input nodes: 4 features and 1 bias
y_size = train_y.shape[1] # Outcomes (3 iris flowers)

# variables
X = tf.placeholder("float", shape=[None, x_size])
y = tf.placeholder("float", shape=[None, y_size])
weights_1 = initialize_weights((x_size, h_size), stddev)
weights_2 = initialize_weights((h_size, y_size), stddev)

Next, we make the prediction using sigmoid as the activation function defined in the forward_propagation() function:

def forward_propagation(X, weights_1, weights_2):
    sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))
    y = tf.matmul(sigmoid, weights_2)
    return y

First, the sigmoid output is calculated from input X and weights_1. This is then used to calculate y as a matrix multiplication of sigmoid and weights_2:

y_pred = forward_propagation(X, weights_1, weights_2)
predict = tf.argmax(y_pred, dimension=1)

Next, we define the cost function and optimization using gradient descent. Let's look at the GradientDescentOptimizer being used. It is defined in the tf.train.GradientDescentOptimizer class and implements the gradient descent algorithm. To construct an instance, we use the following constructor and pass sgd_step as a parameter:

# constructor for GradientDescentOptimizer
__init__(
 learning_rate,
 use_locking=False,
 name='GradientDescent'
)

The arguments passed are explained here:

learning_rate: A tensor or a floating point value. The learning rate to use.
use_locking: If True, use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. The default name is "GradientDescent".

The following list shows the code to implement the cost function:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_pred))
updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)

Next, we will implement the following steps:

Initialize the TensorFlow session: sess = tf.Session()
Initialize all the variables using tf.initialize_all_variables(); the returned op is run to initialize the variables in the session.
Iterate over the steps (1 to 50).
For each example in train_x and train_y, execute updates_sgd.
Calculate the train_accuracy and test_accuracy.
We stored the accuracy for each step in a list so that we could plot a graph:

init = tf.initialize_all_variables()
steps = 50
sess.run(init)
x = np.arange(steps)
test_acc = []
train_acc = []
print("Step, train accuracy, test accuracy")

for step in range(steps):
    # Train with each example
    for i in range(len(train_x)):
        sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1], y: train_y[i: i + 1]})
    train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                             sess.run(predict, feed_dict={X: train_x, y: train_y}))
    test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                            sess.run(predict, feed_dict={X: test_x, y: test_y}))
    print("%d, %.2f%%, %.2f%%" % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
    test_acc.append(100. * test_accuracy)
    train_acc.append(100. * train_accuracy)

Code execution

Let's run this code for an h_size of 128, a standard deviation of 0.1, and an sgd_step of 0.01:

def run(h_size, stddev, sgd_step):
    ...

def main():
    run(128,0.1,0.01)

if __name__ == '__main__':
    main()

The preceding code outputs a graph that plots the steps against the test and train accuracy.

Let's compare the change in SGD steps and its effect on training accuracy. The following code is very similar to the previous code example, but we will rerun it for multiple SGD steps to see how the SGD step size affects accuracy levels:

def run(h_size, stddev, sgd_steps):
    ....
    test_accs = []
    train_accs = []
    time_taken_summary = []
    for sgd_step in sgd_steps:
        start_time = time.time()
        updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
        sess = tf.Session()
        init = tf.initialize_all_variables()
        steps = 50
        sess.run(init)
        x = np.arange(steps)
        test_acc = []
        train_acc = []
        print("Step, train accuracy, test accuracy")
        for step in range(steps):
            # Train with each example
            for i in range(len(train_x)):
                sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1], y: train_y[i: i + 1]})
            train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                     sess.run(predict, feed_dict={X: train_x, y: train_y}))
            test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                                    sess.run(predict, feed_dict={X: test_x, y: test_y}))
            print("%d, %.2f%%, %.2f%%" % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
            #x.append(step)
            test_acc.append(100. * test_accuracy)
            train_acc.append(100. * train_accuracy)
        end_time = time.time()
        diff = end_time - start_time
        time_taken_summary.append((sgd_step,diff))
        t = [np.array(test_acc)]
        t.append(train_acc)
        train_accs.append(train_acc)

The output of the preceding code is an array with the training and test accuracy for each SGD step value. In our example, we called run() with sgd_steps set to [0.01, 0.02, 0.03]:

def main():
    sgd_steps = [0.01,0.02,0.03]
    run(128,0.1,sgd_steps)

if __name__ == '__main__':
    main()

The resulting plot shows how training accuracy changes with sgd_steps. For an SGD value of 0.03, accuracy rises faster, as the step size is larger.

In this post, we built our first neural network, which was feedforward only, and used it for classifying the contents of the Iris dataset.

You enjoyed a tutorial from the book, Neural Network Programming with Tensorflow. To implement advanced neural networks like CNNs, RNNs, GANs, deep belief networks and others in Tensorflow, grab your copy today!

Neural Network Architectures 101: Understanding Perceptrons
How to Implement a Neural Network with Single-Layer Perceptron
Deep Learning Algorithms: How to classify Irises using multi-layer perceptrons
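One loose end from the code above: the snippets call an initialize_weights helper that this excerpt never defines. A minimal sketch consistent with how it is called (a shape tuple plus a standard deviation) might look like the following; the exact body in the book may differ:

import tensorflow as tf

def initialize_weights(shape, stddev):
    # Create a trainable weight matrix of the given shape,
    # initialized from a normal distribution with the given stddev
    weights = tf.random_normal(shape, stddev=stddev)
    return tf.Variable(weights)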

How TFLearn makes building TensorFlow models easier

Savia Lobo
04 Jun 2018
7 min read
Today, we will introduce you to TFLearn and create layers and models that can be used directly in any TensorFlow model implementation. TFLearn is a modular library in Python that is built on top of core TensorFlow. [box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn how to build TensorFlow models to work with multilayer perceptrons using Keras, TFLearn, and R.[/box]

TIP: TFLearn is different from the TensorFlow Learn package, which is also known as TF Learn (with one space between TF and Learn). TFLearn's documentation is available at http://tflearn.org and its source code is available on GitHub.

TFLearn can be installed in Python 3 with the following command:

pip3 install tflearn

Note: To install TFLearn in other environments or from source, please refer to the following link: http://tflearn.org/installation/

The simple workflow in TFLearn is as follows:

Create an input layer first.
Pass the input object to create further layers.
Add the output layer.
Create the net using an estimator layer such as regression.
Create a model from the net created in the previous step.
Train the model with the model.fit() method.
Use the trained model to predict or evaluate.

Creating the TFLearn Layers

Let us learn how to create the layers of the neural network models in TFLearn:

Create an input layer first:

input_layer = tflearn.input_data(shape=[None,num_inputs])

Pass the input object to create further layers:

layer1 = tflearn.fully_connected(input_layer,10, activation='relu')
layer2 = tflearn.fully_connected(layer1,10, activation='relu')

Add the output layer:

output = tflearn.fully_connected(layer2,n_classes, activation='softmax')

Create the final net from the estimator layer such as regression:

net = tflearn.regression(output,
 optimizer='adam',
 metric=tflearn.metrics.Accuracy(),
 loss='categorical_crossentropy'
)

TFLearn provides several classes for layers, which are described in the following sub-sections.

TFLearn core layers

TFLearn offers the following layers in the tflearn.layers.core module:

input_data: This layer is used to specify the input layer for the neural network.
fully_connected: This layer is used to specify a layer where all the neurons are connected to all the neurons in the previous layer.
dropout: This layer is used to specify the dropout regularization. The input elements are scaled by 1/keep_prob while keeping the expected sum unchanged.
custom_layer: This layer is used to specify a custom function to be applied to the input. This class wraps our custom function and presents the function as a layer.
reshape: This layer reshapes the input into the output of the specified shape.
flatten: This layer converts the input tensor to a 2D tensor.
activation: This layer applies the specified activation function to the input tensor.
single_unit: This layer applies the linear function to the inputs.
highway: This layer implements the fully connected highway function.
one_hot_encoding: This layer converts the numeric labels to their binary vector one-hot encoded representations.
time_distributed: This layer applies the specified function to each time step of the input tensor.
multi_target_data: This layer creates and concatenates multiple placeholders, specifically used when the layers use targets from multiple sources.
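As an illustration of the core layers in action, the following sketch (not from the book) builds a tiny network using input_data, custom_layer, fully_connected, and dropout; the scaling function is an arbitrary example:

import tensorflow as tf
import tflearn

# An arbitrary user-defined transformation, wrapped as a layer
def scale_inputs(x):
    return tf.multiply(x, 0.5)

net = tflearn.input_data(shape=[None, 784])          # input layer
net = tflearn.custom_layer(net, scale_inputs)        # apply the custom function
net = tflearn.fully_connected(net, 64, activation='relu')
net = tflearn.dropout(net, 0.8)                      # keep probability of 0.8
net = tflearn.fully_connected(net, 10, activation='softmax')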
TFLearn convolutional layers

TFLearn offers the following layers in the tflearn.layers.conv module:

conv_1d: This layer applies 1D convolutions to the input data.
conv_2d: This layer applies 2D convolutions to the input data.
conv_3d: This layer applies 3D convolutions to the input data.
conv_2d_transpose: This layer applies the transpose of conv_2d to the input data.
conv_3d_transpose: This layer applies the transpose of conv_3d to the input data.
atrous_conv_2d: This layer computes a 2-D atrous convolution.
grouped_conv_2d: This layer computes a depth-wise 2-D convolution.
max_pool_1d: This layer computes 1-D max pooling.
max_pool_2d: This layer computes 2D max pooling.
avg_pool_1d: This layer computes 1D average pooling.
avg_pool_2d: This layer computes 2D average pooling.
upsample_2d: This layer applies the row and column wise 2-D repeat operation.
upscore_layer: This layer implements the upscore as specified in http://arxiv.org/abs/1411.4038.
global_max_pool: This layer implements the global max pooling operation.
global_avg_pool: This layer implements the global average pooling operation.
residual_block: This layer implements the residual block to create deep residual networks.
residual_bottleneck: This layer implements the residual bottleneck block for deep residual networks.
resnext_block: This layer implements the ResNeXt block.

TFLearn recurrent layers

TFLearn offers the following layers in the tflearn.layers.recurrent module:

simple_rnn: This layer implements the simple recurrent neural network model.
bidirectional_rnn: This layer implements the bi-directional RNN model.
lstm: This layer implements the LSTM model.
gru: This layer implements the GRU model.

TFLearn normalization layers

TFLearn offers the following layers in the tflearn.layers.normalization module:

batch_normalization: This layer normalizes the output of activations of previous layers for each batch.
local_response_normalization: This layer implements the LR normalization.
l2_normalization: This layer applies the L2 normalization to the input tensors.

TFLearn embedding layers

TFLearn offers only one layer in the tflearn.layers.embedding_ops module:

embedding: This layer implements the embedding function for a sequence of integer IDs or floats.

TFLearn merge layers

TFLearn offers the following layers in the tflearn.layers.merge_ops module:

merge_outputs: This layer merges the list of tensors into a single tensor, generally used to merge output tensors of the same shape.
merge: This layer merges the list of tensors into a single tensor; you can specify the axis along which the merge needs to be done.

TFLearn estimator layers

TFLearn offers only one layer in the tflearn.layers.estimator module:

regression: This layer implements linear or logistic regression.

While creating the regression layer, you can specify the optimizer and the loss and metric functions. TFLearn offers the following optimizer functions as classes in the tflearn.optimizers module:

SGD
RMSprop
Adam
Momentum
AdaGrad
Ftrl
AdaDelta
ProximalAdaGrad
Nesterov

Note: You can create custom optimizers by extending the tflearn.optimizers.Optimizer base class.

TFLearn offers the following metric functions as classes or ops in the tflearn.metrics module:

Accuracy or accuracy_op
Top_k or top_k_op
R2 or r2_op
WeightedR2 or weighted_r2_op
binary_accuracy_op

Note: You can create custom metrics by extending the tflearn.metrics.Metric base class.
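To show how these optimizer and metric classes plug into the regression estimator layer, here is a brief sketch (not from the book; the hyperparameter values are arbitrary):

import tflearn

# assumes `net` is the output layer of a network built as shown earlier
sgd = tflearn.optimizers.SGD(learning_rate=0.01, lr_decay=0.96, decay_step=100)
acc = tflearn.metrics.Accuracy()

net = tflearn.regression(net, optimizer=sgd, metric=acc,
                         loss='categorical_crossentropy')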
TFLearn provides the following loss functions, known as objectives, in the tflearn.objectives module:

softmax_categorical_crossentropy
categorical_crossentropy
binary_crossentropy
weighted_crossentropy
mean_square
hinge_loss
roc_auc_score
weak_cross_entropy_2d

While specifying the input, hidden, and output layers, you can specify the activation functions to be applied to the output. TFLearn provides the following activation functions in the tflearn.activations module:

linear
tanh
sigmoid
softmax
softplus
softsign
relu
relu6
leaky_relu
prelu
elu
crelu
selu

Creating the TFLearn Model

Create the model from the net created in the previous step (step 4 in the creating the TFLearn layers section):

model = tflearn.DNN(net)

Types of TFLearn models

TFLearn offers two different classes of models:

DNN (Deep Neural Network) model: This class allows you to create a multilayer perceptron from the network that you have created from the layers.
SequenceGenerator model: This class allows you to create a deep neural network that can generate sequences.

Training the TFLearn Model

After creating the model, train it with the model.fit() method:

model.fit(X_train, Y_train,
 n_epoch=n_epochs,
 batch_size=batch_size,
 show_metric=True,
 run_id='dense_model')

Using the TFLearn Model

Use the trained model to predict or evaluate:

score = model.evaluate(X_test, Y_test)
print('Test accuracy:', score[0])

The complete code for the TFLearn MNIST classification example is provided in the notebook ch-02_TF_High_Level_Libraries. The output from the TFLearn MNIST example is as follows:

Training Step: 5499 | total loss: 0.42119 | time: 1.817s
| Adam | epoch: 010 | loss: 0.42119 - acc: 0.8860 -- iter: 54900/55000
Training Step: 5500 | total loss: 0.40881 | time: 1.820s
| Adam | epoch: 010 | loss: 0.40881 - acc: 0.8854 -- iter: 55000/55000
--
Test accuracy: 0.9029

Note: You can get more information about TFLearn from the following link: http://tflearn.org/.

To summarize, we got to know about TFLearn and the different TFLearn layers and models. If you found this post useful, do check out the book Mastering TensorFlow 1.x, to explore advanced features of TensorFlow 1.x, and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet.

TensorFlow.js 0.11.1 releases!
How to Build TensorFlow Models for Mobile and Embedded devices
Distributed TensorFlow: Working with multiple GPUs and servers

Data cleaning is the worst part of data analysis, say data scientists

Amey Varangaonkar
04 Jun 2018
5 min read
The year was 2012. Harvard Business Review had famously declared the role of data scientist the 'sexiest job of the 21st century'. Companies were slowly working with more data than ever before, and the real, actionable, commercial value of that data was slowly beginning to emerge. Someone who could derive these actionable insights from the data was needed. The demand for data scientists was higher than ever.

Fast forward to 2018 - more data has been collected in the last 2 years than ever before. Data scientists are still in high demand, and the need for insights is higher than ever. There has been one significant change, though - the process of deriving insights has become more complex. If you ask data scientists, the first phase of this process, which involves data cleansing, has become a lot more cumbersome. So much so that it is no longer a myth that data scientists spend almost 80% of their time cleaning and readying the data for analysis.

Why data cleaning is a nightmare

In the recently conducted Packt Skill-Up survey, we asked data professionals what the worst part of the data analysis process was, and a staggering 50% responded with data cleaning.

Source: Packt Skill Up Survey

We dug into this to understand why so many data science professionals share this dislike of data cleaning, or scrubbing, as many call it.

Read the Skill Up report in full. Sign up to our weekly newsletter and download the PDF for free.

There is no consistent data format

Organizations these days work with a lot of data. Some of it is in a structured, readily understandable format. This kind of data is usually quite easy to clean, parse and analyze. However, some of the data is really messy and cannot be used as-is for analysis. This includes missing data, irregularly formatted data, and irrelevant data which is not worth analyzing at all. There is also the problem of working with unstructured data, which needs to be pre-processed before it is worth analyzing. Audio or video files, email messages, presentations, XML documents and web pages are some classic examples of this.

There's too much data to be cleaned

The volume of data that businesses deal with on a day-to-day basis is on the scale of terabytes or even petabytes. Making sense of all this data, coming from a variety of sources and in different formats, is undoubtedly a huge task. There is a whole host of tools designed to ease this process today, but sifting through large volumes of data and preparing it for analysis remains an incredibly tricky challenge.

Data cleaning is tricky and time-consuming

Data cleansing can be quite an exhausting and time-consuming task, especially for data scientists. Cleaning the data requires the removal of duplications, removing or replacing missing entries, correcting misfielded values, ensuring consistent formatting, and a host of other tasks which take a considerable amount of time. Once the data is cleaned, it needs to be placed in a secure location. Also, a log of the entire process needs to be kept to ensure the right data goes through the right process. All of this requires data scientists to create a well-designed data-scrubbing framework to avoid the risk of repetition. All of this is grunt work and requires a lot of manual effort. Sadly, there are no tools in the market which can effectively automate this process.
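To make the grunt work concrete, here is a small, illustrative cleaning pass in pandas; the file and column names are hypothetical, and a real pipeline would be driven by the business's own quality rules:

import pandas as pd

df = pd.read_csv('raw_data.csv')                              # hypothetical input file

df = df.drop_duplicates()                                     # remove duplications
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')   # correct misfielded values
df['date'] = pd.to_datetime(df['date'], errors='coerce')      # ensure consistent formatting
df = df.dropna(subset=['customer_id'])                        # drop rows missing a key field
df['region'] = df['region'].fillna('unknown')                 # replace missing entries

df.to_csv('clean_data.csv', index=False)                      # store the cleaned data in a known location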
Outsourcing the process is expensive

Given that data cleaning is a rather tedious job, many businesses think of outsourcing the task to third-party vendors. While this reduces a lot of time and effort on the company's end, it definitely increases the cost of the overall process. Many small and medium-scale businesses may not be able to afford this, and thus are heavily reliant on the data scientist to do the job for them.

You can hate it, but you cannot ignore it

It is quite obvious that data scientists need clean, ready-to-analyze data if they are to extract actionable business insights from it. Some data scientists equate data cleaning to donkey work, suggesting there's not a lot of innovation involved in this process. Others, however, believe data cleaning is vitally important and pay special attention to it, since once it is done right, most of the problems in data analysis are solved. It is very difficult to take advantage of the intrinsic value offered by a dataset if it does not adhere to the quality standards set by the business, making data cleaning a crucial component of the data analysis process.

Now that you know why data cleaning is essential, why not dive deeper into the technicalities? Check out our book Practical Data Wrangling for expert tips on turning your noisy data into relevant, insight-ready information using R and Python.

Read more

Cleaning Data in PDF Files
30 common data science terms explained
How to create a strong data science project portfolio that lands you a job

Visualizing BigQuery Data with Tableau

Sugandha Lahoti
04 Jun 2018
8 min read
Tableau is an interactive data visualization tool that can be used to create business intelligence dashboards. Much like most business intelligence tools, it can be used to pull and manipulate data from a number of sources. The difference is its dedication to help users create insightful data visualizations. Tableau's drag-and-drop interface makes it easy for users to explore data via elegant charts. It also includes an in-memory engine in order to speed up calculations on extremely large data sets. In today’s tutorial, we will be using Tableau Desktop for visualizing BigQuery Data. [box type="note" align="" class="" width=""]This article is an excerpt from the book, Learning Google BigQuery, written by Thirukkumaran Haridass and Eric Brown. This book is a comprehensive guide to mastering Google BigQuery to get intelligent insights from your Big Data.[/box] The following section explains how to use Tableau Desktop Edition to connect to BigQuery and get the data from BigQuery to create visuals: After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery: At this point, all the tables in your dataset should be displayed on the left: You can drag and drop the table you are interested in using to the middle section labeled Drop Tables Here. In this case, we want to query the Google Analytics BigQuery test data, so we will click where it says New Custom SQL and enter the following query in the dialog: SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits FROM `google.com:analytics- bigquery.LondonCycleHelmet.ga_sessions_20130910` GROUP BY Medium Now we can click on Update Now to view the first 10,000 rows of our data. We can also do some simple transformations on our columns, such as changing string values to dates and many others. At the bottom, click on the tab titled Sheet 1 to enter the worksheet view. Tableau's interface allows users to simply drag and drop dimensions and metrics from the left side of the report into the central part to create simple text charts, with a feel much like Excel's pivot chart functionality. This makes Tableau easy to transition to for Excel users. From the Dimensions section on the left-hand-side navigation, drag and drop the Medium dimension into the sheet section. Then drag the Visits metric in the Metric section on the left-hand-side navigation to the Text sub-section in the Marks section. This will create a simple text chart with data from the original query: On the right, click on the button marked Show Me. This should bring up a screen with icons for each graph type that can be created in Tableau: Tableau helps by shading graph types that are not available based on the data that is currently selected in the report. It will also make suggestions based on the data available. In this case, a bar chart has been preselected for us as our data is a text dimension and a numeric metric. Click on the bar chart. Once clicked, the default sideways bar chart will appear with the data we have selected. Click on the Swap Rows and Columns in the icon bar at the top of the screen to flip the chart from horizontal to vertical: Map charts in Tableau One of Tableau's strengths is its ease of use when creating a number of different types of charts. This is true when creating maps, especially because maps can be very painful to create using other tools. Here is the way to create a simple map in Tableau using BigQuery public data. 
The first few steps are the same as in the preceding example:

After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery.
At this point, all the tables in your dataset should be displayed on the left-hand side.
Click where it says New Custom SQL and enter the following query in the dialog:

SELECT zipcode, SUM(population) AS population
FROM `bigquery-public-data.census_bureau_usa.population_by_zip_2010`
GROUP BY zipcode
ORDER BY population desc

This data is from the United States Census from 2010. The query returns all zip codes in the USA, sorted from most populous to least populous.

At the bottom, click on the tab titled Sheet 1 to enter the worksheet view.
Double-click on the zipcode dimension in the dimensions section on the left navigation.

Clicking on a dimension of zip codes (or any other appropriately formatted location dimension, such as latitude/longitude, country names, state names, and so on) will automatically create a map in Tableau.

Drag the population metric from the metrics section on the left navigation and drop it on the color tab in the marks section.

The map will now show the most populous zip codes shaded darker than the less populous zip codes. The map chart also includes zoom features in order to make dealing with large maps easy. In the top-left corner of the map, there is a magnifying glass icon that holds the map zoom features. Clicking on the arrow at the bottom of this icon opens more features. The icon with a rectangle and a magnifying glass is the selection tool (the first icon to the right of the arrow when hovering over it).

Click on this icon and then on the map to select a section of the map to zoom into. The image here shows the result of zooming into the California area of the United States; the map now shows the areas of the state that are the most populous.

Create a word cloud in Tableau

Word clouds are great visualizations for finding the words that are most referenced in books, publications, and social media. This section will cover creating a word cloud in Tableau using BigQuery public data. The first few steps are the same as in the preceding example:

After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery.
At this point, all the tables in your dataset should be displayed on the left.
Click where it says New Custom SQL and enter the following query in the dialog:

SELECT word, SUM(word_count) word_count
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY word_count desc

The dataset is from the works of William Shakespeare. The query returns a list of all the words in his works, along with a count of the times each word appears in one of his works.

At the bottom, click on the tab titled Sheet 1 to enter the worksheet view.
From the dimensions section, drag and drop the word dimension into the text tab in the marks section.
From the measures section, drag and drop the word_count measure onto the size tab in the marks section. There will now be two tabs in use in the marks section.
Right-click on the size tab labeled word and select Measure | Count.

This will create what is called a tree map. In this example, there are far too many words in the list for the visualization to be useful.

Drag and drop the word_count measure from the measures section to the filters section. When prompted with How do you want to filter on word_count, select Sum and click on Next.
Select At Least for your condition and type 2000 in the dialog. Click on OK. This will return only those words that have a word count of at least 2,000.
Use the dropdown in the marks card to select Text.
Drag and drop the word_count measure from the measures section to the color tab in the marks section. This will color each word based on the count for that word.

You should be left with a color-coded word cloud. Other charts can now be created as individual worksheet tabs, and tabs can then be combined to make what Tableau calls a dashboard. The process of creating a dashboard here is a bit more cumbersome than creating one in Google Data Studio, but Tableau offers a great deal more customization for its dashboards. This, coupled with all the other features it offers, makes Tableau a much more attractive option, especially for enterprise users.

We learnt various features of Tableau and how to use it for visualizing BigQuery data. To learn about other third-party tools for reporting and visualization, such as R and Google Data Studio, check out the book Learning Google BigQuery.

Tableau is the most powerful and secure end-to-end analytics platform - Interview Insights
Tableau 2018.1 brings new features to help organizations easily scale analytics
Getting started with Data Visualization in Tableau
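As a closing note on the word cloud recipe, the at-least-2,000 filter could equally be pushed into the custom SQL itself, so that BigQuery does the filtering before Tableau ever sees the rows. This is a sketch using the same public dataset, not a step from the book:

-- Push the word-count threshold into BigQuery with HAVING
SELECT word, SUM(word_count) AS word_count
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
HAVING SUM(word_count) >= 2000
ORDER BY word_count DESC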

We must change how we think about AI, urge AI founding fathers

Neil Aitken
31 May 2018
9 min read
In Manhattan, nearly 15,000 taxis make around 30 journeys each, per day. That's nearly half a million paid trips. The yellow cabs are part of the never-ending, slow progression of vehicles which churn through the streets of New York. The good news is, after a century of worsening traffic, congestion is about to be ameliorated, at least to a degree. Researchers at MIT announced this week that they have developed an algorithm to optimise the way taxis find their customers. Their product is allegedly so efficient, it can reduce the required number of cabs (for now, the ones with human drivers) in Manhattan by a third. That's a non-trivial improvement. The trick, apparently, is to use the cabs as a hustler might cue the ball in pool – lining up the next pickup to start where the last drop-off ended.

The technology behind the improvement offered by the MIT research team is the same one that is behind most of the incredible technology news stories of the last 3 years – Artificial Intelligence. AI is now a part of most of the digital interactions we have. It fuels the recommendation engines in YouTube, Spotify and Netflix. It shows you products you might like in Google's search results and on Amazon's homepage. Undoubtedly, AI is the hot topic of the time – as you cannot possibly have failed to notice.

How AI was created – and nearly died

AI was, until recently, a long forgotten scientific curiosity, employed seriously only in sci-fi movies. The technology fell into a 'winter' – a time when AI-related projects couldn't get funding and decision makers had given up on the technology – in the late 1980s. It was at that time that much of the fundamental work which underpins today's AI, concepts like neural networks and backpropagation, was codified.

Artificial Intelligence is now enjoying a rebirth. Almost every new idea funded by venture capitalists has AI baked in. The potential excites business owners, especially those involved in the technology sphere, and scares governments in equal measure. It offers better profits and the potential for mass unemployment as if they are two sides of the same coin. It is a once-in-a-generation technology improvement, similar to air conditioning, the mass-produced motor car and the smartphone, in that it can be applied to all aspects of the economy at the same time. Just as the iPhone propelled telecommunications technology forward, and created billions of dollars of sales for phone companies selling mobile data plans, AI is fueling totally new businesses and making existing operations significantly more efficient.

Behind the fanfare associated with AI, however, lies a simple truth. Today's AI algorithms use what's called 'narrow' or 'domain specific' intelligence. In simple terms, each current AI implementation is specific to the job it is given. IBM trained their AI system 'Watson' to beat human contestants at 'Jeopardy!' When Google wanted to build an 'AI product' that could beat a living counterpart at the Chinese board game 'Go', they created a new AI system. And so on. A new task requires a new AI system.

Judea Pearl, inventor of Bayesian networks and Turing Awardee
On AI systems that can move from predicting what will happen to what will cause something

Now, one of the people behind those original concepts from the 1980s, which underpin today's AI solutions, is back with an even bigger idea which might push AI forward.
Judea Pearl, Chancellor's professor of computer science and statistics at UCLA, and a distinguished visiting professor at the Technion, Israel Institute of Technology, was awarded the Turing Award in 2011 for the Bayesian mathematical models which gave modern AI its strength. Pearl's fundamental contribution to computer science was in providing the logic and decision-making framework for computers to operate under uncertainty. Some say it was he who provided the spark which thawed that AI winter.

Today, he laments the current state of AI, concerned that the field has evolved very little in the three decades since his important theory was presented. Pearl likens current AI implementations to simple tools which can tell you what's likely to come next, based on the recognition of a familiar pattern. For example, a medical AI algorithm might be able to look at X-rays of a human chest and 'discern' that the patient has, or does not have, lung cancer based on patterns it has learnt from its training datasets. The AI in this scenario doesn't 'know' what lung cancer is or what a tumor is. Importantly, it is a very long way from understanding that smoking can cause the affliction.

What's needed in AI next, says Pearl, is a critical difference: AIs which are evolved to the point where they can determine not just what will happen next, but what will cause it. It's a fundamental improvement, of the same magnitude as his earlier contributions.

Causality – what Pearl is proposing – is one of the most basic units of scientific thought and progress. The ability to conduct a repeatable experiment, showing that A caused B, in multiple locations, and have independent peers review the results is one of the fundamentals of establishing truth. In his most recent publication, 'The Book Of Why', Pearl outlines how we can get AI from where it is now to where it can develop an understanding of these causal relationships. He believes the first step is to cement the building blocks of reality – 'what is a lung', 'what is smoke' – and that we'll be able to do this in the next 10 years.

Geoff Hinton, inventor of backprop and capsule nets
On AI which more closely mimics the human brain

Geoff Hinton was the mind behind backpropagation, another of the fundamental technologies which has brought AI to the point it is at today. To progress AI, however, he says we might have to start all over again. Hinton has developed (and produced two papers for the University of Toronto to articulate) a new way of training AI systems, involving something he calls 'Capsule Networks' – a concept he's been working on for 30 years, in an effort to improve the capabilities of the backpropagation algorithms he developed.

Capsule networks operate in a manner similar to the human brain. When we see an image, our brain breaks it down into its components and processes them in parallel. Some brain neurons recognise edges through contrast differences. Others look for corners by examining the points at which edges intersect. Capsule Networks are similar, several acting on a picture at one time, identifying, for example, an ear or a nose on an animal, irrespective of the angle from which it is being viewed.

This is a big deal because, until now, CNNs (convolutional neural networks), the set of AI algorithms most often used in image and video recognition systems, could recognize images as well as humans do. CNNs, however, find it hard to recognize images if their angle is changed.
It’s too early to judge whether capsule networks are the key to the next step in the AI revolution, but in many tasks, Capsule Networks are identifying images faster and more accurately than current capabilities allow. Andrew Ng, Chief Scientist at Baidu On AI that can learn without humans Andrew Ng is the co-inventor of Google Brain, the team and project that Alphabet put together in 2011 to explore Artificial Intelligence. He now works for Baidu, China’s most successful search engine – analogous in size and scope to Google in the rest of the world. At the moment, he heads up Baidu’s Silicon Valley AI research facility. Beyond concerns over potential job displacement caused by AI, an issue so significant he says it is perhaps all we should be thinking about when it comes to Artificial Intelligence, he suggests that, in the future, the most progress will be made when AI systems can team themselves without human involvement. At the moment, training an AI, even on something that, to us is simple, such as what a cat looks like, is a complicated process. The procedure involves ‘supervised learning.’ It’s shown a lot of pictures (when they did this at Google, they used 10 million images), some of which are cats - labelled appropriately by humans. Once a sufficient level of ‘education’ has been undertaken, the AI can then accurately label cats, most of the time. Ng thinks supervision is problematic, he describes it as having an Achilles heel in the form of the quantity of data that is required. To go beyond current capabilities, says Ng, will require a completely new type of technology – one which can learn through ‘unsupervised learning’ -  machines learning from data that has not been classified by humans. Progress on unsupervised learning is slow. At both Baidu and Google, engineers are focussing on constrained versions of unsupervised learning such as training AI systems to learn about a human face and then using them to create a face themselves. The activity requires that the AI develops what we would call an ‘internal representation’ of a face – something which is required in any unsupervised learning. Other avenues to train without supervision include, ingeniously, pitting an AI system against a computer game – an environment in which they receive feedback (through points awarded in the game) for ‘constructive’ activities, but within which they are not taught directly by a human. Next generation AI depends on ‘scrubbing away’ existing assumptions Artificial Intelligence, as it stands will deliver economy wide efficiency improvements, the likes of which we have not seen in decades. It seems incredible to think that the field is still in its infancy when it can deliver such substantial benefits – like reduced traffic congestion, lower carbon emissions and saved time in New York Taxis. But it is. Isaac Azimov who developed his own concepts behind how Artificial Intelligence might be trained with simple rules said “Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in.” The author should rest assured. Between them, Pearl, Hinton and Ng are each taking revolutionary approaches to elevate AI beyond even the incredible heights it has reached, and starting without reference to the concepts which have brought us this far. 
5 polarizing Quotes from Professor Stephen Hawking on artificial intelligence Toward Safe AI – Maximizing your control over Artificial Intelligence Decoding the Human Brain for Artificial Intelligence to make smarter decisions
How to build Deep convolutional GAN using TensorFlow and Keras

Savia Lobo
29 May 2018
13 min read
In this tutorial, we will learn to build both simple and deep convolutional GAN models with the help of TensorFlow and Keras deep learning frameworks. [box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango.[/box]

Simple GAN with TensorFlow

For building the GAN with TensorFlow, we build three networks, two discriminator models and one generator model, with the following steps:

Start by adding the hyper-parameters for defining the network:

# graph hyperparameters
g_learning_rate = 0.00001
d_learning_rate = 0.01
n_x = 784  # number of pixels in the MNIST image
# number of hidden layers for generator and discriminator
g_n_layers = 3
d_n_layers = 1
# neurons in each hidden layer
g_n_neurons = [256, 512, 1024]
d_n_neurons = [256]
# define parameter dictionary
d_params = {}
g_params = {}
activation = tf.nn.leaky_relu
w_initializer = tf.glorot_uniform_initializer
b_initializer = tf.zeros_initializer

Next, define the generator network:

z_p = tf.placeholder(dtype=tf.float32, name='z_p', shape=[None, n_z])
layer = z_p
# add generator network weights, biases and layers
with tf.variable_scope('g'):
    for i in range(0, g_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        g_params[w_name] = tf.get_variable(
            name=w_name,
            shape=[n_z if i == 0 else g_n_neurons[i - 1],
                   g_n_neurons[i]],
            initializer=w_initializer())
        b_name = 'b_{0:04d}'.format(i)
        g_params[b_name] = tf.get_variable(
            name=b_name,
            shape=[g_n_neurons[i]],
            initializer=b_initializer())
        layer = activation(
            tf.matmul(layer, g_params[w_name]) + g_params[b_name])
    # output (logit) layer
    i = g_n_layers
    w_name = 'w_{0:04d}'.format(i)
    g_params[w_name] = tf.get_variable(
        name=w_name,
        shape=[g_n_neurons[i - 1], n_x],
        initializer=w_initializer())
    b_name = 'b_{0:04d}'.format(i)
    g_params[b_name] = tf.get_variable(
        name=b_name,
        shape=[n_x],
        initializer=b_initializer())
    g_logit = tf.matmul(layer, g_params[w_name]) + g_params[b_name]
    g_model = tf.nn.tanh(g_logit)

Next, define the weights and biases for the two discriminator networks that we shall build:

with tf.variable_scope('d'):
    for i in range(0, d_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        d_params[w_name] = tf.get_variable(
            name=w_name,
            shape=[n_x if i == 0 else d_n_neurons[i - 1],
                   d_n_neurons[i]],
            initializer=w_initializer())
        b_name = 'b_{0:04d}'.format(i)
        d_params[b_name] = tf.get_variable(
            name=b_name,
            shape=[d_n_neurons[i]],
            initializer=b_initializer())
    # output (logit) layer
    i = d_n_layers
    w_name = 'w_{0:04d}'.format(i)
    d_params[w_name] = tf.get_variable(
        name=w_name,
        shape=[d_n_neurons[i - 1], 1],
        initializer=w_initializer())
    b_name = 'b_{0:04d}'.format(i)
    d_params[b_name] = tf.get_variable(
        name=b_name,
        shape=[1],
        initializer=b_initializer())

Now, using these parameters, build the discriminator that takes the real images as input and outputs the classification:

# define discriminator_real
# input real images
x_p = tf.placeholder(dtype=tf.float32, name='x_p', shape=[None, n_x])
layer = x_p
with tf.variable_scope('d'):
    for i in range(0, d_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        b_name = 'b_{0:04d}'.format(i)
        layer = activation(
            tf.matmul(layer, d_params[w_name]) + d_params[b_name])
        layer = tf.nn.dropout(layer, 0.7)
    # output (logit) layer
    i = d_n_layers
    w_name = 'w_{0:04d}'.format(i)
    b_name = 'b_{0:04d}'.format(i)
    d_logit_real = tf.matmul(layer, d_params[w_name]) + d_params[b_name]
    d_model_real = tf.nn.sigmoid(d_logit_real)

Next, build another discriminator network with the same parameters, but providing the output of the generator as input:

# define discriminator_fake
# input generated fake images
z = g_model
layer = z
with tf.variable_scope('d'):
    for i in range(0, d_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        b_name = 'b_{0:04d}'.format(i)
        layer = activation(
            tf.matmul(layer, d_params[w_name]) + d_params[b_name])
        layer = tf.nn.dropout(layer, 0.7)
    # output (logit) layer
    i = d_n_layers
    w_name = 'w_{0:04d}'.format(i)
    b_name = 'b_{0:04d}'.format(i)
    d_logit_fake = tf.matmul(layer, d_params[w_name]) + d_params[b_name]
    d_model_fake = tf.nn.sigmoid(d_logit_fake)

Now that we have the three networks built, the connection between them is made using the loss, optimizer, and training functions. While training the generator, we only train the generator's parameters, and while training the discriminator, we only train the discriminator's parameters. We specify this using the var_list parameter to the optimizer's minimize() function. Here is the complete code for defining the loss, optimizer, and training function for both kinds of network:

g_loss = -tf.reduce_mean(tf.log(d_model_fake))
d_loss = -tf.reduce_mean(tf.log(d_model_real) + tf.log(1 - d_model_fake))
g_optimizer = tf.train.AdamOptimizer(g_learning_rate)
d_optimizer = tf.train.GradientDescentOptimizer(d_learning_rate)
g_train_op = g_optimizer.minimize(g_loss,
                                  var_list=list(g_params.values()))
d_train_op = d_optimizer.minimize(d_loss,
                                  var_list=list(d_params.values()))

Now that we have defined the models, we have to train them. The training is done as per the following algorithm:

For each epoch:
    For each batch:
        get real images x_batch
        generate noise z_batch
        train discriminator using z_batch and x_batch
        generate noise z_batch
        train generator using z_batch

The complete code for training from the notebook is as follows:

n_epochs = 400
batch_size = 100
n_batches = int(mnist.train.num_examples / batch_size)
n_epochs_print = 50

with tf.Session() as tfs:
    tfs.run(tf.global_variables_initializer())
    for epoch in range(n_epochs):
        epoch_d_loss = 0.0
        epoch_g_loss = 0.0
        for batch in range(n_batches):
            x_batch, _ = mnist.train.next_batch(batch_size)
            x_batch = norm(x_batch)
            z_batch = np.random.uniform(-1.0, 1.0,
                                        size=[batch_size, n_z])
            feed_dict = {x_p: x_batch, z_p: z_batch}
            _, batch_d_loss = tfs.run([d_train_op, d_loss],
                                      feed_dict=feed_dict)
            z_batch = np.random.uniform(-1.0, 1.0,
                                        size=[batch_size, n_z])
            feed_dict = {z_p: z_batch}
            _, batch_g_loss = tfs.run([g_train_op, g_loss],
                                      feed_dict=feed_dict)
            epoch_d_loss += batch_d_loss
            epoch_g_loss += batch_g_loss
        if epoch % n_epochs_print == 0:
            average_d_loss = epoch_d_loss / n_batches
            average_g_loss = epoch_g_loss / n_batches
            print('epoch: {0:04d} d_loss = {1:0.6f} g_loss = {2:0.6f}'
                  .format(epoch, average_d_loss, average_g_loss))
            # predict images using generator model trained
            x_pred = tfs.run(g_model, feed_dict={z_p: z_test})
            display_images(x_pred.reshape(-1, pixel_size, pixel_size))

We printed the generated images every 50 epochs. As we can see, the generator was producing just noise in epoch 0, but by epoch 350, it was trained to produce much better shapes of handwritten digits. You can try experimenting with epochs, regularization, network architecture, and other hyper-parameters to see if you can produce even faster and better results.
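Note that the notebook code above references a few names that this excerpt never defines: mnist, the norm() and display_images() helpers, and the n_z, z_test, and pixel_size variables. The following is a minimal sketch of plausible definitions, written against the TensorFlow 1.x MNIST helpers; the n_z value of 256 is inferred from the GAN summary shown later in the Keras section, and everything here should be treated as an assumption rather than the book's exact notebook code:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# load MNIST; each image is a flat vector of 28 x 28 = 784 pixels
mnist = input_data.read_data_sets('./mnist', one_hot=True)
pixel_size = 28
n_z = 256          # size of the noise vector fed to the generator
n_samples = 8      # number of images to preview at each print epoch
z_test = np.random.uniform(-1.0, 1.0, size=[n_samples, n_z])

def norm(x):
    # rescale pixel values from [0, 1] to [-1, 1] so that real images
    # match the tanh output range of the generator
    return (x - 0.5) * 2.0

def display_images(images):
    # plot a row of generated digit images
    fig, axes = plt.subplots(1, len(images), figsize=(len(images), 1))
    for ax, img in zip(axes, images):
        ax.imshow(img, cmap='gray')
        ax.axis('off')
    plt.show()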
Simple GAN with Keras

Now let us implement the same model in Keras. The hyper-parameter definitions remain the same as in the last section:

# graph hyperparameters
g_learning_rate = 0.00001
d_learning_rate = 0.01
n_x = 784  # number of pixels in the MNIST image
# number of hidden layers for generator and discriminator
g_n_layers = 3
d_n_layers = 1
# neurons in each hidden layer
g_n_neurons = [256, 512, 1024]
d_n_neurons = [256]

Next, define the generator network:

# define generator
g_model = Sequential()
g_model.add(Dense(units=g_n_neurons[0],
                  input_shape=(n_z,),
                  name='g_0'))
g_model.add(LeakyReLU())
for i in range(1, g_n_layers):
    g_model.add(Dense(units=g_n_neurons[i],
                      name='g_{}'.format(i)))
    g_model.add(LeakyReLU())
g_model.add(Dense(units=n_x, activation='tanh', name='g_out'))
print('Generator:')
g_model.summary()
g_model.compile(loss='binary_crossentropy',
                optimizer=keras.optimizers.Adam(lr=g_learning_rate))

In the Keras example, we do not define two discriminator networks as we did in the TensorFlow example. Instead, we define one discriminator network and then stitch the generator and discriminator together into the GAN network. The GAN network is then used to train the generator parameters only, and the discriminator network is used to train the discriminator parameters:

# define discriminator
d_model = Sequential()
d_model.add(Dense(units=d_n_neurons[0],
                  input_shape=(n_x,),
                  name='d_0'))
d_model.add(LeakyReLU())
d_model.add(Dropout(0.3))
for i in range(1, d_n_layers):
    d_model.add(Dense(units=d_n_neurons[i],
                      name='d_{}'.format(i)))
    d_model.add(LeakyReLU())
    d_model.add(Dropout(0.3))
d_model.add(Dense(units=1, activation='sigmoid', name='d_out'))
print('Discriminator:')
d_model.summary()
d_model.compile(loss='binary_crossentropy',
                optimizer=keras.optimizers.SGD(lr=d_learning_rate))

This is what the discriminator model looks like:

Discriminator:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
d_0 (Dense)                  (None, 256)               200960
_________________________________________________________________
leaky_re_lu_4 (LeakyReLU)    (None, 256)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0
_________________________________________________________________
d_out (Dense)                (None, 1)                 257
=================================================================
Total params: 201,217
Trainable params: 201,217
Non-trainable params: 0
_________________________________________________________________

Next, define the GAN network, and turn the trainable property of the discriminator model to False, since the GAN would only be used to train the generator:

# define GAN network
d_model.trainable = False
z_in = Input(shape=(n_z,), name='z_in')
x_in = g_model(z_in)
gan_out = d_model(x_in)
gan_model = Model(inputs=z_in, outputs=gan_out, name='gan')
print('GAN:')
gan_model.summary()
gan_model.compile(loss='binary_crossentropy',
                  optimizer=keras.optimizers.Adam(lr=g_learning_rate))

This is what the GAN model looks like:

GAN:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
z_in (InputLayer)            (None, 256)               0
_________________________________________________________________
sequential_1 (Sequential)    (None, 784)               1526288
_________________________________________________________________
sequential_2 (Sequential)    (None, 1)                 201217
=================================================================
Total params: 1,727,505
Trainable params: 1,526,288
Non-trainable params: 201,217
_________________________________________________________________

Great, now that we have defined the three models, we have to train them. The training is done as per the following algorithm:

For each epoch:
    For each batch:
        get real images x_batch
        generate noise z_batch
        generate images g_batch using generator model
        combine g_batch and x_batch into x_in and create labels y_out
        set discriminator model as trainable
        train discriminator using x_in and y_out
        generate noise z_batch
        set x_in = z_batch and labels y_out = 1
        set discriminator model as non-trainable
        train gan model using x_in and y_out (effectively training the generator model)

For setting the labels, we apply the values 0.9 and 0.1 for real and fake images respectively. Generally, it is suggested that you use label smoothing by picking a random value from 0.0 to 0.3 for fake data and from 0.8 to 1.0 for real data. Here is the complete code for training from the notebook:

n_epochs = 400
batch_size = 100
n_batches = int(mnist.train.num_examples / batch_size)
n_epochs_print = 50

for epoch in range(n_epochs + 1):
    epoch_d_loss = 0.0
    epoch_g_loss = 0.0
    for batch in range(n_batches):
        x_batch, _ = mnist.train.next_batch(batch_size)
        x_batch = norm(x_batch)
        z_batch = np.random.uniform(-1.0, 1.0,
                                    size=[batch_size, n_z])
        g_batch = g_model.predict(z_batch)
        x_in = np.concatenate([x_batch, g_batch])
        y_out = np.ones(batch_size * 2)
        y_out[:batch_size] = 0.9
        y_out[batch_size:] = 0.1
        d_model.trainable = True
        batch_d_loss = d_model.train_on_batch(x_in, y_out)
        z_batch = np.random.uniform(-1.0, 1.0,
                                    size=[batch_size, n_z])
        x_in = z_batch
        y_out = np.ones(batch_size)
        d_model.trainable = False
        batch_g_loss = gan_model.train_on_batch(x_in, y_out)
        epoch_d_loss += batch_d_loss
        epoch_g_loss += batch_g_loss
    if epoch % n_epochs_print == 0:
        average_d_loss = epoch_d_loss / n_batches
        average_g_loss = epoch_g_loss / n_batches
        print('epoch: {0:04d} d_loss = {1:0.6f} g_loss = {2:0.6f}'
              .format(epoch, average_d_loss, average_g_loss))
        # predict images using generator model trained
        x_pred = g_model.predict(z_test)
        display_images(x_pred.reshape(-1, pixel_size, pixel_size))

We printed the results every 50 epochs, up to 350 epochs. The model slowly learns to generate good-quality images of handwritten digits from the random noise. There are so many variations of GANs that it would take another book to cover all the different kinds; however, the implementation techniques are almost similar to what we have shown here.

Deep Convolutional GAN with TensorFlow and Keras

In DCGAN, both the discriminator and generator are implemented using a deep convolutional network:
In this example, we decided to implement the generator as the following network:

Generator:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
g_in (Dense)                 (None, 3200)              822400
_________________________________________________________________
g_in_act (Activation)        (None, 3200)              0
_________________________________________________________________
g_in_reshape (Reshape)       (None, 5, 5, 128)         0
_________________________________________________________________
g_0_up2d (UpSampling2D)      (None, 10, 10, 128)       0
_________________________________________________________________
g_0_conv2d (Conv2D)          (None, 10, 10, 64)        204864
_________________________________________________________________
g_0_act (Activation)         (None, 10, 10, 64)        0
_________________________________________________________________
g_1_up2d (UpSampling2D)      (None, 20, 20, 64)        0
_________________________________________________________________
g_1_conv2d (Conv2D)          (None, 20, 20, 32)        51232
_________________________________________________________________
g_1_act (Activation)         (None, 20, 20, 32)        0
_________________________________________________________________
g_2_up2d (UpSampling2D)      (None, 40, 40, 32)        0
_________________________________________________________________
g_2_conv2d (Conv2D)          (None, 40, 40, 16)        12816
_________________________________________________________________
g_2_act (Activation)         (None, 40, 40, 16)        0
_________________________________________________________________
g_out_flatten (Flatten)      (None, 25600)             0
_________________________________________________________________
g_out (Dense)                (None, 784)               20071184
=================================================================
Total params: 21,162,496
Trainable params: 21,162,496
Non-trainable params: 0

The generator is a stronger network, having three convolutional layers followed by tanh activation.
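The book presents this architecture via its summary rather than code, so as a guide, here is one plausible Keras definition that reproduces the layer names, output shapes, and parameter counts above. The 5 x 5 kernel size is inferred from the parameter counts, and the tanh activations follow the description in the text; treat this as a sketch under those assumptions, not the book's exact code:

from keras.models import Sequential
from keras.layers import Dense, Activation, Reshape, Flatten
from keras.layers import UpSampling2D, Conv2D

n_z = 256   # noise vector size, per the GAN summary shown below
n_x = 784   # 28 x 28 output pixels

g_model = Sequential()
g_model.add(Dense(3200, input_shape=(n_z,), name='g_in'))
g_model.add(Activation('tanh', name='g_in_act'))
g_model.add(Reshape((5, 5, 128), name='g_in_reshape'))
# three upsample + convolution blocks: 5x5 -> 10x10 -> 20x20 -> 40x40
for i, filters in enumerate([64, 32, 16]):
    g_model.add(UpSampling2D(size=(2, 2), name='g_{}_up2d'.format(i)))
    g_model.add(Conv2D(filters, kernel_size=(5, 5), padding='same',
                       name='g_{}_conv2d'.format(i)))
    g_model.add(Activation('tanh', name='g_{}_act'.format(i)))
g_model.add(Flatten(name='g_out_flatten'))
g_model.add(Dense(n_x, activation='tanh', name='g_out'))
g_model.summary()   # should match the summary printed above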
We define the discriminator network as follows:

Discriminator:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
d_0_reshape (Reshape)        (None, 28, 28, 1)         0
_________________________________________________________________
d_0_conv2d (Conv2D)          (None, 28, 28, 64)        1664
_________________________________________________________________
d_0_act (Activation)         (None, 28, 28, 64)        0
_________________________________________________________________
d_0_maxpool (MaxPooling2D)   (None, 14, 14, 64)        0
_________________________________________________________________
d_out_flatten (Flatten)      (None, 12544)             0
_________________________________________________________________
d_out (Dense)                (None, 1)                 12545
=================================================================
Total params: 14,209
Trainable params: 14,209
Non-trainable params: 0
_________________________________________________________________

The GAN network is composed of the discriminator and generator as demonstrated previously:

GAN:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
z_in (InputLayer)            (None, 256)               0
_________________________________________________________________
g (Sequential)               (None, 784)               21162496
_________________________________________________________________
d (Sequential)               (None, 1)                 14209
=================================================================
Total params: 21,176,705
Trainable params: 21,162,496
Non-trainable params: 14,209
_________________________________________________________________

When we run this model for 400 epochs, the DCGAN is able to generate high-quality digits starting from epoch 100 itself. DCGANs have been used for style transfer, generation of images and titles, and for image algebra, namely taking parts of one image and adding them to parts of another image. We built a simple GAN in TensorFlow and Keras and applied it to generate images from the MNIST dataset. We also built a DCGAN where the generator and discriminator consisted of convolutional networks. Do check out the book Mastering TensorFlow 1.x to explore advanced features of TensorFlow 1.x and obtain in-depth knowledge of TensorFlow for solving artificial intelligence problems. 5 reasons to learn Generative Adversarial Networks (GANs) in 2018 Implementing a simple Generative Adversarial Network (GANs) Getting to know Generative Models and their types
How to optimize MySQL 8 servers and clients

Amey Varangaonkar
28 May 2018
11 min read
Our article focuses on optimization for MySQL 8 database servers and clients. We start with optimizing the server, followed by optimizing MySQL 8 client-side entities. The material is most relevant to database administrators who need to ensure performance and scalability across multiple servers, but it will also help developers prepare scripts (including setting up the database) and help users who run MySQL for development and testing to maximize their productivity. [box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator's Guide, written by Chintan Mehta, Ankit Bhavsar, Hetal Oza and Subhash Shah. In this book, the authors have presented hands-on techniques for tackling the common and not-so-common issues that come up in the different administration-related tasks in MySQL 8.[/box]

Optimizing disk I/O

There are quite a few ways to configure storage devices to devote more and faster storage hardware to the database server. A major performance bottleneck is disk seeking (finding the correct place on the disk to read or write content). When the amount of data grows large enough to make caching impossible, the problem with disk seeks becomes apparent. We need at least one disk seek operation to read, and several disk seek operations to write, in large databases where data access is done more or less randomly. We should regulate or minimize the disk seek times using appropriate disks. In order to resolve the disk seek performance issue, we can increase the number of available disk spindles, symlink files to different disks, or stripe disks. The following are the details:

Using symbolic links: When using symbolic links, we can create Unix symbolic links for index and data files. The symlink points from default locations in the data directory to another disk in the case of MyISAM tables. These links may also be striped. This improves the seek and read times. The assumption is that the disk is not used concurrently for other purposes. Symbolic links are not supported for InnoDB tables; however, we can place InnoDB data and log files on different physical disks.

Striping: In striping, we have many disks. We put the first block on the first disk, the second block on the second disk, and so on; the Nth block goes on the (N % number_of_disks)th disk. If the stripe size is perfectly aligned, the normal data size will be less than the stripe size, which helps to improve performance. Striping is dependent on the stripe size and the operating system. In an ideal case, we would benchmark the application with different stripe sizes. The speed difference while striping depends on the parameters we have used, such as the stripe size, and on the number of disks. We have to choose whether we want to optimize for random access or sequential access.

To gain reliability, we may decide to set up with striping and mirroring (RAID 0+1). RAID stands for Redundant Array of Independent Drives. This approach needs 2 x N drives to hold N drives of data. With good volume management software, we can manage this setup efficiently. There is another approach as well: depending on how critical the type of data is, we may vary the RAID level. For example, we can store really important data, such as host information and logs, on a RAID 0+1 or RAID N disk, whereas we can store semi-important data on a RAID 0 disk. In the case of RAID, parity bits are used to ensure the integrity of the data stored on each drive, so RAID N becomes a problem if we have too many write operations to be performed; the time required to update the parity bits in this case is high.

If it is not important to record when a file was last accessed, we can mount the file system with the -o noatime option. This option skips the last-access updates on the file system, which reduces the disk seek time. We can also make the file system update asynchronously; depending on whether the file system supports it, we can set the -o async option.

Using Network File System (NFS) with MySQL

While using a Network File System (NFS), varying issues may occur, depending on the operating system and the NFS version. The following are the details:

Data inconsistency is one issue with an NFS system. It may occur because of messages received out of order or lost network traffic. We can use TCP with the hard and intr mount options to avoid these issues.

MySQL data and log files may get locked and become unavailable for use if placed on NFS drives. If multiple instances of MySQL access the same data directory, it may result in locking issues. Improper shutdown of MySQL or a power outage are other causes of file system locking issues. The latest version of NFS supports advisory and lease-based locking, which helps in addressing the locking issues. Still, it is not recommended to share a data directory among multiple MySQL instances.

Maximum file size limitations must be understood to avoid any issues. With NFS 2, only the lower 2 GB of a file is accessible by clients. NFS 3 clients support larger files. The maximum file size depends on the local file system of the NFS server.

Optimizing the use of memory

In order to improve the performance of database operations, MySQL allocates buffers and caches memory. As a default, the MySQL server starts on a virtual machine (VM) with 512 MB of RAM. We can modify the default configuration for MySQL to run on limited memory systems. The following list describes the ways to optimize MySQL memory:

The memory area which holds cached InnoDB data for tables, indexes, and other auxiliary buffers is known as the InnoDB buffer pool. The buffer pool is divided into pages, and the pages hold multiple rows. The buffer pool is implemented as a linked list of pages for efficient cache management, and rarely used data is removed from the cache using an algorithm. Buffer pool size is an important factor for system performance. The innodb_buffer_pool_size system variable defines the buffer pool size. InnoDB allocates the entire buffer pool size at server startup. 50 to 75 percent of system memory is recommended for the buffer pool size.

With MyISAM, all threads share the key buffer. The key_buffer_size system variable defines the size of the key buffer. The index file is opened once for each MyISAM table opened by the server. For each concurrent thread that accesses the table, the data file is opened once. A table structure, column structures for each column, and a 3 x N sized buffer are allocated for each concurrent thread. The MyISAM storage engine maintains an extra row buffer for internal use.

The optimizer estimates the reading of multiple rows by scanning. The storage engine interface enables the optimizer to provide information about the recorded buffer size, and the size of the buffer can vary depending on the size of the estimate. In order to take advantage of row pre-fetching, InnoDB uses a variable-size buffering capability. This reduces the overhead of latching and B-tree navigation.
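To make the buffer sizing guidance above concrete, here is a minimal sketch of how these variables might be set in a my.cnf (or my.ini) configuration file. The values are illustrative assumptions for a dedicated server with 8 GB of RAM, not recommendations for any particular workload:

[mysqld]
# InnoDB buffer pool: 50 to 75 percent of system RAM on a dedicated server
innodb_buffer_pool_size = 6G
# MyISAM key buffer (only significant if MyISAM tables are in use)
key_buffer_size = 256M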
Memory mapping can be enabled for all MyISAM tables by setting the myisam_use_mmap system variable to 1. The size of an in-memory temporary table can be defined by the tmp_table_size system variable, and the maximum size of a heap table can be defined using the max_heap_table_size system variable. If an in-memory table becomes too large, MySQL automatically converts it from in-memory to on-disk; the storage engine for an on-disk temporary table is defined by the internal_tmp_disk_storage_engine system variable.

MySQL comes with the MySQL performance schema, a feature to monitor MySQL execution at low levels. The performance schema dynamically allocates memory by scaling its memory use to the actual server load, instead of allocating memory upon server startup. The memory, once allocated, is not freed until the server is restarted.

Thread-specific space is required for each thread that the server uses to manage client connections. The stack size is governed by the thread_stack system variable. The connection buffer and the result buffer are each governed by the net_buffer_length system variable: both start at net_buffer_length bytes, but are enlarged up to max_allowed_packet bytes as needed. All threads share the same base memory. All join clauses are executed in a single pass, and most joins can be executed without a temporary table. Temporary tables are memory-based hash tables; temporary tables that contain BLOB data and tables with large row lengths are stored on disk. A read buffer is allocated for each request that performs a sequential scan on a table, and its size is determined by the read_buffer_size system variable.

MySQL closes all tables that are not in use at once when the FLUSH TABLES statement or mysqladmin flush-tables command is executed. It marks all in-use tables to be closed when the current thread execution finishes, which frees in-use memory. FLUSH TABLES returns only after all tables have been closed.

It is possible to monitor the MySQL performance schema and sys schema for memory usage. Before we can execute commands for this, we have to enable memory instruments on the MySQL performance schema. This can be done by updating the ENABLED column of the performance schema setup_instruments table. The following is the query to view the available memory instruments in MySQL:

mysql> SELECT * FROM performance_schema.setup_instruments
       WHERE NAME LIKE '%memory%';

This query will return hundreds of memory instruments. We can narrow it down by specifying a code area. The following is an example that limits results to InnoDB memory instruments:

mysql> SELECT * FROM performance_schema.setup_instruments
       WHERE NAME LIKE '%memory/innodb%';

The following is the configuration to enable memory instruments:

performance-schema-instrument='memory/%=COUNTED'

The following is an example of querying memory instrument data in the memory_summary_global_by_event_name table in the performance schema:

mysql> SELECT * FROM performance_schema.memory_summary_global_by_event_name
       WHERE EVENT_NAME LIKE 'memory/innodb/buf_buf_pool'\G

EVENT_NAME: memory/innodb/buf_buf_pool
COUNT_ALLOC: 1
COUNT_FREE: 0
SUM_NUMBER_OF_BYTES_ALLOC: 137428992
SUM_NUMBER_OF_BYTES_FREE: 0
LOW_COUNT_USED: 0
CURRENT_COUNT_USED: 1
HIGH_COUNT_USED: 1
LOW_NUMBER_OF_BYTES_USED: 0
CURRENT_NUMBER_OF_BYTES_USED: 137428992
HIGH_NUMBER_OF_BYTES_USED: 137428992

It summarizes data by EVENT_NAME.

The following is an example of querying the sys schema to aggregate currently allocated memory by code area:

mysql> SELECT SUBSTRING_INDEX(event_name, '/', 2) AS code_area,
       sys.format_bytes(SUM(current_alloc)) AS current_alloc
       FROM sys.x$memory_global_by_current_bytes
       GROUP BY SUBSTRING_INDEX(event_name, '/', 2)
       ORDER BY SUM(current_alloc) DESC;

Performance benchmarking

We must consider the following factors when measuring performance:

While measuring the speed of a single operation or a set of operations, it is important to simulate a heavy database workload for benchmarking.

Test results may differ in different environments.

Depending on the workload, certain MySQL features may not help with performance.

MySQL 8 supports measuring the performance of individual statements. If we want to measure the speed of any SQL expression or function, the BENCHMARK() function is used. The following is the syntax of the function:

BENCHMARK(loop_count, expression)

The output of the BENCHMARK function is always zero; the speed can be measured from the elapsed time printed by MySQL in the output. The following is an example:

mysql> select benchmark(1000000, 1+1);

From the preceding example, we can find that the time taken to calculate 1+1 a million times is about 0.15 seconds. Other aspects involved in optimizing MySQL servers and clients include optimizing locking operations, examining thread information, and more. To know more about these techniques, you may check out the book MySQL 8 Administrator's Guide. SQL Server recovery models to effectively backup and restore your database Get SQL Server user management right 4 Encryption options for your SQL Server
How to use M functions within Microsoft Power BI for querying data

Amarabha Banerjee
21 May 2018
10 min read
Microsoft Power BI Desktop contains a rich set of data source connectors and transformation capabilities that support the integration and enhancement of source data. These features are all driven by a powerful functional language and query engine, M, which leverages source system resources when possible and can greatly extend the scope and robustness of the data retrieval process beyond the possibilities of the standard query editor interface alone. As with almost all BI projects, the design and development of the data access and retrieval process has great implications for the analytical value, scalability, and sustainability of the overall Power BI solution. [box type="note" align="" class="" width=""]Our article is an excerpt from the book Microsoft Power BI Cookbook, written by Brett Powell. This book shows how to leverage Microsoft Power BI and the development tools to create better data driven analytics and visualizations.[/box]

In this article, we dive into Power BI Desktop's Get Data experience and go through the process of establishing and managing data source connections and queries. Examples are provided of using the Query Editor interface and the M language directly to construct and refine queries to meet common data transformation and cleansing needs. In practice, and as per the examples, a combination of both tools is recommended to aid the query development process.

Viewing and analyzing M functions

Every time you click on a button to connect to any of Power BI Desktop's supported data sources or apply any transformation to a data source object, such as changing a column's data type, one or multiple M expressions are created reflecting your choices. These M expressions are automatically written to dedicated M documents and, if saved, are stored within the Power BI Desktop file as Queries. M is a functional programming language like F#, and it's important that Power BI developers become familiar with analyzing, and later writing and enhancing, the M code that supports their queries.

Getting ready

Build a query through the user interface that connects to the AdventureWorksDW2016CTP3 SQL Server database on the ATLAS server and retrieves the DimGeography table, filtered by United States for English:

Click on Get Data from the Home tab of the ribbon, select SQL Server from the list of database sources, and provide the server and database names. For the Data Connectivity mode, select Import. A navigation window will appear, with the different objects and schemas of the database. Select the DimGeography table from the Navigation window and click on Edit.

In the Query Editor window, select the EnglishCountryRegionName column and then filter on United States from its dropdown.

Figure 2: Filtering for United States only in the Query Editor

At this point, a preview of the filtered table is exposed in the Query Editor and the Query Settings pane displays the previous steps.

Figure 3: The Query Settings pane in the Query Editor

How to do it

Formula Bar

With the Formula Bar visible in the Query Editor, click on the Source step under Applied Steps in the Query Settings pane.
You should see the following formula expression:

Figure 4: The SQL.Database() function created for the Source step

Click on the Navigation step to expose the following expression:

Figure 5: The metadata record created for the Navigation step

The navigation expression (2) references the source expression (1). The Formula Bar in the Query Editor displays individual query steps, which are technically individual M expressions. It's convenient, and very often essential, to view and edit all the expressions in a centralized window, and for this there's the Advanced Editor. M is a functional language, and it can be useful to think of query evaluation in M as similar to Excel spreadsheet formulas, in which multiple formulas can reference each other. The M engine can determine which expressions are required by the final expression to return, and evaluate only those expressions.

When configuring Power BI development tools, the display setting for both the Query Settings pane and the Formula bar should be enabled under GLOBAL | Query Editor options.

Figure 6: Global layout options for the Query Editor

Alternatively, on a per-file basis, you can control these settings and others from the View tab of the Query Editor toolbar.

Figure 7: Property settings of the View tab in the Query Editor

Advanced Editor window

Given its importance to the query development process, the Advanced Editor dialog is exposed on both the Home and View tabs of the Query Editor. It's recommended to use the Query Editor when getting started with a new query and when learning the M language. After several steps have been applied, use the Advanced Editor to review and optionally enhance or customize the M query. As M is a rich, functional programming language, there are many M functions and optional parameters not exposed via the Query Editor; going beyond the limits of the Query Editor enables more robust data retrieval and integration processes.

Figure 8: The Home tab of the Query Editor

Click on Advanced Editor from either the View or Home tabs (Figure 8 and Figure 9, respectively). All M function expressions and any comments are exposed.

Figure 9: The Advanced Editor view of the DimGeography query

When developing retrieval processes for Power BI models, consider these common ETL questions:

How are our queries impacting the source systems?

Can we make our retrieval queries more resilient to changes in source data, such that they avoid failure?

Is our retrieval process efficient and simple to follow and support, or are there unnecessary steps and queries?

Are our retrieval queries delivering sufficient performance to the BI application?

Is our process flexible, such that we can quickly apply changes to data sources and logic?

M queries are not intended as a substitute for the workloads typically handled by enterprise ETL tools such as SSIS or Informatica. However, just as BI professionals would carefully review the logic and test the performance of SQL stored procedures and ETL packages supporting their cubes and reports environment, they should also review the M queries created to support Power BI models and reports.

How it works

Two of the top performance and scalability features of M's engine are Query Folding and Lazy Evaluation. If possible, the M queries developed in Power BI Desktop are converted (folded) into SQL statements and passed to source systems for processing. M can also reduce the required resources for a given query by ignoring any unnecessary or redundant steps (variables). M is a case-sensitive language.
This includes referencing variables in M expressions (RenameColumns versus Renamecolumns) as well as the values in M queries. For example, the values "Apple" and "apple" are considered unique values in an M query; the Table.Distinct() function will not remove rows for one of the values. Variable names in M expressions cannot have spaces without a hash sign and double quotes. Per Figure 10, when the Query Editor graphical interface is used to create M queries, this syntax is applied automatically, along with a name describing the M transformation applied. Applying short, descriptive variable names (with no spaces) improves the readability of M queries.

Query folding

The query from this recipe was "folded" into the following SQL statement and sent to the ATLAS server for processing:

Figure 10: The SQL statement generated from the DimGeography M query

Right-click on the Filtered Rows step and select View Native Query to access the Native Query window shown in Figure 11:

Figure 11: View Native Query in Query Settings

Finding and revising queries that are not being folded to source systems is a top technique for enhancing large Power BI datasets. See the Pushing Query Processing Back to Source Systems recipe of Chapter 11, Enhancing and Optimizing Existing Power BI Solutions, for an example of this process.

M query structure

The great majority of queries created for Power BI will follow the let...in structure as per this recipe, as they contain multiple steps with dependencies among them. Individual expressions are separated by commas. The expression referred to following the in keyword is the expression returned by the query. The individual step expressions are technically "variables", and if the identifiers for these variables (the names of the query steps) contain spaces, then the step is placed in quotes and prefixed with a # sign, as per the Filtered Rows step in Figure 10.

Lazy evaluation

The M engine also has powerful "lazy evaluation" logic for ignoring any redundant or unnecessary variables, as well as short-circuiting evaluation (computation) once a result is determinate, such as when one side (operand) of an OR logical operator is computed as True. The order of evaluation of the expressions is determined at runtime; it doesn't have to be sequential from top to bottom. In the following example, a step for retrieving Canada was added and the step for the United States was ignored. Since the CanadaOnly variable satisfies the overall let expression of the query, only the Canada query is issued to the server, as if the United States row were commented out or didn't exist.

Figure 12: Revised query that ignores the Filtered Rows step to evaluate Canada only

View Native Query (Figure 12) is not available given this revision, but a SQL Profiler trace against the source database server (and a refresh of the M query) confirms that CanadaOnly was the only SQL query passed to the source database.
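The book shows the revised query only as a screenshot (Figure 12). As a rough sketch of what such a query might look like in the Advanced Editor, assuming the server, database, and step names from this recipe, the let...in structure and the ignored USOnly step could read as follows; this is an illustration, not the book's exact code:

let
    Source = Sql.Database("ATLAS", "AdventureWorksDW2016CTP3"),
    DimGeography = Source{[Schema = "dbo", Item = "DimGeography"]}[Data],
    // this step is never referenced below, so lazy evaluation ignores it
    USOnly = Table.SelectRows(DimGeography,
        each [EnglishCountryRegionName] = "United States"),
    CanadaOnly = Table.SelectRows(DimGeography,
        each [EnglishCountryRegionName] = "Canada")
in
    CanadaOnly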
Figure 13: Capturing the SQL statement passed to the server via a SQL Server Profiler trace

There's more

Partial query folding

A query can be "partially folded", in which a SQL statement is created resolving only part of an overall query. The results of this SQL statement are returned to Power BI Desktop (or the on-premises data gateway), and the remaining logic is computed using M's in-memory engine with local resources. M queries can be designed to maximize the use of source system resources by using standard expressions supported by query folding early in the query process; minimizing the use of local or on-premises data gateway resources is a top consideration.

Limitations of query folding

No folding will take place once a native SQL query has been passed to the source system, for example, by passing a SQL query directly through the Get Data dialog. The following query, specified in the Get Data dialog, is included in the Source step:

Figure 14: Providing a user-defined native SQL query

Any transformations applied after this native query will use local system resources. Therefore, the general implication for query development with native or user-defined SQL queries is that if they're used, try to include all required transformations (that is, joins and derived columns), or use them to utilize an important feature of the source database not being used by the folded query, such as an index. Also note the following:

Not all data sources support query folding, such as text and Excel files.

Not all transformations available in the Query Editor or via M functions directly are supported by some data sources.

The privacy levels defined for the data sources will also impact whether folding is used or not.

SQL statements are not parsed before they're sent to the source system.

The Table.Buffer() function can be used to avoid query folding: the table output of this function is loaded into local memory, and transformations against it will remain local.

We have discussed effective techniques for accessing and retrieving data using Microsoft Power BI. Do check out this book Microsoft Power BI Cookbook for more information on using Microsoft Power BI for data analysis and visualization. Expert Interview: Unlocking the secrets of Microsoft Power BI Tutorial: Building a Microsoft Power BI Data Model Expert Insights: Ride the third wave of BI with Microsoft Power BI
Getting started with Google Data Studio: An intuitive tool for visualizing BigQuery Data

Sugandha Lahoti
16 May 2018
8 min read
Google Data Studio is one of the most popular tools for visualizing data. It can be used to pull data directly out of Google's suite of marketing tools, including Google Analytics, Google AdWords, and Google Search Console. It also supports connectors for database tools such as PostgreSQL and BigQuery, and it can be accessed at datastudio.google.com. In this article, we will learn to visualize BigQuery data with Google Data Studio. [box type="note" align="" class="" width=""]This article is an excerpt from the book, Learning Google BigQuery, written by Thirukkumaran Haridass and Eric Brown. This book will serve as a comprehensive guide to mastering BigQuery, and utilizing it to get useful insights from your Big Data.[/box]

The following steps explain how to get started in Google Data Studio and access BigQuery data from Data Studio:

Setting up an account: Account setup is extremely easy for Data Studio. Any user with a Google account is eligible to use all Data Studio features for free.

Accessing BigQuery data: Once logged in, the next step is to connect to BigQuery. This can be done by clicking on the DATA SOURCES button on the left-hand-side navigation. You'll be prompted to create a data source by clicking on the large plus sign to the bottom-right of the screen. On the right-hand-side navigation, you'll get a list of all of the connectors available to you. Select BigQuery. At this point, you'll be prompted to select from your projects, shared projects, a custom query, or public datasets. Since you are querying the Google Analytics BigQuery Export test data, select Custom Query. Select the project you would like to use. In the Enter Custom Query prompt, add this query and click on the Connect button on the top right:

SELECT trafficsource.medium as Medium,
  COUNT(visitId) as Visits
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
GROUP BY Medium

This query will pull the count of sessions by traffic source medium for the Google Analytics account that has been exported. The next screen shows the schema of the data source you have created. Here, you can make changes to each field of your data, such as changing text fields to date fields or creating calculated metrics. Click on Create Report, then click on Add to Report. At this point, you will land on your report dashboard. Here, you can begin to create charts using the data you've just pulled from BigQuery. Icons for all the chart types available are shown near the top of the page. Hover over the chart types and click on the chart labeled Bar Chart; then, in the grid, hold your right-click button to draw a rectangle. A bar chart should appear, with the Traffic Source Medium and Visit data from the query you ran. A properties prompt should also show on the right-hand side of the page. Here, a number of properties can be selected for your chart, including the dimension, metric, and many style settings.

Once you've completed your first chart, more charts can be added to a single page to show other metrics if needed. For many situations, a single bar graph will answer the question at hand. Some situations may require more exploration, such as when an analyst wants to know whether the visit metric influences other metrics, such as the number of transactions. A scatterplot with visits on the x axis and transactions on the y axis can be used to easily visualize this relationship.
Making a scatterplot in Data Studio

The following steps show how to make a scatterplot in Data Studio with the data from BigQuery. Update the original query by adding the transaction metric:

In the edit screen of your report, click on the bar chart to bring up the chart options on the right-hand-side navigation. Click on the pencil icon next to the data source titled BigQuery to edit the data source. Click on the left-hand-side arrow icon titled Edit Connection.

In the dialog titled Enter Custom Query, add this query:

SELECT trafficsource.medium as Medium,
  COUNT(visitId) as Visits,
  SUM(totals.transactions) AS Transactions
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
GROUP BY Medium

Click on the button titled Reconnect in order to reprocess the query. A prompt should emerge, asking whether you'd like to add a new field titled Transactions. Click on Apply, then click on Done. Once you return to the report edit screen, click on the Scatter Chart button and use your mouse to draw a square in the report space. The report should autoselect the two metrics you've created. Click on the chart to bring up the chart edit screen on the right-hand-side navigation, then click on the Style tab. Click on the dropdown under the Trendline option and select Linear to add a linear trend line, also known as a linear regression line. The graph will default to blue, so use the pencil icon on the right to select red as the line color.

Making a map in Data Studio

Data Studio includes a map chart type that can be used to create simple maps. In order to create maps, a map dimension will need to be included in your data, along with a metric. Here, we will use the Google BigQuery public dataset for Medicare data. You'll need to create a new data source:

Accessing BigQuery data: Once logged in, the next step is to connect to BigQuery. This can be done by clicking on the DATA SOURCES button on the left-hand-side navigation. You'll be prompted to create a data source by clicking on the large plus sign to the bottom-right of the screen. On the right-hand-side navigation, you'll get a list of all of the connectors available to you. Select BigQuery. At this point, you'll be prompted to select from your projects, shared projects, a custom query, or public datasets. Since you will write your own SQL against the Medicare public dataset, select Custom Query. Select the project you would like to use. In the Enter Custom Query prompt, add this query and click on the Connect button on the top right:

SELECT CONCAT(provider_city, ", ", provider_state) city,
  AVG(average_estimated_submitted_charges) avg_sub_charges
FROM `bigquery-public-data.medicare.outpatient_charges_2014`
WHERE apc = '0267 - Level III Diagnostic and Screening Ultrasound'
GROUP BY 1
ORDER BY 2 DESC

This query will pull the average of submitted charges for diagnostic ultrasounds by city in the United States. This is the most submitted charge in the 2014 Medicare data. The next screen shows the schema of the data source you have created. Here, you can make changes to each field of your data, such as changing text fields to date fields or creating calculated metrics. Click on Create Report, then click on Add to Report. At this point, you will land on your report dashboard. Here, you can begin to create charts using the data you've just pulled from BigQuery. Icons for all the chart types available are shown near the top of the page.
Hover over the chart types and click on the chart labeled Map Chart; then, in the grid, hold your right-click button to draw a rectangle. Click on the chart to bring up the Dimension Picker on the right-hand-side navigation, and click on Create New Dimension. Right-click on the City dimension and select the Geo type and City subtype. Here, we can also choose other subtypes (Latitude, Longitude, Metro, Country, and so on). Data Studio will plot the top 500 rows of data (in this case, the top 500 cities in the results set). Hovering over each city brings up detailed data.

Data Studio can also be used to roll up geographic data. In this case, we'll roll city data up to state data. From the edit screen, click on the map to bring up the Dimension Picker and click on Create New Dimension in the right-hand-side navigation. Right-click on the City dimension and select the Geo type and Region subtype. Google uses the term Region to signify states. Once completed, the map will be rolled up to the state level instead of the city level. This functionality is very handy when data has not been rolled up prior to being inserted into BigQuery.

Other features of Data Studio

Filtering: Filtering can be added to your visualizations based on dimensions or metrics, as long as the data is available in the data source.

Data joins: Data from multiple sources can be joined to create new, calculated metrics.

Turnkey integrations with many Google Marketing Suite tools, such as AdWords and Search Console.

We explored various features of Google Data Studio and learnt to use them for visualizing BigQuery data. To know about other third-party tools for reporting and visualization, such as R and Tableau, check out the book Learning Google BigQuery. Getting Started with Data Storytelling What is Seaborn and why should you use it for data visualization? Pandas is an effective tool to explore and analyze data - Interview Insights