
How-To Tutorials - Data

1204 Articles

Handling backup and recovery in PostgreSQL 10 [Tutorial]

Savia Lobo
18 Jun 2018
11 min read
Performing backups should be a regular task, and every administrator is supposed to keep an eye on this vital job. Fortunately, PostgreSQL provides an easy means to create backups. In this tutorial, you will learn how to back up data by performing simple dumps and how to recover it using PostgreSQL 10. This article is an excerpt taken from 'Mastering PostgreSQL 10', written by Hans-Jürgen Schönig. The book highlights the features newly introduced in PostgreSQL 10 and shows you how to build better PostgreSQL applications.

Performing simple dumps

If you are running a PostgreSQL setup, there are two major methods to perform backups:

- Logical dumps (extracting an SQL script that represents your data)
- Transaction log shipping

The idea behind transaction log shipping is to archive binary changes made to the database. Most people claim that transaction log shipping is the only real way to do backups. However, in my opinion, this is not necessarily true. Many people rely on pg_dump to simply extract a textual representation of the data. pg_dump is also the oldest method of creating a backup and has been around since the very early days of the project (transaction log shipping was added much later). Every PostgreSQL administrator will become familiar with pg_dump sooner or later, so it is important to know how it really works and what it does.

Running pg_dump

The first thing we want to do is create a simple textual dump:

    [hs@linuxpc ~]$ pg_dump test > /tmp/dump.sql

This is the most simplistic backup you can imagine. pg_dump logs into the local database instance, connects to the database test, and starts to extract all the data, which is sent to stdout and redirected to the file. The beauty is that standard output gives you all the flexibility of a Unix system. You can easily compress the data using a pipe or do whatever you want.

In some cases, you might want to run pg_dump as a different user. All PostgreSQL client programs support a consistent set of command-line parameters to pass user information. If you just want to set the user, use the -U flag:

    [hs@linuxpc ~]$ pg_dump -U whatever_powerful_user test > /tmp/dump.sql

The following set of parameters can be found in all PostgreSQL client programs:

    Connection options:
      -d, --dbname=DBNAME      database to dump
      -h, --host=HOSTNAME      database server host or socket directory
      -p, --port=PORT          database server port number
      -U, --username=NAME      connect as specified database user
      -w, --no-password        never prompt for password
      -W, --password           force password prompt (should happen automatically)
      --role=ROLENAME          do SET ROLE before dump

Just pass the information you want to pg_dump, and if you have enough permissions, PostgreSQL will fetch the data. The important thing here is to see how the program really works. Basically, pg_dump connects to the database and opens a large repeatable read transaction that simply reads all the data. Remember, repeatable read ensures that PostgreSQL creates a consistent snapshot of the data, which does not change throughout the transaction. In other words, a dump is always consistent: no foreign keys will be violated. The output is a snapshot of the data as it was when the dump started. Consistency is a key factor here. It also implies that changes made to the data while the dump is running won't make it into the backup anymore. A dump simply reads everything, so there are no separate permissions needed to be able to dump something. As long as you can read it, you can back it up.
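Since the dump is plain text on stdout, compression is just a pipe away. A minimal sketch, assuming gzip is available and the database test from above; the restored database name test_restored is hypothetical:

    # Compress the dump on the fly
    pg_dump test | gzip > /tmp/dump.sql.gz

    # Replay it later into a freshly created database
    createdb test_restored
    gunzip -c /tmp/dump.sql.gz | psql test_restored

Any filter that reads stdin works the same way, which is exactly the flexibility of standard output referred to above.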
Also, note that the backup is in a textual format by default. This means that you can safely extract data from, say, Solaris and move it to some other CPU architecture. In the case of binary copies, that is clearly not possible, as the on-disk format depends on your CPU architecture.

Passing passwords and connection information

If you take a close look at the connection parameters shown in the previous section, you will notice that there is no way to pass a password to pg_dump. You can enforce a password prompt, but you cannot pass the parameter to pg_dump using a command-line option. The reason is simple: the password might show up in the process table and be visible to other people, so this is not supported. The question now is: if pg_hba.conf on the server enforces a password, how can the client program provide it? There are various means of doing so:

- Making use of environment variables
- Making use of .pgpass
- Using service files

In this section, you will learn about all three methods.

Using environment variables

One way to pass all kinds of parameters is to use environment variables. If the information is not explicitly passed to pg_dump, it will look for the missing information in predefined environment variables. A list of all potential settings can be found at https://www.postgresql.org/docs/10/static/libpq-envars.html. The following overview shows some environment variables commonly needed for backups:

- PGHOST: tells the system which host to connect to
- PGPORT: defines the TCP port to be used
- PGUSER: tells a client program about the desired user
- PGPASSWORD: contains the password to be used
- PGDATABASE: the name of the database to connect to

The advantage of these environment variables is that the password won't show up in the process table. However, there is more. Consider the following example:

    psql -U ... -h ... -p ... -d ...

Suppose you are a system administrator: do you really want to type a long line like that a couple of times every day? If you are working with the very same host again and again, just set those environment variables and connect with plain SQL:

    [hs@linuxpc ~]$ export PGHOST=localhost
    [hs@linuxpc ~]$ export PGUSER=hs
    [hs@linuxpc ~]$ export PGPASSWORD=abc
    [hs@linuxpc ~]$ export PGPORT=5432
    [hs@linuxpc ~]$ export PGDATABASE=test
    [hs@linuxpc ~]$ psql
    psql (10.1)
    Type "help" for help.

As you can see, there are no command-line parameters anymore. Just type psql and you are in. All applications based on the standard PostgreSQL C-language client library (libpq) understand these environment variables, so you can use them not only for psql and pg_dump, but for many other applications as well.

Making use of .pgpass

A very common way to store login information is in a .pgpass file. The idea is simple: put a file called .pgpass into your home directory and put your login information there. The format is simple:

    hostname:port:database:username:password

An example would be:

    192.168.0.45:5432:mydb:xy:abc

PostgreSQL offers some nice additional functionality: most fields can contain *. Here is an example:

    *:*:*:xy:abc

This means that on every host, on every port, for every database, the user called xy will use abc as the password. To make PostgreSQL use the .pgpass file, make sure that the right file permissions are in place:

    chmod 0600 ~/.pgpass

.pgpass can also be used on a Windows system. In this case, the file can be found at %APPDATA%\postgresql\pgpass.conf.
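Putting the pieces together, a nightly dump can be driven entirely by environment variables plus .pgpass. A minimal sketch, assuming a .pgpass entry already covers the target host; the paths and database name are hypothetical:

    #!/bin/sh
    # Nightly logical backup; the password is read from ~/.pgpass
    export PGHOST=192.168.0.45
    export PGPORT=5432
    export PGUSER=xy
    export PGDATABASE=mydb

    pg_dump | gzip > /backup/mydb-$(date +%F).sql.gz

Because no password appears on the command line or in the environment, nothing sensitive leaks into the process table.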
Using service files

However, the .pgpass file is not the only option: you can also make use of service files. Here is how it works. If you want to connect to the very same servers over and over again, you can create a .pg_service.conf file, which will hold all the connection information you need. Here is an example of a .pg_service.conf file:

    iMac:~ hs$ cat .pg_service.conf
    # a sample service
    [hansservice]
    host=localhost
    port=5432
    dbname=test
    user=hs
    password=abc

    [paulservice]
    host=192.168.0.45
    port=5432
    dbname=xyz
    user=paul
    password=cde

To connect to one of the services, just set the environment and connect:

    iMac:~ hs$ export PGSERVICE=hansservice

A connection can now be established without passing parameters to psql:

    iMac:~ hs$ psql
    psql (10.1)
    Type "help" for help.

    test=#

Alternatively, you can use:

    psql service=hansservice

Extracting subsets of data

Up to now, you have seen how to dump an entire database. However, this may not be what you want. In many cases, you might just want to extract a subset of tables or schemas. pg_dump can do that and provides a number of switches:

- -a: dumps only the data and skips the data structure
- -s: dumps only the data structure and skips the data
- -n: dumps only a certain schema
- -N: dumps everything but excludes certain schemas
- -t: dumps only certain tables
- -T: dumps everything but certain tables (this can make sense if you want to exclude logging tables and so on)

Partial dumps can be very useful for speeding things up considerably.

Handling various formats

So far, you have seen that pg_dump can be used to create text files. The problem is that a text file can only be replayed completely. If you have saved an entire database, you can only replay the entire thing. In many cases, this is not what you want. Therefore, PostgreSQL supports additional formats that also offer more functionality. At this point, four formats are supported:

    -F, --format=c|d|t|p    output file format (custom, directory, tar, plain text (default))

You have already seen plain, which is just normal text. On top of that, you can use the custom format. The idea behind the custom format is to have a compressed dump, including a table of contents. Here are two ways to create a custom format dump:

    [hs@linuxpc ~]$ pg_dump -Fc test > /tmp/dump.fc
    [hs@linuxpc ~]$ pg_dump -Fc test -f /tmp/dump.fc

In addition to the table of contents, the compressed dump has one more advantage: it is a lot smaller. The rule of thumb is that a custom format dump is around 90% smaller than the database instance you are about to back up. Of course, this depends heavily on the number of indexes and other factors, but for many database applications, this rough estimate will hold true. Once you have created the backup, you can inspect the backup file:

    [hs@linuxpc ~]$ pg_restore --list /tmp/dump.fc
    ;
    ; Archive created at 2017-11-04 15:44:56 CET
    ;     dbname: test
    ;     TOC Entries: 18
    ;     Compression: -1
    ;     Dump Version: 1.12-0
    ;     Format: CUSTOM
    ;     Integer: 4 bytes
    ;     Offset: 8 bytes
    ;     Dumped from database version: 10.1
    ;     Dumped by pg_dump version: 10.1
    ;
    ; Selected TOC Entries:
    ;
    3103; 1262 16384 DATABASE - test hs
    3; 2615 2200 SCHEMA - public hs
    3104; 0 0 COMMENT - SCHEMA public hs
    1; 3079 13350 EXTENSION - plpgsql
    3105; 0 0 COMMENT - EXTENSION plpgsql
    187; 1259 16391 TABLE public t_test hs
    ...

pg_restore --list returns the table of contents of the backup.
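The table of contents is what makes partial replay possible. A minimal sketch, assuming the dump created above; the target database name test_restored is hypothetical:

    # Restore only the t_test table from the custom format dump
    pg_restore -t t_test -d test_restored /tmp/dump.fc

    # Or restore just the schema definitions, skipping the data
    pg_restore -s -d test_restored /tmp/dump.fc

This is precisely what a plain text dump cannot do: with -Fc, you pick entries from the table of contents instead of replaying the whole file.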
Using the custom format is a good idea, as the backup will shrink in size. However, there is more: the -Fd option will create a backup in the directory format. Instead of a single file, you will now get a directory containing a couple of files:

    [hs@linuxpc ~]$ mkdir /tmp/backup
    [hs@linuxpc ~]$ pg_dump -Fd test -f /tmp/backup/
    [hs@linuxpc ~]$ cd /tmp/backup/
    [hs@linuxpc backup]$ ls -lh
    total 86M
    -rw-rw-r--. 1 hs hs  85M Jan  4 15:54 3095.dat.gz
    -rw-rw-r--. 1 hs hs  107 Jan  4 15:54 3096.dat.gz
    -rw-rw-r--. 1 hs hs 740K Jan  4 15:54 3097.dat.gz
    -rw-rw-r--. 1 hs hs   39 Jan  4 15:54 3098.dat.gz
    -rw-rw-r--. 1 hs hs 4.3K Jan  4 15:54 toc.dat

One advantage of the directory format is that you can use more than one core to perform the backup. In the case of the plain or custom format, only one process is used by pg_dump. The directory format changes that rule. The following example shows how you can tell pg_dump to use four cores (jobs):

    [hs@linuxpc backup]$ rm -rf *
    [hs@linuxpc backup]$ pg_dump -Fd test -f /tmp/backup/ -j 4

Note that the more objects you have in your database, the greater the potential speedup.
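The same parallelism is available on the way back in. A minimal sketch, assuming the directory dump above; the target database test_restored is hypothetical:

    # Replay the directory format dump using four parallel jobs
    createdb test_restored
    pg_restore -j 4 -d test_restored /tmp/backup/

As with the dump itself, the benefit grows with the number of independent objects that can be processed concurrently.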
To summarize, you learned about creating backups in general. If you've enjoyed reading this post, do check out 'Mastering PostgreSQL 10' to learn how to replay backups and handle global data in PostgreSQL 10. You will also learn how to use PostgreSQL's onboard tools to replicate instances.

Read next:

- PostgreSQL 11 Beta 1 is out!
- How to perform data partitioning in PostgreSQL 10
- How to implement Dynamic SQL in PostgreSQL 10


What are data professionals planning to learn this year? Python, deep learning, yes. But also...

Amey Varangaonkar
14 Jun 2018
4 min read
One thing that every data professional absolutely dreads is the day their skills are no longer relevant in the market. In an ever-changing tech landscape, one must constantly be on the lookout for the most relevant, industrially accepted tools and frameworks. This applies everywhere, from application and web developers to cybersecurity professionals. Data professionals are no exception, as new ways to extract actionable insights from raw data are being found almost every day.

Gone are the days when data pros stuck to a single language and framework to work with their data. Frameworks are more flexible now, with multiple dependencies across various tools and languages. Not just that: new domains are being identified where these frameworks can be applied, and how they can be applied varies massively as well. A whole new arena of possibilities has opened up, and with it, new sets of skills and toolkits for working in these domains have been unlocked.

What's the next big thing for data professionals?

We recently polled thousands of data professionals as part of our Skill-Up program and got some very interesting insights into what they think the future of data science looks like. We asked them what they were planning to learn in the next 12 months. The following word cloud is the result of their responses, weighted by the frequency of the tools they chose:

[Word cloud: What data professionals are planning on learning in the next 12 months]

Unsurprisingly, Python comes out on top as the language many data pros want to learn in the coming months. With its general-purpose nature and innumerable applications across various use cases, Python's skyrocketing popularity is the reason everybody wants to learn it.

Machine learning and AI are finding significant applications in the web development domain today. They are revolutionizing the customer's digital experience through conversational UIs, or chatbots. Not just that: smart machine learning algorithms are being used to personalize websites and their UX. With all these reasons, who wouldn't want to learn JavaScript as an important tool in their data science toolkit? Add to that the trending web dev framework Angular, and you have all the tools to build smart, responsive front-end web applications.

We also saw data professionals taking an active interest in the mobile and cloud domains. They aim to learn Kotlin and combine its power with data science tools for developing smarter and more intelligent Android apps. When it comes to the cloud, Microsoft's Azure platform has introduced many built-in machine learning capabilities, as well as a workbench for data scientists to develop effective, enterprise-grade models. Data professionals also prefer Docker containers to run their applications seamlessly, so interest in learning it is high.

Has machine learning with JavaScript caught your interest? Don't worry, we've got you covered: check out Hands-on Machine Learning with JavaScript for practical, hands-on coverage of the essential machine learning concepts using the leading web development language.

With crypto's popularity through the roof (sadly, we can't say the same about Bitcoin's price), data pros see blockchain as a valuable skill. Building secure, decentralized apps is on the agenda for many, perhaps. Cloud, big data, and artificial intelligence are some of the other domains that data pros find interesting and feel are worth skilling up in.
Work-related skills that data pros want to learn

We also asked the data professionals what skills they wanted to learn in the near future that could help them do their daily jobs more effectively. The following word cloud of their responses paints a pretty clear picture:

[Word cloud: Valuable skills data professionals want to learn for their everyday work]

As machine learning and AI go mainstream, so do their applications in mainstream domains, often resulting in complex problems. Well, there's deep learning, and specifically neural networks, to tackle these problems, and these are exactly the skills data pros want to master in order to excel at their work.

Data pros want to learn machine learning in Python. Do you? Here's a useful resource to get you started: check out Python Machine Learning, Second Edition today!

So, there it is! What are the tools, languages, or frameworks that you are planning to learn in the coming months? Do you agree with the results of the poll? Do let us know.

Read next:

- What are web developers' favorite front-end tools? Packt's Skill Up report reveals all
- Data cleaning is the worst part of data analysis, say data scientists
- 15 Useful Python Libraries to make your Data Science tasks Easier


Alarming ways governments are using surveillance tech to watch you

Neil Aitken
14 Jun 2018
12 min read
Mapquest, part of the Verizon company, is the second largest provider of mapping services in the world, after Google Maps. It provides advanced cartography services to companies like Snap and Papa John's Pizza. The company is about to release an app that users can install on their smartphones. The new application will record and transmit video images of what's happening in front of your vehicle as you travel. Data can be sent from any phone with a camera, using the most common of tools: a simple mobile data plan, for example. In exchange, you'll get live traffic updates, among other things.

Mapquest will use the video image data it gathers to provide more accurate and up-to-date maps to its partners. The real world is changing all the time: roads get added, and cities re-route traffic from time to time. The new AI-based technology Mapquest employs could well improve the reliability of driverless cars, which have to engage with this ever-changing landscape in a safe manner. No one disagrees with safety improvements.

Mapquest's solution is impressive technology. The fact that the company can use AI to interpret the images it sees and upload the information it receives to update maps is incredible. And, in this regard, the company is just one of the myriad daily news stories which excite and astound us. These stories do, however, often have another side to them which is rarely acknowledged. In the wrong hands, Mapquest's solution could create a surveillance database which tracked people in real time.

Surveillance technology involves the use of data and information products to capture details about individuals. The act of surveillance is usually undertaken with a view to achieving a goal. The principle is simple: the more 'they' know about you, the easier it will be to influence you towards their ends. Surveillance information can be used to find you, apprehend you or, potentially, to change your mind without you even realising that you had been watched. Mapquest's innovation is just a single example of surveillance technology in government hands, which has expanded in capability far beyond what most people realise.

Read also: What does the US government know about you? The truth beyond the Facebook scandal

Facebook's share price fell 14% in early 2018 as a result of public outcry related to the Cambridge Analytica announcements the company made. The idea that a private company had allowed detailed information about individuals to be provided to a third party without their consent appeared to genuinely shock and appall people. Technology tools like Mapquest's tracking capabilities and Facebook's profiling techniques are being taken up and used by police forces and corporate entities around the world.

The reality of current private and public surveillance capabilities is that facilities exist, and are in use, to collect and analyse data on most people in the developed world. The known limits of these services may surprise even those who are on the cutting edge of technology. There are so many examples from all over the world listed below that they will genuinely make you want to consider going off grid!

Innovative, ingenious overlords: US companies have a flair for surveillance

The US is the centre for information-based technology companies. Much of what they develop is exported as well as used domestically.
The police are using human genome matching to track down criminals and can find 'any family in the country'

There have been two recent examples of police arresting a suspect after using human genome databases to investigate crimes. A growing number of private individuals have now used publicly available services such as 23andMe to sequence their genome (DNA), either to investigate their family tree further or to determine a potential genetic predisposition to a disease. In one example, the Golden State Killer, an ex-cop, was arrested 32 years after the last reported rape in a series of 45 (in addition to 12 murders) which occurred between 1976 and 1986. To track him down, police approached sites like 23andMe with DNA found at crime scenes, established a family match, and then progressed the investigation using conventional means.

More than 12 million Americans have now used a genetic sequencing service, and it is believed that investigators could find a family match for the DNA of anyone who has committed a crime in America. In simple terms, whether you want it or not, law enforcement effectively has the DNA of every individual in the country available to it.

Domain Awareness Centers (DAC) bring The Truman Show to life

The 400,000 residents of Oakland, California discovered in 2012 that they had been the subject of an undisclosed mass surveillance project by the local police force for many years. Feeds from CCTV cameras installed in Oakland's suburbs were augmented with weather information feeds, social media feeds, and extracted email conversations, as well as a variety of other sources. The scheme began at Oakland's port with federal funding, as part of a national response to the events of September 11, 2001, but was extended to cover the near half million residents of the city. Hundreds of additional video cameras were installed, along with gunshot recognition microphones and some of the other surveillance technologies described in this article. The police force conducting the surveillance had no policy on what information was recorded or for how long it was kept.

Internet-connected toys spy on children

The FBI has warned Americans that children's toys connected to the internet 'could put the privacy and safety of children at risk.' The children's toy Hello Barbie was specifically admonished for poor privacy controls as part of the FBI's press release. Internet-connected toys could be used to record video of children at any point in the day or, conceivably, to relay a human voice, making it appear to the child that the toy was talking to them.

Oracle suggests Google's Android operating system routinely tracks users' position, even when maps are turned off

In Australia, two American companies have been involved in a disagreement about the potential monitoring of Android phones. Oracle accused Google of monitoring users' location (including altitude), even when mapping software is turned off on the device. The tracking is performed in the background of the phone. In Australia alone, Oracle suggested that Google's monitoring could involve around 1 GB of additional mobile data every month, costing users nearly half a billion dollars a year, collectively.

Amazon facial recognition in real time helps US law enforcement services

Amazon is providing facial recognition services, which take feeds from public video cameras, to a number of US police forces.
Amazon can match images taken in real time to a database containing 'millions of faces.' Are there any state or federal rules in place to govern police facial recognition? Wired reported that there are 'more or less none.' Amazon's scheme is a trial taking place in Florida, and there are at least two other companies offering similar schemes to law enforcement services in the US.

Big glass microphone can help agencies keep an ear on the ground

Project 'Big Glass Microphone' uses the vibrations that the movement of cars (among other things) causes in buried fiber optic telecommunications links. A successful test of the technology has been undertaken on the fiber optic cables which run underground on the Stanford University campus, to record vehicle movements. Fiber optic links now make up the backbone of much data transport infrastructure: the way your phone and computer connect to the internet. Big Glass Microphone, as it stands, is the first step towards 'invisible' monitoring of people and their assets.

It appears the FBI now has the ability to crack/access any phone

Those in the know suggest that Apple's iPhone is the most secure smart device against government surveillance. In 2016, this was put to the test. The Justice Department came into possession of an iPhone allegedly belonging to one of the San Bernardino shooters and ultimately sued Apple in an attempt to force the company to grant access to it as part of the investigation. The case was ultimately dropped, leading some to speculate that NAND mirroring techniques were used to gain access to the phone without Apple's assistance, implying that even the most secure phones can now be accessed by the authorities.

Cornell University's lie-detecting algorithm

Groundbreaking work by Cornell University will provide 'at a distance' access to information that previously required close personal access to an accused subject. Cornell's solution interprets feeds from a number of video cameras trained on subjects and analyses the results to judge their heart rate. The researchers believe the system can be used to determine whether someone is lying from behind a screen.

The University of Southern California can anticipate social unrest with social media feeds

Researchers at the University of Southern California have developed an AI tool to study social media posts and determine whether those writing them are likely to cause social unrest. The software claims to have identified an association between both the volume of tweets written and the content of those tweets, and protests turning physical. The researchers can now offer advice to law enforcement on the likelihood of a protest turning violent, so that forces can be properly prepared.

The UK, an epicenter of AI progress, is not far behind in tracking people

The UK has a similarly impressive array of tools at its disposal to watch the people that representatives of the country feel need watching. Given the close cooperation between the UK and US governments, it is likely that many of these UK facilities are shared with the US and other NATO partners.

Project Stingray: fake cell phone/mobile phone 'towers' to intercept communications

Stingray is a brand name for an IMSI (the unique identifier on a SIM card) tracker. These devices 'spoof' real towers, presenting themselves as the closest mobile phone tower, which 'fools' phones into connecting to them. The technology has been used to spy on criminals in the UK, but it is not just the UK government which uses Stingray or its equivalents.
The Washington Post reported in June 2018 that a number of domestically compiled intelligence reports suggest that foreign governments acting on US soil, including China and Russia, have been eavesdropping on the White House using the same technology.

UK-developed spyware is being used by authoritarian regimes

Gamma International is a company based in Hampshire, UK, which provided the (notably authoritarian) Egyptian government with a facility to install what was effectively spyware, delivered with a virus, onto computers in the country. Once installed, the software permitted the government to monitor private digital interactions without needing to engage the phone company or ISP offering those services. Any internet-based activity could be tracked, assisting in tracking down individuals who might have negative feelings about the Egyptian government.

Individual arrested when his fingerprint was taken from a WhatsApp picture of his hand

A drug dealer was pictured holding an assortment of pills in the UK two months ago. The image of his hand was used to extract an image of his fingerprint. From that, forensic scientists used by UK police confirmed that officers had arrested the correct person and associated him with the drugs.

AI solutions to speed up evidence processing, including scanning laptops and phones

UK police forces are trying out AI software to speed up the processing of evidence from digital devices. A dozen departments around the UK are using software called Cellebrite, which employs AI algorithms to search through data found on devices, including phones and laptops. Cellebrite can recognize images that contain child abuse, accept feeds from multiple devices to see when multiple owners were in the same physical location at the same time, and read text from screenshots. Officers can even feed it photos of suspects to see if a picture of them shows up on someone's hard drive.

China takes the surveillance biscuit and may show us a glimpse of the future

There are 600 million mobile phone users in China, each producing a great deal of information. China has a notorious record of human rights abuses, and the ruling Communist Party takes a controlling interest (a board seat) in many of the country's largest technology companies, to ensure the work done is in the interest of the party as well as profitable for the corporation. As a result, China is on the front foot when it comes to both AI and surveillance technology. China's surveillance tools could be a harbinger of the future in the Western world.

Chinese cities will be run by a private company

Alibaba, China's equivalent of Amazon, already has control over the traffic lights in one Chinese city, Hangzhou. Alibaba is far from shy about its ambitions. It has 120,000 developers working on the problem and intends to commercialise and sell the data it gathers about citizens. The AI-based product it is using is called City Brain. In the future, all Chinese cities could well be run by AI from the Alibaba corporation: the idea is to use this trial as a template for every city, and the technology is likely to be deployed in Kuala Lumpur next. In the areas under City Brain's control, traffic speeds have already increased by 15%. However, some of those observing the situation have expressed concerns, not just about the (lack of) oversight of City Brain's current capabilities, but also about the potential for future abuse.

What to make of this incredible list of surveillance capabilities

Facilities like Mapquest's new mapping service are beguiling.
They’re clever ideas which create a better works. Similar technology, however, behind the scenes, is being adopted by law enforcement bodies in an ever growing list of countries. Even for someone who understands cutting edge technology, the sum of those facilities may be surprising. Literally any aspect of your behaviour, from the way you walk, to your face, your heatmap and, of course, the contents of your phone and laptops can now be monitored. Law enforcement can access and review information feeds with Artificial Intelligence software, to process and summarise findings quickly. In some cases, this is being done without the need for a warrant. Concerningly, these advances seem to be coming without policy or, in many cases any form of oversight. We must change how we think about AI, urge AI founding fathers  


5 ways to create a connection to the Qlik Engine [Tip]

Amey Varangaonkar
13 Jun 2018
8 min read
With mashups or web apps, the Qlik Engine sits outside of your project and is not accessible or loaded by default. The first step, before doing anything else, is to create a connection to the Qlik Engine, after which you can open a session and perform further actions on an app, such as:

- Opening a document/app
- Making selections
- Retrieving visualizations and apps

To use the Qlik Engine API, you open a WebSocket to the engine. There may be a difference in the way you do this, depending on whether you are working with Qlik Sense Enterprise or Qlik Sense Desktop. In this article, we will elaborate on how you can achieve a connection to the Qlik Engine and the benefits of doing so. The following excerpt has been taken from the book Mastering Qlik Sense, authored by Martin Mahler and Juan Ignacio Vitantonio.

Creating a connection

To create a connection using WebSockets, you first need to establish a new WebSocket communication line. To open a WebSocket to the engine, use one of the following URIs:

- Qlik Sense Enterprise: wss://server.domain.com:4747/app/ or wss://server.domain.com[/virtual proxy]/app/
- Qlik Sense Desktop: ws://localhost:4848/app

Creating a connection using WebSockets

In the case of Qlik Sense Desktop, all you need to do is define a WebSocket variable, including its connection string, in the following way:

    var ws = new WebSocket("ws://localhost:4848/app/");

Once the connection is opened (the ws.onopen event fires), you can send requests to the engine using ws.send(). The following example retrieves the number of available documents in my Qlik Sense Desktop environment and appends them to an HTML list:

    <html>
    <body>
      <ul id='docList'>
      </ul>
    </body>
    </html>
    <script>
    var ws = new WebSocket("ws://localhost:4848/app/");
    var request = {
      "handle": -1,
      "method": "GetDocList",
      "params": {},
      "outKey": -1,
      "id": 2
    };

    ws.onopen = function(event) {
      ws.send(JSON.stringify(request));
      // Receive the response
      ws.onmessage = function(event) {
        var response = JSON.parse(event.data);
        if (response.method != 'OnConnected') {
          var docList = response.result.qDocList;
          var list = '';
          docList.forEach(function(doc) {
            list += '<li>' + doc.qDocName + '</li>';
          });
          document.getElementById('docList').innerHTML = list;
        }
      };
    };
    </script>

The preceding example will produce a list of your documents in the browser if you have Qlik Sense Desktop running in the background. All Engine methods and calls can be tested in a user-friendly way by exploring the Qlik Engine in the Dev Hub.

A single WebSocket connection can be associated with only one engine session (consisting of the app context plus the user). If you need to work with multiple apps, you must open a separate WebSocket for each one. If you wish to create a WebSocket connection directly to an app, you can extend the configuration URL to include the application name or, in the case of Qlik Sense Enterprise, the GUID. You can then use methods from the app class, and any other classes, as you continue to work with objects within the app:

    var ws = new WebSocket("ws://localhost:4848/app/MasteringQlikSense.qvf");
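Once a socket is open, every interaction follows the same JSON-RPC pattern as the GetDocList request above. A minimal sketch, assuming Qlik Sense Desktop and the app file name from the example; OpenDoc is a documented global Engine API method, and the id value here is illustrative:

    var ws = new WebSocket("ws://localhost:4848/app/");

    ws.onopen = function () {
      // Ask the global object (handle -1) to open a specific app
      ws.send(JSON.stringify({
        "handle": -1,
        "method": "OpenDoc",
        "params": ["MasteringQlikSense.qvf"],
        "id": 3
      }));
    };

    ws.onmessage = function (event) {
      // The reply to id 3 carries a qHandle for the opened app, which
      // subsequent requests use as their "handle" value
      console.log(JSON.parse(event.data));
    };

The returned handle is what ties follow-up calls (selections, object retrieval, and so on) to the opened app.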
Creating a connection to the Qlik Server Engine

Connecting to the engine in a Qlik Sense Enterprise environment is a little different, as you will need to take care of authentication first. Authentication is handled in different ways, depending on how you have set up your server configuration, with the most common options being:

- Ticketing
- Certificates
- Header authentication

Authentication also depends on where the code that is interacting with the Qlik Engine is running:

- If you are running the code from a trusted computer, you can use certificates, which first need to be exported via the QMC
- If the code is running in a web browser, or certificates are not available, then you must authenticate via the virtual proxy of the server

Creating a connection using certificates

Certificates can be considered a seal of trust which allows you to communicate with the Qlik Engine directly, with full permissions. As such, only backend solutions should ever have access to certificates, and you should guard how you distribute them carefully. To connect using certificates, you first need to export them via the QMC, which is a relatively easy thing to do. Once they are exported, you need to copy them to the folder where your project is located and reference them as in the following code (this runs in a backend, Node.js-style environment, where the fs and path modules and WebSocket options are available):

    var certPath = path.join('C:', 'ProgramData', 'Qlik', 'Sense', 'Repository',
                             'Exported Certificates', '.Local Certificates');
    var certificates = {
      cert: fs.readFileSync(path.resolve(certPath, 'client.pem')),
      key: fs.readFileSync(path.resolve(certPath, 'client_key.pem')),
      root: fs.readFileSync(path.resolve(certPath, 'root.pem'))
    };

    // Open a WebSocket using the engine port (rather than going through the proxy)
    var ws = new WebSocket('wss://server.domain.com:4747/app/', {
      ca: certificates.root,
      cert: certificates.cert,
      key: certificates.key,
      headers: {
        'X-Qlik-User': 'UserDirectory=internal; UserId=sa_engine'
      }
    });

    ws.onopen = function (event) {
      // Call your methods
    };

Creating a connection using the Mashup API

Now, while connecting to the engine is a fundamental step to start interacting with Qlik, connecting via WebSockets is very low-level. For more advanced use cases, the Mashup API is one way to get up to speed with a more developer-friendly abstraction layer. The Mashup API utilizes the qlik interface as an external interface to Qlik Sense, used for mashups and for including Qlik Sense objects in external web pages. To load the qlik module, you first need to ensure RequireJS is available in your main project file. You will then have to specify the URL of your Qlik Sense environment, as well as the prefix of the virtual proxy, if there is one:
":" + config.port : "" ) + config.prefix + "resources" }); require(["js/qlik"], function (qlik) { qlik.setOnError( function (error) { console.log(error); }); //Open an App var app = qlik.openApp('MasteringQlikSense.qvf', config); </script> Once you have created the connection to an app, you can start leveraging the full API by conveniently creating HyperCubes, connecting to fields, passing selections, retrieving objects, and much more. The Mashup API is intended for browser-based projects where authentication is handled in the same way as if you were going to open Qlik Sense. If you wish to use the Mashup API, or some parts of it, with a backend solution, you need to take care of authentication first. Creating a connection using enigma.js Enigma is Qlik's open-source promise wrapper for the engine. You can use enigma directly when you're in the Mashup API, or you can load it as a separate module. When you are writing code from within the Mashup API, you can retrieve the correct schema directly from the list of available modules which are loaded together with qlik.js via 'autogenerated/qix/engine-api'.   The following example will connect to a Demo App using enigma.js: define(function () { return function () { require(['qlik','enigma','autogenerated/qix/engine-api'], function (qlik, enigma, schema) { //The base config with all details filled in var config = { schema: schema, appId: "My Demo App.qvf", session:{ host:"localhost", port: 4848, prefix: "", unsecure: true, }, } //Now that we have a config, use that to connect to the //QIX service. enigma.getService("qix" , config).then(function(qlik){ qlik.global.openApp(config.appId) //Open App qlik.global.openApp(config.appId).then(function(app){ //Create SessionObject for FieldList app.createSessionObject( { qFieldListDef: { qShowSystem: false, qShowHidden: false, qShowSrcTables: true, qShowSemantic: true, qShowDerivedFields: true }, qInfo: { qId: "FieldList", qType: "FieldList" } } ).then( function(list) { return list.getLayout(); } ).then( function(listLayout) { return listLayout.qFieldList.qItems; } ).then( function(fieldItems) { console.log(fieldItems) } ); }) } })}}) It's essential to also load the correct schema whenever you load enigma.js. The schema is a collection of the available API methods that can be utilized in each version of Qlik Sense. This means your schema needs to be in sync with your QS version. Thus, we see it is fairly easy to create a stable connection with the Qlik Engine API. If you liked the above excerpt, make sure you check out the book Mastering Qlik Sense to learn more tips and tricks on working with different kinds of data using Qlik Sense and extract useful business insights. How Qlik Sense is driving self-service Business Intelligence Overview of a Qlik Sense® Application’s Life Cycle What we learned from Qlik Qonnections 2018


How to prevent errors while using utilities for loading data in Teradata

Pravin Dhandre
11 Jun 2018
9 min read
In today's tutorial, we will help you overcome the errors that arise while loading, deleting, or updating large volumes of data using Teradata utilities. This article is an excerpt from the Teradata Cookbook, co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati. The book provides recipes to simplify the daily tasks performed by database administrators (DBAs), along with efficient data warehousing solutions for the Teradata database system.

Resolving FastLoad error 2652

When data is being loaded via FastLoad, a table lock is placed on the target table. This means that the table is unavailable for any other operation. A lock on a table is only released when FastLoad encounters the END LOADING command, which terminates phase 2, the so-called application phase. FastLoad may get terminated in phase 1 for any of the following reasons:

- The load script results in failure (error code 8 or 12)
- The load script is aborted by an admin or some other session
- FastLoad fails due to a bad record or file
- The END LOADING statement was forgotten in the script

If so, it keeps a lock on the table, which needs to be released manually. In this recipe, we will see the steps to release FastLoad locks.

Getting ready

Identify the table on which FastLoad has ended prematurely and which is in a locked state. You need valid credentials for the Teradata database. Execute the dummy FastLoad script from the same user, or a user which has write access to the locked table. A user requires the following privileges in order to execute FastLoad:

- SELECT and INSERT (plus CREATE and DROP, or DELETE) access to the target or loading table
- CREATE and DROP TABLE on the error tables
- SELECT, INSERT, UPDATE, and DELETE for the user PUBLIC on the restart log table (SYSADMIN.FASTLOG); there will be a row in the FASTLOG table for each FastLoad job that has not completed in the system

How to do it...

1. Open a text editor and create the following script:

    .LOGON 127.0.0.1/dbc,dbc;        /* valid system name and credentials for your system */
    DATABASE Database_Name;          /* database containing the locked table */
    BEGIN LOADING locked_table      /* table which is getting the 2652 error */
      ERRORFILES errortable_name, uv_tablename;  /* same error table names as in the failed script */
    END LOADING;                     /* ends phase 2 and releases the lock */
    .LOGOFF;

2. Save it as dummy_fl.txt.
3. Open the Windows Command Prompt and execute it using the fastload command:

    fastload < dummy_fl.txt

4. This dummy script, with no INSERT statement, should release the lock on the target table.
5. Execute a SELECT on the locked table to see if the lock has been released.

How it works...

As FastLoad is designed to work only on empty tables, it is necessary that the loading of the table finishes in one go. If the load script errors out prematurely in phase 2, without encountering the END LOADING command, it leaves a lock on the loading table. FastLoad locks can't be released via the HUT utility, as there is no technical lock on the table. To execute FastLoad, the following are required:

- Log table: FastLoad records its progress in the fastlog table
- Empty table: FastLoad needs the table to be empty before inserting rows into it
- Two error tables: FastLoad requires two error tables to be created; you just need to name them, and no DDL is required. The first error table records any translation or constraint violation errors, whereas the second error table captures errors related to duplicate values for Unique Primary Indexes (UPI). After FastLoad completes, you can analyze these error tables to see why records were rejected, as shown in the sketch below.
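The error tables are ordinary tables you can query. A minimal sketch, assuming the error table names used in the script above; ErrorCode, ErrorFieldName, and DataParcel are the standard columns of the first FastLoad error table:

    /* Summarize why rows were rejected during the acquisition phase */
    SELECT ErrorCode, ErrorFieldName, COUNT(*) AS rejected_rows
    FROM errortable_name
    GROUP BY 1, 2
    ORDER BY 3 DESC;

A similar SELECT against uv_tablename shows the rows rejected for duplicate UPI values.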
There's more...

If this does not fix the issue, you need to drop the target table and the error tables associated with it. Before dropping any tables, check with the administrator to abort any FastLoad sessions associated with the table.

Resolving MLOAD error 2571

MLOAD works in five phases, unlike FastLoad, which works in only two, and it can fail in either phase three or phase four. The five phases are:

1. Preliminary: basic setup. Syntax checking, establishing sessions with the Teradata database, creation of error tables (two error tables per target table), and creation of work tables and the log table are done in this phase.
2. DML transaction phase: the request is parsed through the PE and a step plan is generated. Steps and DML are then sent to the AMPs and stored in the appropriate work tables for each target table. Input data will be stored in these work tables and applied to the target tables later on.
3. Acquisition phase: unsorted data is sent to the AMPs in blocks of 64 KB. Rows are hashed by PI and sent to the appropriate AMPs. The utility places locks on the target tables in preparation for the application phase.
4. Application phase: changes are applied to the target tables and NUSI subtables. The lock on the table is held during this phase.
5. Cleanup phase: if the error code of all steps is 0, MLOAD completes successfully and releases all locks on the specified tables. In that case, all empty error tables, work tables, and the log table are dropped.

Getting ready

Identify the table which is affected by error 2571. Make sure no host utility is running on this table and that the load job is in a failed state for this table.

How to do it...

1. Check Viewpoint for any active utility job for this table. If you find an active job, let it complete.
2. If there is a reason you need to release the lock anyway, first abort all of the host utility's sessions from Viewpoint. Ask your administrator to do it.
3. Execute the following command:

    RELEASE MLOAD <databasename.tablename>;

4. If you get a "not able to release MLOAD lock" error, execute the following command:

    /* Release a lock held in the application phase */
    RELEASE MLOAD <databasename.tablename> IN APPLY;

5. Once the locks are released, you need to drop all the associated error tables, the log table, and the work tables (see the sketch after this section).
6. Re-execute MLOAD after correcting the error.

How it works...

The MLOAD utility places locks in the table headers to alert other utilities that a MultiLoad session is in progress for the table:

- Acquisition lock: DML allows all; DDL allows DROP only
- Application lock: DML allows SELECT with ACCESS only; DDL allows DROP only

There's more...

If the RELEASE statement still gives an error and does not release the lock on the table, you need to use SELECT with an ACCESS lock to copy the contents of the locked table to a new one, and then drop the locked table. If you receive error 7446, "MLoad table %ID cannot be released because NUSI exists", you need to drop all the NUSIs on the table and use ALTER TABLE to set the table to no fallback to accomplish the task.
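For step 5, the utility tables can be dropped with ordinary DDL. A minimal sketch, assuming the conventional ET_/UV_/WT_ prefixes for the support tables; the log table name is whatever your job's .LOGTABLE statement specified, and your job may have named all of these differently:

    /* Drop the MultiLoad support tables after releasing the lock */
    DROP TABLE databasename.ET_tablename;   /* acquisition-phase error table */
    DROP TABLE databasename.UV_tablename;   /* application-phase error table */
    DROP TABLE databasename.WT_tablename;   /* work table */
    DROP TABLE databasename.mload_logtable; /* restart log table from .LOGTABLE */

Only after these are gone can the corrected MLOAD job be restarted cleanly.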
Resolving failure 7547

This error is associated with the UPDATE statement, which could be plain SQL or part of an MLOAD job. Often, while updating a set of rows in a table, the update fails with Failure 7547: Target row updated by multiple source rows. This happens when you update the target with multiple rows from the source, meaning duplicate values are present in the source table.

Getting ready

Let's create sample volatile tables and insert values into them. After that, we will execute an UPDATE that fails with error 7547. Create the target table with the following DDL and insert values into it (the excerpt omitted the column data types, so plausible ones are assumed here):

    /* TARGET TABLE */
    CREATE VOLATILE TABLE accounts (
      CUST_ID INT,
      CUST_NAME VARCHAR(25),
      Sal INT
    ) PRIMARY INDEX (CUST_ID)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO accounts VALUES (1,'will',2000);
    INSERT INTO accounts VALUES (2,'bekky',2800);
    INSERT INTO accounts VALUES (3,'himesh',4000);

Create the source table with the following DDL and insert values into it:

    /* SOURCE TABLE */
    CREATE VOLATILE TABLE Hr_payhike (
      CUST_ID INT,
      CUST_NAME VARCHAR(25),
      Sal_hike INT
    ) PRIMARY INDEX (CUST_ID)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO Hr_payhike VALUES (1,'will',2030);
    INSERT INTO Hr_payhike VALUES (1,'bekky',3800);
    INSERT INTO Hr_payhike VALUES (3,'himesh',7000);

Execute the MLOAD script. The following snippet is the update part, which will fail:

    /* Snippet from the MLOAD update */
    UPDATE ACC
    FROM accounts ACC, Hr_payhike SUPD
    SET Sal = SUPD.Sal_hike
    WHERE ACC.CUST_ID = SUPD.CUST_ID;

    Failure: Target row updated by multiple source rows

How to do it...

1. Check for duplicate values in the source table:

    /* Check for duplicate values in the source table */
    SELECT CUST_ID, COUNT(*)
    FROM Hr_payhike
    GROUP BY 1
    ORDER BY 2 DESC;

2. The output shows that CUST_ID = 1 has two rows, which is what causes the error. While updating the target table, the optimizer cannot tell which source row should drive the update. Whose salary should be applied, Will's or Bekky's?
3. To resolve the error, execute the following update query, which picks exactly one source row per CUST_ID:

    /* Update part of MLOAD, deduplicated */
    UPDATE ACC
    FROM accounts ACC,
    ( SELECT CUST_ID, CUST_NAME, Sal_hike
      FROM Hr_payhike
      QUALIFY ROW_NUMBER() OVER (PARTITION BY CUST_ID
                                 ORDER BY CUST_NAME, Sal_hike DESC) = 1
    ) SUPD
    SET Sal = SUPD.Sal_hike
    WHERE ACC.CUST_ID = SUPD.CUST_ID;

Now the update will run without error.

How it works...

The failure happens whenever you update the target with multiple rows from the source. If you defined a primary index column for your target, and those columns appear in the update condition, this error can occur. To resolve it another way, you can delete the duplicates from the source table itself and execute the original update without any modification (a sketch follows below). If the source data can't be changed, then you need to change the update statement as shown above.
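For the delete-at-source alternative, one row per CUST_ID can be kept directly in the source table. A minimal sketch, assuming the same Hr_payhike table and an arbitrary keep-rule matching the QUALIFY used above; note it assumes the duplicate rows differ in at least one column:

    /* Remove duplicate source rows, keeping one row per CUST_ID */
    DELETE FROM Hr_payhike
    WHERE (CUST_ID, CUST_NAME, Sal_hike) IN (
      SELECT CUST_ID, CUST_NAME, Sal_hike
      FROM Hr_payhike
      QUALIFY ROW_NUMBER() OVER (PARTITION BY CUST_ID
                                 ORDER BY CUST_NAME, Sal_hike DESC) > 1
    );

After this, the original unmodified UPDATE succeeds, because each target row now matches at most one source row.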
To summarize, we have successfully learned how to overcome, or prevent, errors while using utilities to load data into the database. You can also check out the Teradata Cookbook for more than 100 recipes on enterprise data warehousing solutions.

Read next:

- 2018 is the year of graph databases. Here's why.
- 6 reasons to choose MySQL 8 for designing database solutions
- Amazon Neptune, AWS' cloud graph database, is now generally available


3 ways to use Indexes in Teradata to improve database performance

Pravin Dhandre
11 Jun 2018
15 min read
In this tutorial, we will create solutions that use indexes to improve the query performance of the Teradata database management system. This article is an excerpt from a book co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati, titled Teradata Cookbook. The book will teach you to tackle problems related to efficient querying, stored procedure searching, and navigation techniques in a Teradata database.

Creating a partitioned primary index to improve performance

A PPI (partitioned primary index) is a type of index that enables users to set up databases that gain performance benefits from data locality, while retaining the benefits of scalability inherent in the hash architecture of the Teradata database. This is achieved by hashing rows to different virtual AMPs, as is done with a normal PI, but also by creating local partitions within each virtual AMP. We will see how a PPI improves the performance of a query.

Getting ready

You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL. This will be a non-partitioned table:

    /* NON-PPI TABLE DDL */
    CREATE VOLATILE TABLE EMP_SAL_NONPPI (
      id INT,
      Sal INT,
      dob DATE,
      o_total INT
    ) PRIMARY INDEX (id)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO EMP_SAL_NONPPI VALUES (1001,2500,'2017-09-01',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1002,5500,'2017-09-10',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1003,500,'2017-09-02',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1004,54500,'2017-09-05',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1005,900,'2017-09-23',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1006,8900,'2017-08-03',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1007,8200,'2017-08-21',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1008,6200,'2017-08-06',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1009,2300,'2017-08-12',890);
    INSERT INTO EMP_SAL_NONPPI VALUES (1010,9200,'2017-08-15',890);

Let's check the explain plan of the following query, which selects data based on the dob column:

    /* SELECT on the non-PPI table */
    SELECT * FROM EMP_SAL_NONPPI WHERE dob <= 2017-08-01;

As seen in the following explain plan, an all-rows scan can be costly in terms of CPU and I/O if the table has millions of rows:

    Explain SELECT * from EMP_SAL_NONPPI where dob <= 2017-08-01;
    1) First, we do an all-AMPs RETRIEVE step from DBC.EMP_SAL_NONPPI
       by way of an all-rows scan with a condition of
       ("DBC.EMP_SAL_NONPPI.dob <= DATE '1900-12-31'") into Spool 1
       (group_amps), which is built locally on the AMPs. The size of
       Spool 1 is estimated with no confidence to be 4 rows (148 bytes).
       The estimated time for this step is 0.04 seconds.
    2) Finally, we send out an END TRANSACTION step to all AMPs involved
       in processing the request.
    -> The contents of Spool 1 are sent back to the user as the result
       of statement 1. The total estimated time is 0.04 seconds.

Let's see how we can enable partition retrieval for the same query.
How to do it...

1. Connect to the Teradata database using SQLA or Studio.
2. Create the following table with the same data; this time we define a PPI on the dob column:

    /* Partitioned table */
    CREATE VOLATILE TABLE EMP_SAL_PPI (
      id INT,
      Sal INT,
      dob DATE,
      o_total INT
    ) PRIMARY INDEX (id)
    PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01'
                              AND DATE '2017-12-01'
                              EACH INTERVAL '1' DAY)
    ON COMMIT PRESERVE ROWS;

    INSERT INTO EMP_SAL_PPI VALUES (1001,2500,'2017-09-01',890);
    INSERT INTO EMP_SAL_PPI VALUES (1002,5500,'2017-09-10',890);
    INSERT INTO EMP_SAL_PPI VALUES (1003,500,'2017-09-02',890);
    INSERT INTO EMP_SAL_PPI VALUES (1004,54500,'2017-09-05',890);
    INSERT INTO EMP_SAL_PPI VALUES (1005,900,'2017-09-23',890);
    INSERT INTO EMP_SAL_PPI VALUES (1006,8900,'2017-08-03',890);
    INSERT INTO EMP_SAL_PPI VALUES (1007,8200,'2017-08-21',890);
    INSERT INTO EMP_SAL_PPI VALUES (1008,6200,'2017-08-06',890);
    INSERT INTO EMP_SAL_PPI VALUES (1009,2300,'2017-08-12',890);
    INSERT INTO EMP_SAL_PPI VALUES (1010,9200,'2017-08-15',890);

3. Let's execute the same query on the new partitioned table:

    /* SELECT on the PPI table */
    SELECT * FROM EMP_SAL_PPI WHERE dob <= 2017-08-01;

The explain plan shows that the data is accessed using only a single partition:

    /* EXPLAIN PLAN */
    1) First, we do an all-AMPs RETRIEVE step from a single partition of
       SYSDBA.EMP_SAL_PPI with a condition of
       ("SYSDBA.EMP_SAL_PPI.dob = DATE '2017-08-01'") with a residual
       condition of ("SYSDBA.EMP_SAL_PPI.dob = DATE '2017-08-01'") into
       Spool 1 (group_amps), which is built locally on the AMPs. The
       size of Spool 1 is estimated with no confidence to be 1 row
       (37 bytes). The estimated time for this step is 0.04 seconds.
    -> The contents of Spool 1 are sent back to the user as the result
       of statement 1. The total estimated time is 0.04 seconds.

How it works...

A partitioned PI helps improve the performance of a query by avoiding a full table scan. A PPI works the same as a primary index for data distribution, but additionally creates partitions according to the ranges or cases specified on the table. There are four types of PPI that can be created, shown below (a partition-inspection sketch follows the list):

Case partitioning:

    /* CASE partitioning */
    CREATE TABLE SALES_CASEPPI (
      ORDER_ID INTEGER,
      CUST_ID INTEGER,
      ORDER_DT DATE
    ) PRIMARY INDEX (ORDER_ID)
    PARTITION BY CASE_N (ORDER_ID < 101, ORDER_ID < 201,
                         ORDER_ID < 501, NO CASE, UNKNOWN);

Range-based partitioning:

    /* RANGE partitioned table */
    CREATE VOLATILE TABLE EMP_SAL_PPI (
      id INT,
      Sal INT,
      dob DATE,
      o_total INT
    ) PRIMARY INDEX (id)
    PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01'
                              AND DATE '2017-12-01'
                              EACH INTERVAL '1' DAY)
    ON COMMIT PRESERVE ROWS;

Multi-level partitioning:

    CREATE TABLE SALES_MLPPI_TABLE (
      ORDER_ID INTEGER NOT NULL,
      CUST_ID INTEGER,
      ORDER_DT DATE
    ) PRIMARY INDEX (ORDER_ID)
    PARTITION BY (RANGE_N (ORDER_DT BETWEEN DATE '2017-08-01'
                               AND DATE '2017-12-31'
                               EACH INTERVAL '1' DAY),
                  CASE_N (ORDER_ID < 1001, ORDER_ID < 2001,
                          ORDER_ID < 3001, NO CASE, UNKNOWN));

Character-based partitioning:

    /* CHAR partitioning */
    CREATE TABLE SALES_CHAR_PPI (
      ORDR_ID INTEGER,
      EMP_NAME VARCHAR(30)
    ) PRIMARY INDEX (ORDR_ID)
    PARTITION BY CASE_N (EMP_NAME LIKE 'A%', EMP_NAME LIKE 'B%',
                         EMP_NAME LIKE 'C%', EMP_NAME LIKE 'D%',
                         EMP_NAME LIKE 'E%', EMP_NAME LIKE 'F%',
                         NO CASE, UNKNOWN);

A PPI not only helps improve query performance, but also helps with table maintenance.
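To see the partitioning at work, you can count rows per partition using Teradata's built-in PARTITION column. A minimal sketch against the EMP_SAL_PPI table created above:

    /* Rows per partition; only non-empty partitions are returned */
    SELECT PARTITION AS partition_no, COUNT(*) AS row_count
    FROM EMP_SAL_PPI
    GROUP BY 1
    ORDER BY 1;

A query constrained on dob touches only the partitions listed here, which is exactly what the single-partition RETRIEVE step in the explain plan reflects.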
There are, however, certain performance considerations to keep in mind when creating a PPI on a table:

Access via the primary index can become slower: if the partitioning column is not part of the primary index, a query that supplies only the PI value in its WHERE clause may have to probe every partition.
The partitioning column must be chosen carefully in order to gain maximum benefit.
Drop unneeded secondary indexes or value-ordered join indexes.

Creating a join index to improve performance

A join index is a data structure that contains data from one or more tables, with or without aggregation. In this recipe, we will see how join indexes help in improving the performance of queries.

Getting ready

You need to connect to the Teradata database using SQLA or Studio. Let's create a table and insert data into it using the following code:

CREATE TABLE td_cookbook.EMP_SAL
(
 id INT,
 DEPT varchar(25),
 emp_Fname varchar(25),
 emp_Lname varchar(25),
 emp_Mname varchar(25),
 status INT
)primary index(id);

INSERT into td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2);
INSERT into td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1);
INSERT into td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2);
INSERT into td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1);

Further, we will create a single-table join index whose primary index differs from that of the base table.

How to do it...

The following are the steps to create a join index to improve performance:

Connect to the Teradata database using SQLA or Studio.
Check the explain plan for the following query:

/*SELECT on base table*/
EXPLAIN SELECT id,dept,emp_Fname,emp_Lname,status from td_cookbook.EMP_SAL where id=4;

1) First, we do a single-AMP RETRIEVE step from td_cookbook.EMP_SAL by way of the primary index "td_cookbook.EMP_SAL.id = 4" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds.

Because the query has a WHERE clause on id, the system retrieves from the EMP_SAL table using the primary index of the base table, which is id. However, if a user queries the table on the column emp_Fname, an all-rows scan will occur, which degrades the performance of the query.

Now, we will create a JOIN INDEX using emp_Fname as the primary index:

/*Join Index*/
CREATE JOIN INDEX td_cookbook.EMP_JI AS
SELECT id,emp_Fname,emp_Lname,status,emp_Mname,dept
FROM td_cookbook.EMP_SAL
PRIMARY INDEX(emp_Fname);

Let's collect statistics on the join index:

/*Collect stats on JI*/
COLLECT STATS td_cookbook.EMP_JI COLUMN emp_Fname;

Now, we will check the explain plan of a query whose WHERE clause uses the column emp_Fname:

Explain sel id,dept,emp_Fname,emp_Lname,status from td_cookbook.EMP_SAL where emp_Fname='Anikta';

1) First, we do a single-AMP RETRIEVE step from td_cookbook.EMP_JI by way of the primary index "td_cookbook.EMP_JI.emp_Fname = 'Anikta'" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds.
-> The contents of Spool 1 are sent back to the user as the result of statement 1.
The total estimated time is 0.02 seconds.

In the EXPLAIN output, you can see that the optimizer uses the join index instead of the base table whenever the query filters on the emp_Fname column.

How it works...

Query performance improves any time a join index can be used instead of the base tables. A join index is most useful when its columns can satisfy, or cover, most or all of the requirements in a query. For example, the optimizer may consider using a covering index instead of performing a merge join. When all the queried columns can be satisfied by a join index, the query is said to be covered. Covering indexes improve the speed of join queries. The extent of improvement can be dramatic, especially for queries involving complex, large-table, and multiple-table joins, and depends on how often an index is appropriate to a query.

There are a few more join indexes that can be used in Teradata:

Aggregate-table join index: A type of join index which pre-joins and summarizes aggregated tables without requiring any physical summary tables. It refreshes automatically whenever the base table changes. Only COUNT and SUM are permitted, and DISTINCT is not permitted:

/*AGG JOIN INDEX*/
CREATE JOIN INDEX Agg_Join_Index AS
SELECT Cust_ID, Order_ID,
SUM(Sales_north) -- Aggregate column
FROM sales_table
GROUP BY 1,2
Primary Index(Cust_ID)

Use FLOAT as the data type for COUNT and SUM to avoid overflow.

Sparse join index: When a WHERE clause is applied in a JOIN INDEX, it is known as a sparse join index. By limiting the number of rows retrieved in a join, it reduces the size of the join index. It is also useful for UPDATE statements where the index is highly selective:

/*SP JOIN INDEX*/
CREATE JOIN INDEX Sparse_Join_Index AS
SELECT Cust_ID, Order_ID,
SUM(Sales_north) -- Aggregate column
FROM sales_table
where Order_id = 1 -- WHERE CLAUSE
GROUP BY 1,2
Primary Index(Cust_ID)

Creating a hash index to improve performance

Hash indexes, like join indexes (especially single-table join indexes), are designed to improve query performance, and in addition they enable you to avoid accessing the base table. The syntax for the hash index is as follows:

/*Hash index syntax*/
CREATE HASH INDEX <hash-index-name>
[, <fallback-option>]
(<column-name-list1>)
ON <base-table>
[BY (<partition-column-name-list2>)]
[ORDER BY <index-sort-spec>] ;

Getting ready

You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL:

/*Create table with data*/
CREATE TABLE td_cookbook.EMP_SAL
(
 id INT,
 DEPT varchar(25),
 emp_Fname varchar(25),
 emp_Lname varchar(25),
 emp_Mname varchar(25),
 status INT
)primary index(id);

INSERT into td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2);
INSERT into td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1);
INSERT into td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2);
INSERT into td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1);

How to do it...

You need to connect to the Teradata database using SQLA or Studio. Let's check the explain plan of the following query:

/*EXPLAIN of SELECT*/
EXPLAIN SELECT id,emp_Fname from td_cookbook.EMP_SAL;

1) First, we lock td_cookbook.EMP_SAL for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock td_cookbook.EMP_SAL for read.
3) We do an all-AMPs RETRIEVE step from td_cookbook.EMP_SAL by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

Now let's create a hash index on the EMP_SAL table:

/*Hash index*/
CREATE HASH INDEX td_cookbook.EMP_HASH_inx
(id, DEPT) ON td_cookbook.EMP_SAL
BY (id)
ORDER BY HASH (id);

Let's now check the explain plan of the select query after the hash index creation:

/*Select after hash index*/
EXPLAIN SELECT id, dept FROM td_cookbook.EMP_SAL;

1) First, we lock td_cookbook.EMP_HASH_INX for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock td_cookbook.EMP_HASH_INX for read.
3) We do an all-AMPs RETRIEVE step from td_cookbook.EMP_HASH_INX by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

How it works...

Points to consider about this hash index definition:

Each hash index row contains the id and DEPT columns. Strictly speaking, specifying id is unnecessary, since it is the primary index of the base table and would therefore be included automatically.
The BY clause indicates that the rows of this index will be distributed by the hash value of id.
The ORDER BY clause indicates that the index rows will be ordered on each AMP in sequence by the hash value of id.
The column specified in the BY clause must be one of the columns that make up the hash index, and the BY clause must be accompanied by the ORDER BY clause.
Unlike join indexes, hash indexes can only be defined on a single table.

We explored how to create different types of indexes to get maximum performance out of your database queries. If you found this article useful, do check out the book Teradata Cookbook and gain confidence in running a wide variety of data analytics and developing applications for the Teradata environment.

Why MongoDB is the most popular NoSQL database today
Why Oracle is losing the Database Race
Using the Firebase Real-Time Database

Implementing feedforward networks with TensorFlow

Aarthi Kumaraswamy
07 Jun 2018
12 min read
Deep feedforward networks, also called feedforward neural networks, are sometimes referred to as Multilayer Perceptrons (MLPs). The goal of a feedforward network is to approximate some function f∗. For example, for a classifier, y=f∗(x) maps an input x to a label y. A feedforward network defines a mapping from input to label, y=f(x;θ), and learns the value of the parameter θ that results in the best function approximation.

This tutorial is an excerpt from the book, Neural Network Programming with Tensorflow by Manpreet Singh Ghotra, and Rajdeep Dua. With this book, learn how to implement more advanced neural networks like CNNs, RNNs, GANs, deep belief networks and others in Tensorflow.

How do feedforward networks work?

Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications. Feedforward neural networks are called networks because they are composed of many different functions. These functions are composed in a directed acyclic graph, and the model is associated with this graph, which describes how the functions are combined. For example, three functions f(1), f(2), and f(3) can be connected to form f(x) =f(3)(f(2)(f(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model; it is from this terminology that the name deep learning arises. The final layer of a feedforward network is called the output layer.

Diagram showing various functions activated on input x to form a neural network

These networks are called neural because they are inspired by neuroscience. Each hidden layer is a vector, and the dimensionality of these hidden layers determines the width of the model.

Implementing feedforward networks with TensorFlow

Feedforward networks can be easily implemented using TensorFlow by defining placeholders for hidden layers, computing the activation values, and using them to calculate predictions. Let's take an example of classification with a feedforward network:

X = tf.placeholder("float", shape=[None, x_size])
y = tf.placeholder("float", shape=[None, y_size])
# initialize_weights is a small helper; see the sketch at the end of this article
weights_1 = initialize_weights((x_size, hidden_size), stddev)
weights_2 = initialize_weights((hidden_size, y_size), stddev)
sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))
y_pred = tf.matmul(sigmoid, weights_2)  # use a new name so the placeholder y is not overwritten

Once the predicted value tensor has been defined, we calculate the cost function:

cost = tf.reduce_mean(tf.nn.OPERATION_NAME(labels=<actual value>, logits=<predicted value>))
updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)

Here, OPERATION_NAME could be one of the following:

tf.nn.sigmoid_cross_entropy_with_logits: Calculates sigmoid cross entropy on incoming logits and labels:

sigmoid_cross_entropy_with_logits(
 _sentinel=None,
 labels=None,
 logits=None,
 name=None
)

_sentinel: Used to prevent positional parameters. Internal, do not use.
labels: A tensor of the same type and shape as logits.
logits: A tensor of type float32 or float64.

The formula implemented is (with x = logits, z = labels): max(x, 0) - x * z + log(1 + exp(-abs(x))).

tf.nn.softmax: Performs softmax activation on the incoming tensor. This only normalizes to make sure all the probabilities in a tensor row add up to one. It cannot be used directly as a classification loss:
softmax = exp(logits) / reduce_sum(exp(logits), dim)

logits: A non-empty tensor. Must be one of the following types: half, float32, or float64.
dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension.
name: A name for the operation (optional).

tf.nn.log_softmax: Calculates the log of the softmax function and helps in normalizing underfitting. This function is also just a normalization function:

log_softmax(
 logits,
 dim=-1,
 name=None
)

logits: A non-empty tensor. Must be one of the following types: half, float32, or float64.
dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension.
name: A name for the operation (optional).

tf.nn.softmax_cross_entropy_with_logits:

softmax_cross_entropy_with_logits(
 _sentinel=None,
 labels=None,
 logits=None,
 dim=-1,
 name=None
)

_sentinel: Used to prevent positional parameters. For internal use only.
labels: Each row labels[i] must be a valid probability distribution.
logits: Unscaled log probabilities.
dim: The class dimension. Defaults to -1, which is the last dimension.
name: A name for the operation (optional).

The preceding function computes softmax cross entropy between logits and labels. While the classes are mutually exclusive, their probabilities need not be; all that is required is that each row of labels is a valid probability distribution. For exclusive labels (where one and only one class is true at a time), use sparse_softmax_cross_entropy_with_logits.

tf.nn.sparse_softmax_cross_entropy_with_logits:

sparse_softmax_cross_entropy_with_logits(
 _sentinel=None,
 labels=None,
 logits=None,
 name=None
)

labels: Tensor of shape [d_0, d_1, ..., d_(r-1)] (where r is the rank of labels and result) and dtype int32 or int64. Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this operation is run on the CPU, and return NaN for the corresponding loss and gradient rows on the GPU.
logits: Unscaled log probabilities of shape [d_0, d_1, ..., d_(r-1), num_classes] and dtype float32 or float64.

The preceding function computes sparse softmax cross entropy between logits and labels. The probability of a given label is considered exclusive: soft classes are not allowed, and the labels vector must provide a single specific index for the true class for each row of logits.

tf.nn.weighted_cross_entropy_with_logits:

weighted_cross_entropy_with_logits(
 targets,
 logits,
 pos_weight,
 name=None
)

targets: A tensor of the same type and shape as logits.
logits: A tensor of type float32 or float64.
pos_weight: A coefficient to use on the positive examples.

This is similar to sigmoid_cross_entropy_with_logits(), except that pos_weight allows a trade-off between recall and precision by up- or down-weighting the cost of a positive error relative to a negative error.

Analyzing the Iris dataset with a TensorFlow feedforward network

Let's look at a feedforward example using the Iris dataset. You can download the dataset from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/iris.csv and the target labels from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/target.csv.

In the Iris dataset, we will use 150 rows of data made up of 50 samples from each of three Iris species: Iris setosa, Iris virginica, and Iris versicolor.

Petal geometry compared from three iris species: Iris Setosa, Iris Virginica, and Iris Versicolor.
In the dataset, each row contains data for each flower sample: sepal length, sepal width, petal length, petal width, and flower species. Flower species are stored as integers, with 0 denoting Iris setosa, 1 denoting Iris versicolor, and 2 denoting Iris virginica.

First, we will create a run() function that takes three parameters: hidden layer size h_size, standard deviation for weights stddev, and step size of stochastic gradient descent sgd_step:

def run(h_size, stddev, sgd_step):

Input data loading is done using the genfromtxt function in numpy. The Iris data loaded has a shape of L: 150 and W: 4. Data is loaded in the all_X variable. Target labels are loaded from target.csv in all_Y with the shape of L: 150, W: 3:

# the following imports and seed are needed by this snippet but are not shown
# in the excerpt; the exact seed value used in the book may differ
import numpy as np
from sklearn.model_selection import train_test_split
RANDOMSEED = 42

def load_iris_data():
    from numpy import genfromtxt
    data = genfromtxt('iris.csv', delimiter=',')
    target = genfromtxt('target.csv', delimiter=',').astype(int)
    # Prepend the column of 1s for bias
    L, W = data.shape
    all_X = np.ones((L, W + 1))
    all_X[:, 1:] = data
    num_labels = len(np.unique(target))
    all_y = np.eye(num_labels)[target]
    return train_test_split(all_X, all_y, test_size=0.33, random_state=RANDOMSEED)

Once the data is loaded, we initialize the weight matrices based on x_size, y_size, and h_size, with the standard deviation passed to the run() method:

x_size = 5
y_size = 3
h_size = 128 (or any other number chosen for the neurons in the hidden layer)

# Size of Layers
x_size = train_x.shape[1] # Input nodes: 4 features and 1 bias
y_size = train_y.shape[1] # Outcomes (3 iris flowers)

# variables
X = tf.placeholder("float", shape=[None, x_size])
y = tf.placeholder("float", shape=[None, y_size])
weights_1 = initialize_weights((x_size, h_size), stddev)
weights_2 = initialize_weights((h_size, y_size), stddev)

Next, we make the prediction using sigmoid as the activation function defined in the forward_propagation() function:

def forward_propagation(X, weights_1, weights_2):
    sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))
    y = tf.matmul(sigmoid, weights_2)
    return y

First, the sigmoid output is calculated from input X and weights_1. This is then used to calculate y as a matrix multiplication of sigmoid and weights_2:

y_pred = forward_propagation(X, weights_1, weights_2)
predict = tf.argmax(y_pred, dimension=1)

Next, we define the cost function and optimization using gradient descent. Let's look at the GradientDescentOptimizer being used. It is defined in the tf.train.GradientDescentOptimizer class and implements the gradient descent algorithm. To construct an instance, we use the following constructor and pass sgd_step as a parameter:

# constructor for GradientDescentOptimizer
__init__(
 learning_rate,
 use_locking=False,
 name='GradientDescent'
)

The arguments passed are explained here:

learning_rate: A tensor or a floating point value. The learning rate to use.
use_locking: If True, use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. The default name is "GradientDescent".

The following list shows the code to implement the cost function:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_pred))
updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)

Next, we will implement the following steps:

Initialize the TensorFlow session: sess = tf.Session()
Initialize all the variables using tf.initialize_all_variables(); the returned op is run to initialize the variables in the session.
Iterate over the steps (1 to 50).
For each example in train_x and train_y, execute updates_sgd.
Calculate the train_accuracy and test_accuracy.
We stored the accuracy for each step in a list so that we could plot a graph:

init = tf.initialize_all_variables()
steps = 50
sess.run(init)
x = np.arange(steps)
test_acc = []
train_acc = []
print("Step, train accuracy, test accuracy")

for step in range(steps):
    # Train with each example
    for i in range(len(train_x)):
        sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1], y: train_y[i: i + 1]})
    train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                             sess.run(predict, feed_dict={X: train_x, y: train_y}))
    test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                            sess.run(predict, feed_dict={X: test_x, y: test_y}))
    print("%d, %.2f%%, %.2f%%" % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
    test_acc.append(100. * test_accuracy)
    train_acc.append(100. * train_accuracy)

Code execution

Let's run this code for an h_size of 128, a standard deviation of 0.1, and an sgd_step of 0.01:

def run(h_size, stddev, sgd_step):
    ...

def main():
    run(128,0.1,0.01)

if __name__ == '__main__':
    main()

The preceding code outputs a graph that plots the steps against the test and train accuracy.

Let's compare the change in SGD steps and its effect on training accuracy. The following code is very similar to the previous code example, but we will rerun it for multiple SGD steps to see how the SGD step size affects accuracy levels:

def run(h_size, stddev, sgd_steps):
    ....
    test_accs = []
    train_accs = []
    time_taken_summary = []
    for sgd_step in sgd_steps:
        start_time = time.time()
        updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
        sess = tf.Session()
        init = tf.initialize_all_variables()
        steps = 50
        sess.run(init)
        x = np.arange(steps)
        test_acc = []
        train_acc = []
        print("Step, train accuracy, test accuracy")
        for step in range(steps):
            # Train with each example
            for i in range(len(train_x)):
                sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1], y: train_y[i: i + 1]})
            train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                     sess.run(predict, feed_dict={X: train_x, y: train_y}))
            test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                                    sess.run(predict, feed_dict={X: test_x, y: test_y}))
            print("%d, %.2f%%, %.2f%%" % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
            #x.append(step)
            test_acc.append(100. * test_accuracy)
            train_acc.append(100. * train_accuracy)
        end_time = time.time()
        diff = end_time - start_time
        time_taken_summary.append((sgd_step,diff))
        t = [np.array(test_acc)]
        t.append(train_acc)
        train_accs.append(train_acc)

The output of the preceding code is an array with the training and test accuracy for each SGD step value. In our example, we called run() with sgd_steps set to [0.01, 0.02, 0.03]:

def main():
    sgd_steps = [0.01,0.02,0.03]
    run(128,0.1,sgd_steps)

if __name__ == '__main__':
    main()

The resulting plot shows how training accuracy changes with sgd_steps. For an SGD value of 0.03, accuracy rises faster, as the step size is larger.

In this post, we built our first neural network, which was feedforward only, and used it for classifying the contents of the Iris dataset.

You enjoyed a tutorial from the book, Neural Network Programming with Tensorflow. To implement advanced neural networks like CNNs, RNNs, GANs, deep belief networks and others in Tensorflow, grab your copy today!

Neural Network Architectures 101: Understanding Perceptrons
How to Implement a Neural Network with Single-Layer Perceptron
Deep Learning Algorithms: How to classify Irises using multi-layer perceptrons
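One loose end from the code above: the snippets call an initialize_weights helper that this excerpt never defines. A minimal sketch consistent with how it is called (a shape tuple plus a standard deviation) might look like the following; the exact body in the book may differ:

import tensorflow as tf

def initialize_weights(shape, stddev):
    # Create a trainable weight matrix of the given shape,
    # initialized from a normal distribution with the given stddev
    weights = tf.random_normal(shape, stddev=stddev)
    return tf.Variable(weights)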

How TFLearn makes building TensorFlow models easier

Savia Lobo
04 Jun 2018
7 min read
Today, we will introduce you to TFLearn and create layers and models that can be used directly in any TensorFlow model implementation. TFLearn is a modular library in Python that is built on top of core TensorFlow. [box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn how to build TensorFlow models to work with multilayer perceptrons using Keras, TFLearn, and R.[/box]

TIP: TFLearn is different from the TensorFlow Learn package, which is also known as TF Learn (with one space between TF and Learn). TFLearn's documentation is available at http://tflearn.org and its source code is available on GitHub.

TFLearn can be installed in Python 3 with the following command:

pip3 install tflearn

Note: To install TFLearn in other environments or from source, please refer to the following link: http://tflearn.org/installation/

The simple workflow in TFLearn is as follows:

Create an input layer first.
Pass the input object to create further layers.
Add the output layer.
Create the net using an estimator layer such as regression.
Create a model from the net created in the previous step.
Train the model with the model.fit() method.
Use the trained model to predict or evaluate.

Creating the TFLearn Layers

Let us learn how to create the layers of the neural network models in TFLearn:

Create an input layer first:

input_layer = tflearn.input_data(shape=[None,num_inputs])

Pass the input object to create further layers:

layer1 = tflearn.fully_connected(input_layer,10, activation='relu')
layer2 = tflearn.fully_connected(layer1,10, activation='relu')

Add the output layer:

output = tflearn.fully_connected(layer2,n_classes, activation='softmax')

Create the final net from the estimator layer such as regression:

net = tflearn.regression(output,
 optimizer='adam',
 metric=tflearn.metrics.Accuracy(),
 loss='categorical_crossentropy'
)

TFLearn provides several classes for layers, which are described in the following sub-sections.

TFLearn core layers

TFLearn offers the following layers in the tflearn.layers.core module:

input_data: This layer is used to specify the input layer for the neural network.
fully_connected: This layer is used to specify a layer where all the neurons are connected to all the neurons in the previous layer.
dropout: This layer is used to specify the dropout regularization. The input elements are scaled by 1/keep_prob while keeping the expected sum unchanged.
custom_layer: This layer is used to specify a custom function to be applied to the input. This class wraps our custom function and presents the function as a layer.
reshape: This layer reshapes the input into the output of the specified shape.
flatten: This layer converts the input tensor to a 2D tensor.
activation: This layer applies the specified activation function to the input tensor.
single_unit: This layer applies the linear function to the inputs.
highway: This layer implements the fully connected highway function.
one_hot_encoding: This layer converts the numeric labels to their binary vector one-hot encoded representations.
time_distributed: This layer applies the specified function to each time step of the input tensor.
multi_target_data: This layer creates and concatenates multiple placeholders, specifically used when the layers use targets from multiple sources.
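As an illustration of the core layers in action, the following sketch (not from the book) builds a tiny network using input_data, custom_layer, fully_connected, and dropout; the scaling function is an arbitrary example:

import tensorflow as tf
import tflearn

# An arbitrary user-defined transformation, wrapped as a layer
def scale_inputs(x):
    return tf.multiply(x, 0.5)

net = tflearn.input_data(shape=[None, 784])          # input layer
net = tflearn.custom_layer(net, scale_inputs)        # apply the custom function
net = tflearn.fully_connected(net, 64, activation='relu')
net = tflearn.dropout(net, 0.8)                      # keep probability of 0.8
net = tflearn.fully_connected(net, 10, activation='softmax')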
TFLearn convolutional layers

TFLearn offers the following layers in the tflearn.layers.conv module:

conv_1d: This layer applies 1D convolutions to the input data.
conv_2d: This layer applies 2D convolutions to the input data.
conv_3d: This layer applies 3D convolutions to the input data.
conv_2d_transpose: This layer applies the transpose of conv_2d to the input data.
conv_3d_transpose: This layer applies the transpose of conv_3d to the input data.
atrous_conv_2d: This layer computes a 2-D atrous convolution.
grouped_conv_2d: This layer computes a depth-wise 2-D convolution.
max_pool_1d: This layer computes 1-D max pooling.
max_pool_2d: This layer computes 2D max pooling.
avg_pool_1d: This layer computes 1D average pooling.
avg_pool_2d: This layer computes 2D average pooling.
upsample_2d: This layer applies the row and column wise 2-D repeat operation.
upscore_layer: This layer implements the upscore as specified in http://arxiv.org/abs/1411.4038.
global_max_pool: This layer implements the global max pooling operation.
global_avg_pool: This layer implements the global average pooling operation.
residual_block: This layer implements the residual block to create deep residual networks.
residual_bottleneck: This layer implements the residual bottleneck block for deep residual networks.
resnext_block: This layer implements the ResNeXt block.

TFLearn recurrent layers

TFLearn offers the following layers in the tflearn.layers.recurrent module:

simple_rnn: This layer implements the simple recurrent neural network model.
bidirectional_rnn: This layer implements the bi-directional RNN model.
lstm: This layer implements the LSTM model.
gru: This layer implements the GRU model.

TFLearn normalization layers

TFLearn offers the following layers in the tflearn.layers.normalization module:

batch_normalization: This layer normalizes the output of activations of previous layers for each batch.
local_response_normalization: This layer implements the LR normalization.
l2_normalization: This layer applies the L2 normalization to the input tensors.

TFLearn embedding layers

TFLearn offers only one layer in the tflearn.layers.embedding_ops module:

embedding: This layer implements the embedding function for a sequence of integer IDs or floats.

TFLearn merge layers

TFLearn offers the following layers in the tflearn.layers.merge_ops module:

merge_outputs: This layer merges the list of tensors into a single tensor, generally used to merge output tensors of the same shape.
merge: This layer merges the list of tensors into a single tensor; you can specify the axis along which the merge needs to be done.

TFLearn estimator layers

TFLearn offers only one layer in the tflearn.layers.estimator module:

regression: This layer implements linear or logistic regression.

While creating the regression layer, you can specify the optimizer and the loss and metric functions. TFLearn offers the following optimizer functions as classes in the tflearn.optimizers module:

SGD
RMSprop
Adam
Momentum
AdaGrad
Ftrl
AdaDelta
ProximalAdaGrad
Nesterov

Note: You can create custom optimizers by extending the tflearn.optimizers.Optimizer base class.

TFLearn offers the following metric functions as classes or ops in the tflearn.metrics module:

Accuracy or accuracy_op
Top_k or top_k_op
R2 or r2_op
WeightedR2 or weighted_r2_op
binary_accuracy_op

Note: You can create custom metrics by extending the tflearn.metrics.Metric base class.
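To show how these optimizer and metric classes plug into the regression estimator layer, here is a brief sketch (not from the book; the hyperparameter values are arbitrary):

import tflearn

# assumes `net` is the output layer of a network built as shown earlier
sgd = tflearn.optimizers.SGD(learning_rate=0.01, lr_decay=0.96, decay_step=100)
acc = tflearn.metrics.Accuracy()

net = tflearn.regression(net, optimizer=sgd, metric=acc,
                         loss='categorical_crossentropy')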
TFLearn provides the following loss functions, known as objectives, in the tflearn.objectives module:

softmax_categorical_crossentropy
categorical_crossentropy
binary_crossentropy
weighted_crossentropy
mean_square
hinge_loss
roc_auc_score
weak_cross_entropy_2d

While specifying the input, hidden, and output layers, you can specify the activation functions to be applied to the output. TFLearn provides the following activation functions in the tflearn.activations module:

linear
tanh
sigmoid
softmax
softplus
softsign
relu
relu6
leaky_relu
prelu
elu
crelu
selu

Creating the TFLearn Model

Create the model from the net created in the previous step (step 4 in the creating the TFLearn layers section):

model = tflearn.DNN(net)

Types of TFLearn models

TFLearn offers two different classes of models:

DNN (Deep Neural Network) model: This class allows you to create a multilayer perceptron from the network that you have created from the layers.
SequenceGenerator model: This class allows you to create a deep neural network that can generate sequences.

Training the TFLearn Model

After creating the model, train it with the model.fit() method:

model.fit(X_train, Y_train,
 n_epoch=n_epochs,
 batch_size=batch_size,
 show_metric=True,
 run_id='dense_model')

Using the TFLearn Model

Use the trained model to predict or evaluate:

score = model.evaluate(X_test, Y_test)
print('Test accuracy:', score[0])

The complete code for the TFLearn MNIST classification example is provided in the notebook ch-02_TF_High_Level_Libraries. The output from the TFLearn MNIST example is as follows:

Training Step: 5499 | total loss: 0.42119 | time: 1.817s
| Adam | epoch: 010 | loss: 0.42119 - acc: 0.8860 -- iter: 54900/55000
Training Step: 5500 | total loss: 0.40881 | time: 1.820s
| Adam | epoch: 010 | loss: 0.40881 - acc: 0.8854 -- iter: 55000/55000
--
Test accuracy: 0.9029

Note: You can get more information about TFLearn from the following link: http://tflearn.org/.

To summarize, we got to know about TFLearn and the different TFLearn layers and models. If you found this post useful, do check out the book Mastering TensorFlow 1.x, to explore advanced features of TensorFlow 1.x, and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet.

TensorFlow.js 0.11.1 releases!
How to Build TensorFlow Models for Mobile and Embedded devices
Distributed TensorFlow: Working with multiple GPUs and servers

Data cleaning is the worst part of data analysis, say data scientists

Amey Varangaonkar
04 Jun 2018
5 min read
The year was 2012. Harvard Business Review had famously declared the role of data scientist the 'sexiest job of the 21st century'. Companies were slowly working with more data than ever before, and the real, actionable, commercial value of that data was slowly beginning to emerge. Someone who could derive these actionable insights from the data was needed. The demand for data scientists was higher than ever.

Fast forward to 2018 - more data has been collected in the last 2 years than ever before. Data scientists are still in high demand, and the need for insights is higher than ever. There has been one significant change, though - the process of deriving insights has become more complex. If you ask data scientists, the first phase of this process, which involves data cleansing, has become a lot more cumbersome. So much so that it is no longer a myth that data scientists spend almost 80% of their time cleaning and readying the data for analysis.

Why data cleaning is a nightmare

In the recently conducted Packt Skill-Up survey, we asked data professionals what the worst part of the data analysis process was, and a staggering 50% responded with data cleaning.

Source: Packt Skill Up Survey

We dug into this to understand why so many data science professionals share this dislike of data cleaning, or scrubbing, as many call it.

Read the Skill Up report in full. Sign up to our weekly newsletter and download the PDF for free.

There is no consistent data format

Organizations these days work with a lot of data. Some of it is in a structured, readily understandable format. This kind of data is usually quite easy to clean, parse and analyze. However, some of the data is really messy and cannot be used as-is for analysis. This includes missing data, irregularly formatted data, and irrelevant data which is not worth analyzing at all. There is also the problem of working with unstructured data, which needs to be pre-processed before it is worth analyzing. Audio or video files, email messages, presentations, XML documents and web pages are some classic examples of this.

There's too much data to be cleaned

The volume of data that businesses deal with on a day-to-day basis is on the scale of terabytes or even petabytes. Making sense of all this data, coming from a variety of sources and in different formats, is undoubtedly a huge task. There is a whole host of tools designed to ease this process today, but sifting through large volumes of data and preparing it for analysis remains an incredibly tricky challenge.

Data cleaning is tricky and time-consuming

Data cleansing can be quite an exhausting and time-consuming task, especially for data scientists. Cleaning the data requires the removal of duplications, removing or replacing missing entries, correcting misfielded values, ensuring consistent formatting, and a host of other tasks which take a considerable amount of time. Once the data is cleaned, it needs to be placed in a secure location. Also, a log of the entire process needs to be kept to ensure the right data goes through the right process. All of this requires data scientists to create a well-designed data-scrubbing framework to avoid the risk of repetition. All of this is grunt work and requires a lot of manual effort. Sadly, there are no tools in the market which can effectively automate this process.
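To make the grunt work concrete, here is a small, illustrative cleaning pass in pandas; the file and column names are hypothetical, and a real pipeline would be driven by the business's own quality rules:

import pandas as pd

df = pd.read_csv('raw_data.csv')                              # hypothetical input file

df = df.drop_duplicates()                                     # remove duplications
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')   # correct misfielded values
df['date'] = pd.to_datetime(df['date'], errors='coerce')      # ensure consistent formatting
df = df.dropna(subset=['customer_id'])                        # drop rows missing a key field
df['region'] = df['region'].fillna('unknown')                 # replace missing entries

df.to_csv('clean_data.csv', index=False)                      # store the cleaned data in a known location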
Outsourcing the process is expensive

Given that data cleaning is a rather tedious job, many businesses think of outsourcing the task to third-party vendors. While this reduces a lot of time and effort on the company's end, it definitely increases the cost of the overall process. Many small and medium-scale businesses may not be able to afford this, and thus are heavily reliant on the data scientist to do the job for them.

You can hate it, but you cannot ignore it

It is quite obvious that data scientists need clean, ready-to-analyze data if they are to extract actionable business insights from it. Some data scientists equate data cleaning to donkey work, suggesting there's not a lot of innovation involved in this process. Others, however, believe data cleaning is vitally important and pay special attention to it, since once it is done right, most of the problems in data analysis are solved. It is very difficult to take advantage of the intrinsic value offered by a dataset if it does not adhere to the quality standards set by the business, making data cleaning a crucial component of the data analysis process.

Now that you know why data cleaning is essential, why not dive deeper into the technicalities? Check out our book Practical Data Wrangling for expert tips on turning your noisy data into relevant, insight-ready information using R and Python.

Read more

Cleaning Data in PDF Files
30 common data science terms explained
How to create a strong data science project portfolio that lands you a job

Visualizing BigQuery Data with Tableau

Sugandha Lahoti
04 Jun 2018
8 min read
Tableau is an interactive data visualization tool that can be used to create business intelligence dashboards. Much like most business intelligence tools, it can be used to pull and manipulate data from a number of sources. The difference is its dedication to help users create insightful data visualizations. Tableau's drag-and-drop interface makes it easy for users to explore data via elegant charts. It also includes an in-memory engine in order to speed up calculations on extremely large data sets. In today’s tutorial, we will be using Tableau Desktop for visualizing BigQuery Data. [box type="note" align="" class="" width=""]This article is an excerpt from the book, Learning Google BigQuery, written by Thirukkumaran Haridass and Eric Brown. This book is a comprehensive guide to mastering Google BigQuery to get intelligent insights from your Big Data.[/box] The following section explains how to use Tableau Desktop Edition to connect to BigQuery and get the data from BigQuery to create visuals: After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery: At this point, all the tables in your dataset should be displayed on the left: You can drag and drop the table you are interested in using to the middle section labeled Drop Tables Here. In this case, we want to query the Google Analytics BigQuery test data, so we will click where it says New Custom SQL and enter the following query in the dialog: SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits FROM `google.com:analytics- bigquery.LondonCycleHelmet.ga_sessions_20130910` GROUP BY Medium Now we can click on Update Now to view the first 10,000 rows of our data. We can also do some simple transformations on our columns, such as changing string values to dates and many others. At the bottom, click on the tab titled Sheet 1 to enter the worksheet view. Tableau's interface allows users to simply drag and drop dimensions and metrics from the left side of the report into the central part to create simple text charts, with a feel much like Excel's pivot chart functionality. This makes Tableau easy to transition to for Excel users. From the Dimensions section on the left-hand-side navigation, drag and drop the Medium dimension into the sheet section. Then drag the Visits metric in the Metric section on the left-hand-side navigation to the Text sub-section in the Marks section. This will create a simple text chart with data from the original query: On the right, click on the button marked Show Me. This should bring up a screen with icons for each graph type that can be created in Tableau: Tableau helps by shading graph types that are not available based on the data that is currently selected in the report. It will also make suggestions based on the data available. In this case, a bar chart has been preselected for us as our data is a text dimension and a numeric metric. Click on the bar chart. Once clicked, the default sideways bar chart will appear with the data we have selected. Click on the Swap Rows and Columns in the icon bar at the top of the screen to flip the chart from horizontal to vertical: Map charts in Tableau One of Tableau's strengths is its ease of use when creating a number of different types of charts. This is true when creating maps, especially because maps can be very painful to create using other tools. Here is the way to create a simple map in Tableau using BigQuery public data. 
The first few steps are the same as in the preceding example:

After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery.
At this point, all the tables in your dataset should be displayed on the left-hand side.
Click where it says New Custom SQL and enter the following query in the dialog:

SELECT zipcode, SUM(population) AS population
FROM `bigquery-public-data.census_bureau_usa.population_by_zip_2010`
GROUP BY zipcode
ORDER BY population desc

This data is from the United States Census from 2010. The query returns all zip codes in the USA, sorted from most populous to least populous.

At the bottom, click on the tab titled Sheet 1 to enter the worksheet view.
Double-click on the zipcode dimension in the dimensions section on the left navigation.

Clicking on a dimension of zip codes (or any other appropriately formatted location dimension, such as latitude/longitude, country names, state names, and so on) will automatically create a map in Tableau.

Drag the population metric from the metrics section on the left navigation and drop it on the color tab in the marks section.

The map will now show the most populous zip codes shaded darker than the less populous zip codes. The map chart also includes zoom features in order to make dealing with large maps easy. In the top-left corner of the map, there is a magnifying glass icon that holds the map zoom features. Clicking on the arrow at the bottom of this icon opens more features. The icon with a rectangle and a magnifying glass is the selection tool (the first icon to the right of the arrow when hovering over it).

Click on this icon and then on the map to select a section of the map to zoom into. The image here shows the result of zooming into the California area of the United States; the map now shows the areas of the state that are the most populous.

Create a word cloud in Tableau

Word clouds are great visualizations for finding the words that are most referenced in books, publications, and social media. This section will cover creating a word cloud in Tableau using BigQuery public data. The first few steps are the same as in the preceding example:

After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery.
At this point, all the tables in your dataset should be displayed on the left.
Click where it says New Custom SQL and enter the following query in the dialog:

SELECT word, SUM(word_count) word_count
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY word_count desc

The dataset is from the works of William Shakespeare. The query returns a list of all the words in his works, along with a count of the times each word appears in one of his works.

At the bottom, click on the tab titled Sheet 1 to enter the worksheet view.
From the dimensions section, drag and drop the word dimension into the text tab in the marks section.
From the measures section, drag and drop the word_count measure onto the size tab in the marks section. There will now be two tabs in use in the marks section.
Right-click on the size tab labeled word and select Measure | Count.

This will create what is called a tree map. In this example, there are far too many words in the list for the visualization to be useful.

Drag and drop the word_count measure from the measures section to the filters section. When prompted with How do you want to filter on word_count, select Sum and click on Next.
Select At Least for your condition and type 2000 in the dialog. Click on OK. This will return only those words that have a word count of at least 2,000.
Use the dropdown in the marks card to select Text.
Drag and drop the word_count measure from the measures section to the color tab in the marks section. This will color each word based on the count for that word.

You should be left with a color-coded word cloud. Other charts can now be created as individual worksheet tabs, and tabs can then be combined to make what Tableau calls a dashboard. The process of creating a dashboard here is a bit more cumbersome than creating one in Google Data Studio, but Tableau offers a great deal more customization for its dashboards. This, coupled with all the other features it offers, makes Tableau a much more attractive option, especially for enterprise users.

We learnt various features of Tableau and how to use it for visualizing BigQuery data. To learn about other third-party tools for reporting and visualization, such as R and Google Data Studio, check out the book Learning Google BigQuery.

Tableau is the most powerful and secure end-to-end analytics platform - Interview Insights
Tableau 2018.1 brings new features to help organizations easily scale analytics
Getting started with Data Visualization in Tableau
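As a closing note on the word cloud recipe, the at-least-2,000 filter could equally be pushed into the custom SQL itself, so that BigQuery does the filtering before Tableau ever sees the rows. This is a sketch using the same public dataset, not a step from the book:

-- Push the word-count threshold into BigQuery with HAVING
SELECT word, SUM(word_count) AS word_count
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
HAVING SUM(word_count) >= 2000
ORDER BY word_count DESC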

We must change how we think about AI, urge AI founding fathers

Neil Aitken
31 May 2018
9 min read
In Manhattan, nearly 15,000 taxis make around 30 journeys each, per day. That's nearly half a million paid trips. The yellow cabs are part of the never-ending, slow progression of vehicles which churn through the streets of New York. The good news is, after a century of worsening traffic, congestion is about to be ameliorated, at least to a degree. Researchers at MIT announced this week that they have developed an algorithm to optimise the way taxis find their customers. Their product is allegedly so efficient, it can reduce the required number of cabs (for now, the ones with human drivers) in Manhattan by a third. That's a non-trivial improvement. The trick, apparently, is to use the cabs as a hustler might cue the ball in pool – lining up the next pickup to start where the last drop-off ended.

The technology behind the improvement offered by the MIT research team is the same one that is behind most of the incredible technology news stories of the last 3 years – Artificial Intelligence. AI is now a part of most of the digital interactions we have. It fuels the recommendation engines in YouTube, Spotify and Netflix. It shows you products you might like in Google's search results and on Amazon's homepage. Undoubtedly, AI is the hot topic of the time – as you cannot possibly have failed to notice.

How AI was created – and nearly died

AI was, until recently, a long forgotten scientific curiosity, employed seriously only in sci-fi movies. The technology fell into a 'winter' – a time when AI-related projects couldn't get funding and decision makers had given up on the technology – in the late 1980s. It was at that time that much of the fundamental work which underpins today's AI, concepts like neural networks and backpropagation, was codified.

Artificial Intelligence is now enjoying a rebirth. Almost every new idea funded by venture capitalists has AI baked in. The potential excites business owners, especially those involved in the technology sphere, and scares governments in equal measure. It offers better profits and the potential for mass unemployment as if they are two sides of the same coin. It is a once-in-a-generation technology improvement, similar to air conditioning, the mass-produced motor car and the smartphone, in that it can be applied to all aspects of the economy at the same time. Just as the iPhone propelled telecommunications technology forward, and created billions of dollars of sales for phone companies selling mobile data plans, AI is fueling totally new businesses and making existing operations significantly more efficient.

Behind the fanfare associated with AI, however, lies a simple truth. Today's AI algorithms use what's called 'narrow' or 'domain specific' intelligence. In simple terms, each current AI implementation is specific to the job it is given. IBM trained their AI system 'Watson' to beat human contestants at 'Jeopardy!' When Google wanted to build an 'AI product' that could beat a living counterpart at the Chinese board game 'Go', they created a new AI system. And so on. A new task requires a new AI system.

Judea Pearl, inventor of Bayesian networks and Turing Awardee
On AI systems that can move from predicting what will happen to what will cause something

Now, one of the people behind those original concepts from the 1980s, which underpin today's AI solutions, is back with an even bigger idea which might push AI forward.
Judea Pearl, Chancellor's professor of computer science and statistics at UCLA, and a distinguished visiting professor at the Technion, Israel Institute of Technology, was awarded the Turing Award in 2011 for the Bayesian mathematical models which gave modern AI its strength. Pearl's fundamental contribution to computer science was in providing the logic and decision-making framework for computers to operate under uncertainty. Some say it was he who provided the spark which thawed that AI winter.

Today, he laments the current state of AI, concerned that the field has evolved very little in the three decades since his important theory was presented. Pearl likens current AI implementations to simple tools which can tell you what's likely to come next, based on the recognition of a familiar pattern. For example, a medical AI algorithm might be able to look at X-rays of a human chest and 'discern' that the patient has, or does not have, lung cancer based on patterns it has learnt from its training datasets. The AI in this scenario doesn't 'know' what lung cancer is or what a tumor is. Importantly, it is a very long way from understanding that smoking can cause the affliction.

What's needed in AI next, says Pearl, is a critical difference: AIs which are evolved to the point where they can determine not just what will happen next, but what will cause it. It's a fundamental improvement, of the same magnitude as his earlier contributions.

Causality – what Pearl is proposing – is one of the most basic units of scientific thought and progress. The ability to conduct a repeatable experiment, showing that A caused B, in multiple locations, and have independent peers review the results is one of the fundamentals of establishing truth. In his most recent publication, 'The Book Of Why', Pearl outlines how we can get AI from where it is now to where it can develop an understanding of these causal relationships. He believes the first step is to cement the building blocks of reality – 'what is a lung', 'what is smoke' – and that we'll be able to do this in the next 10 years.

Geoff Hinton, inventor of backprop and capsule nets
On AI which more closely mimics the human brain

Geoff Hinton was the mind behind backpropagation, another of the fundamental technologies which has brought AI to the point it is at today. To progress AI, however, he says we might have to start all over again. Hinton has developed (and produced two papers for the University of Toronto to articulate) a new way of training AI systems, involving something he calls 'Capsule Networks' – a concept he's been working on for 30 years, in an effort to improve the capabilities of the backpropagation algorithms he developed.

Capsule networks operate in a manner similar to the human brain. When we see an image, our brain breaks it down into its components and processes them in parallel. Some brain neurons recognise edges through contrast differences. Others look for corners by examining the points at which edges intersect. Capsule Networks are similar, several acting on a picture at one time, identifying, for example, an ear or a nose on an animal, irrespective of the angle from which it is being viewed.

This is a big deal because, until now, CNNs (convolutional neural networks), the set of AI algorithms most often used in image and video recognition systems, could recognize images as well as humans do. CNNs, however, find it hard to recognize images if their angle is changed.
It’s too early to judge whether capsule networks are the key to the next step in the AI revolution, but in many tasks, Capsule Networks are identifying images faster and more accurately than current capabilities allow. Andrew Ng, Chief Scientist at Baidu On AI that can learn without humans Andrew Ng is the co-inventor of Google Brain, the team and project that Alphabet put together in 2011 to explore Artificial Intelligence. He now works for Baidu, China’s most successful search engine – analogous in size and scope to Google in the rest of the world. At the moment, he heads up Baidu’s Silicon Valley AI research facility. Beyond concerns over potential job displacement caused by AI, an issue so significant he says it is perhaps all we should be thinking about when it comes to Artificial Intelligence, he suggests that, in the future, the most progress will be made when AI systems can team themselves without human involvement. At the moment, training an AI, even on something that, to us is simple, such as what a cat looks like, is a complicated process. The procedure involves ‘supervised learning.’ It’s shown a lot of pictures (when they did this at Google, they used 10 million images), some of which are cats - labelled appropriately by humans. Once a sufficient level of ‘education’ has been undertaken, the AI can then accurately label cats, most of the time. Ng thinks supervision is problematic, he describes it as having an Achilles heel in the form of the quantity of data that is required. To go beyond current capabilities, says Ng, will require a completely new type of technology – one which can learn through ‘unsupervised learning’ -  machines learning from data that has not been classified by humans. Progress on unsupervised learning is slow. At both Baidu and Google, engineers are focussing on constrained versions of unsupervised learning such as training AI systems to learn about a human face and then using them to create a face themselves. The activity requires that the AI develops what we would call an ‘internal representation’ of a face – something which is required in any unsupervised learning. Other avenues to train without supervision include, ingeniously, pitting an AI system against a computer game – an environment in which they receive feedback (through points awarded in the game) for ‘constructive’ activities, but within which they are not taught directly by a human. Next generation AI depends on ‘scrubbing away’ existing assumptions Artificial Intelligence, as it stands will deliver economy wide efficiency improvements, the likes of which we have not seen in decades. It seems incredible to think that the field is still in its infancy when it can deliver such substantial benefits – like reduced traffic congestion, lower carbon emissions and saved time in New York Taxis. But it is. Isaac Azimov who developed his own concepts behind how Artificial Intelligence might be trained with simple rules said “Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in.” The author should rest assured. Between them, Pearl, Hinton and Ng are each taking revolutionary approaches to elevate AI beyond even the incredible heights it has reached, and starting without reference to the concepts which have brought us this far. 
5 polarizing Quotes from Professor Stephen Hawking on artificial intelligence Toward Safe AI – Maximizing your control over Artificial Intelligence Decoding the Human Brain for Artificial Intelligence to make smarter decisions
How to build Deep convolutional GAN using TensorFlow and Keras

Savia Lobo
29 May 2018
13 min read
In this tutorial, we will learn to build both simple and deep convolutional GAN models with the help of TensorFlow and Keras deep learning frameworks. [box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango.[/box]

Simple GAN with TensorFlow

For building the GAN with TensorFlow, we build three networks, two discriminator models and one generator model, with the following steps:

Start by adding the hyper-parameters for defining the network:

# graph hyperparameters
g_learning_rate = 0.00001
d_learning_rate = 0.01
n_x = 784  # number of pixels in the MNIST image
# number of hidden layers for generator and discriminator
g_n_layers = 3
d_n_layers = 1
# neurons in each hidden layer
g_n_neurons = [256, 512, 1024]
d_n_neurons = [256]
# define parameter dictionary
d_params = {}
g_params = {}
activation = tf.nn.leaky_relu
w_initializer = tf.glorot_uniform_initializer
b_initializer = tf.zeros_initializer

Next, define the generator network:

z_p = tf.placeholder(dtype=tf.float32, name='z_p', shape=[None, n_z])
layer = z_p
# add generator network weights, biases and layers
with tf.variable_scope('g'):
    for i in range(0, g_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        g_params[w_name] = tf.get_variable(
            name=w_name,
            shape=[n_z if i == 0 else g_n_neurons[i - 1],
                   g_n_neurons[i]],
            initializer=w_initializer())
        b_name = 'b_{0:04d}'.format(i)
        g_params[b_name] = tf.get_variable(
            name=b_name,
            shape=[g_n_neurons[i]],
            initializer=b_initializer())
        layer = activation(
            tf.matmul(layer, g_params[w_name]) + g_params[b_name])
    # output (logit) layer
    i = g_n_layers
    w_name = 'w_{0:04d}'.format(i)
    g_params[w_name] = tf.get_variable(
        name=w_name,
        shape=[g_n_neurons[i - 1], n_x],
        initializer=w_initializer())
    b_name = 'b_{0:04d}'.format(i)
    g_params[b_name] = tf.get_variable(
        name=b_name,
        shape=[n_x],
        initializer=b_initializer())
    g_logit = tf.matmul(layer, g_params[w_name]) + g_params[b_name]
    g_model = tf.nn.tanh(g_logit)

Next, define the weights and biases for the two discriminator networks that we shall build:

with tf.variable_scope('d'):
    for i in range(0, d_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        d_params[w_name] = tf.get_variable(
            name=w_name,
            shape=[n_x if i == 0 else d_n_neurons[i - 1],
                   d_n_neurons[i]],
            initializer=w_initializer())
        b_name = 'b_{0:04d}'.format(i)
        d_params[b_name] = tf.get_variable(
            name=b_name,
            shape=[d_n_neurons[i]],
            initializer=b_initializer())
    # output (logit) layer
    i = d_n_layers
    w_name = 'w_{0:04d}'.format(i)
    d_params[w_name] = tf.get_variable(
        name=w_name,
        shape=[d_n_neurons[i - 1], 1],
        initializer=w_initializer())
    b_name = 'b_{0:04d}'.format(i)
    d_params[b_name] = tf.get_variable(
        name=b_name,
        shape=[1],
        initializer=b_initializer())

Now, using these parameters, build the discriminator that takes the real images as input and outputs the classification:

# define discriminator_real
# input real images
x_p = tf.placeholder(dtype=tf.float32, name='x_p', shape=[None, n_x])
layer = x_p
with tf.variable_scope('d'):
    for i in range(0, d_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        b_name = 'b_{0:04d}'.format(i)
        layer = activation(
            tf.matmul(layer, d_params[w_name]) + d_params[b_name])
        layer = tf.nn.dropout(layer, 0.7)
    # output (logit) layer
    i = d_n_layers
    w_name = 'w_{0:04d}'.format(i)
    b_name = 'b_{0:04d}'.format(i)
    d_logit_real = tf.matmul(layer, d_params[w_name]) + d_params[b_name]
    d_model_real = tf.nn.sigmoid(d_logit_real)

Next, build another discriminator network with the same parameters, but providing the output of the generator as input:

# define discriminator_fake
# input generated fake images
z = g_model
layer = z
with tf.variable_scope('d'):
    for i in range(0, d_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        b_name = 'b_{0:04d}'.format(i)
        layer = activation(
            tf.matmul(layer, d_params[w_name]) + d_params[b_name])
        layer = tf.nn.dropout(layer, 0.7)
    # output (logit) layer
    i = d_n_layers
    w_name = 'w_{0:04d}'.format(i)
    b_name = 'b_{0:04d}'.format(i)
    d_logit_fake = tf.matmul(layer, d_params[w_name]) + d_params[b_name]
    d_model_fake = tf.nn.sigmoid(d_logit_fake)

Now that we have the three networks built, the connection between them is made using the loss, optimizer, and training functions. While training the generator, we only train the generator's parameters, and while training the discriminator, we only train the discriminator's parameters. We specify this using the var_list parameter to the optimizer's minimize() function. Here is the complete code for defining the loss, optimizer, and training function for both kinds of network:

g_loss = -tf.reduce_mean(tf.log(d_model_fake))
d_loss = -tf.reduce_mean(tf.log(d_model_real) + tf.log(1 - d_model_fake))
g_optimizer = tf.train.AdamOptimizer(g_learning_rate)
d_optimizer = tf.train.GradientDescentOptimizer(d_learning_rate)
g_train_op = g_optimizer.minimize(g_loss,
                                  var_list=list(g_params.values()))
d_train_op = d_optimizer.minimize(d_loss,
                                  var_list=list(d_params.values()))

Now that we have defined the models, we have to train them. The training is done as per the following algorithm:

For each epoch:
    For each batch:
        get real images x_batch
        generate noise z_batch
        train discriminator using z_batch and x_batch
        generate noise z_batch
        train generator using z_batch

The complete code for training from the notebook is as follows:

n_epochs = 400
batch_size = 100
n_batches = int(mnist.train.num_examples / batch_size)
n_epochs_print = 50

with tf.Session() as tfs:
    tfs.run(tf.global_variables_initializer())
    for epoch in range(n_epochs):
        epoch_d_loss = 0.0
        epoch_g_loss = 0.0
        for batch in range(n_batches):
            x_batch, _ = mnist.train.next_batch(batch_size)
            x_batch = norm(x_batch)
            z_batch = np.random.uniform(-1.0, 1.0,
                                        size=[batch_size, n_z])
            feed_dict = {x_p: x_batch, z_p: z_batch}
            _, batch_d_loss = tfs.run([d_train_op, d_loss],
                                      feed_dict=feed_dict)
            z_batch = np.random.uniform(-1.0, 1.0,
                                        size=[batch_size, n_z])
            feed_dict = {z_p: z_batch}
            _, batch_g_loss = tfs.run([g_train_op, g_loss],
                                      feed_dict=feed_dict)
            epoch_d_loss += batch_d_loss
            epoch_g_loss += batch_g_loss
        if epoch % n_epochs_print == 0:
            average_d_loss = epoch_d_loss / n_batches
            average_g_loss = epoch_g_loss / n_batches
            print('epoch: {0:04d} d_loss = {1:0.6f} g_loss = {2:0.6f}'
                  .format(epoch, average_d_loss, average_g_loss))
            # predict images using generator model trained
            x_pred = tfs.run(g_model, feed_dict={z_p: z_test})
            display_images(x_pred.reshape(-1, pixel_size, pixel_size))

We printed the generated images every 50 epochs. As we can see, the generator was producing just noise in epoch 0, but by epoch 350, it was trained to produce much better shapes of handwritten digits. You can try experimenting with epochs, regularization, network architecture, and other hyper-parameters to see if you can produce even faster and better results.
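Note that the notebook code above references a few names that this excerpt never defines: mnist, the norm() and display_images() helpers, and the n_z, z_test, and pixel_size variables. The following is a minimal sketch of plausible definitions, written against the TensorFlow 1.x MNIST helpers; the n_z value of 256 is inferred from the GAN summary shown later in the Keras section, and everything here should be treated as an assumption rather than the book's exact notebook code:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# load MNIST; each image is a flat vector of 28 x 28 = 784 pixels
mnist = input_data.read_data_sets('./mnist', one_hot=True)
pixel_size = 28
n_z = 256          # size of the noise vector fed to the generator
n_samples = 8      # number of images to preview at each print epoch
z_test = np.random.uniform(-1.0, 1.0, size=[n_samples, n_z])

def norm(x):
    # rescale pixel values from [0, 1] to [-1, 1] so that real images
    # match the tanh output range of the generator
    return (x - 0.5) * 2.0

def display_images(images):
    # plot a row of generated digit images
    fig, axes = plt.subplots(1, len(images), figsize=(len(images), 1))
    for ax, img in zip(axes, images):
        ax.imshow(img, cmap='gray')
        ax.axis('off')
    plt.show()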
Simple GAN with Keras

Now let us implement the same model in Keras. The hyper-parameter definitions remain the same as in the last section:

# graph hyperparameters
g_learning_rate = 0.00001
d_learning_rate = 0.01
n_x = 784  # number of pixels in the MNIST image
# number of hidden layers for generator and discriminator
g_n_layers = 3
d_n_layers = 1
# neurons in each hidden layer
g_n_neurons = [256, 512, 1024]
d_n_neurons = [256]

Next, define the generator network:

# define generator
g_model = Sequential()
g_model.add(Dense(units=g_n_neurons[0],
                  input_shape=(n_z,),
                  name='g_0'))
g_model.add(LeakyReLU())
for i in range(1, g_n_layers):
    g_model.add(Dense(units=g_n_neurons[i],
                      name='g_{}'.format(i)))
    g_model.add(LeakyReLU())
g_model.add(Dense(units=n_x, activation='tanh', name='g_out'))
print('Generator:')
g_model.summary()
g_model.compile(loss='binary_crossentropy',
                optimizer=keras.optimizers.Adam(lr=g_learning_rate))

In the Keras example, we do not define two discriminator networks as we did in the TensorFlow example. Instead, we define one discriminator network and then stitch the generator and discriminator together into the GAN network. The GAN network is then used to train the generator parameters only, and the discriminator network is used to train the discriminator parameters:

# define discriminator
d_model = Sequential()
d_model.add(Dense(units=d_n_neurons[0],
                  input_shape=(n_x,),
                  name='d_0'))
d_model.add(LeakyReLU())
d_model.add(Dropout(0.3))
for i in range(1, d_n_layers):
    d_model.add(Dense(units=d_n_neurons[i],
                      name='d_{}'.format(i)))
    d_model.add(LeakyReLU())
    d_model.add(Dropout(0.3))
d_model.add(Dense(units=1, activation='sigmoid', name='d_out'))
print('Discriminator:')
d_model.summary()
d_model.compile(loss='binary_crossentropy',
                optimizer=keras.optimizers.SGD(lr=d_learning_rate))

This is what the discriminator model looks like:

Discriminator:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
d_0 (Dense)                  (None, 256)               200960
_________________________________________________________________
leaky_re_lu_4 (LeakyReLU)    (None, 256)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0
_________________________________________________________________
d_out (Dense)                (None, 1)                 257
=================================================================
Total params: 201,217
Trainable params: 201,217
Non-trainable params: 0
_________________________________________________________________

Next, define the GAN network, and turn the trainable property of the discriminator model to False, since the GAN would only be used to train the generator:

# define GAN network
d_model.trainable = False
z_in = Input(shape=(n_z,), name='z_in')
x_in = g_model(z_in)
gan_out = d_model(x_in)
gan_model = Model(inputs=z_in, outputs=gan_out, name='gan')
print('GAN:')
gan_model.summary()
gan_model.compile(loss='binary_crossentropy',
                  optimizer=keras.optimizers.Adam(lr=g_learning_rate))

This is what the GAN model looks like:

GAN:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
z_in (InputLayer)            (None, 256)               0
_________________________________________________________________
sequential_1 (Sequential)    (None, 784)               1526288
_________________________________________________________________
sequential_2 (Sequential)    (None, 1)                 201217
=================================================================
Total params: 1,727,505
Trainable params: 1,526,288
Non-trainable params: 201,217
_________________________________________________________________

Great, now that we have defined the three models, we have to train them. The training is done as per the following algorithm:

For each epoch:
    For each batch:
        get real images x_batch
        generate noise z_batch
        generate images g_batch using generator model
        combine g_batch and x_batch into x_in and create labels y_out
        set discriminator model as trainable
        train discriminator using x_in and y_out
        generate noise z_batch
        set x_in = z_batch and labels y_out = 1
        set discriminator model as non-trainable
        train gan model using x_in and y_out (effectively training the generator model)

For setting the labels, we apply the values 0.9 and 0.1 for real and fake images respectively. Generally, it is suggested that you use label smoothing by picking a random value from 0.0 to 0.3 for fake data and from 0.8 to 1.0 for real data. Here is the complete code for training from the notebook:

n_epochs = 400
batch_size = 100
n_batches = int(mnist.train.num_examples / batch_size)
n_epochs_print = 50

for epoch in range(n_epochs + 1):
    epoch_d_loss = 0.0
    epoch_g_loss = 0.0
    for batch in range(n_batches):
        x_batch, _ = mnist.train.next_batch(batch_size)
        x_batch = norm(x_batch)
        z_batch = np.random.uniform(-1.0, 1.0,
                                    size=[batch_size, n_z])
        g_batch = g_model.predict(z_batch)
        x_in = np.concatenate([x_batch, g_batch])
        y_out = np.ones(batch_size * 2)
        y_out[:batch_size] = 0.9
        y_out[batch_size:] = 0.1
        d_model.trainable = True
        batch_d_loss = d_model.train_on_batch(x_in, y_out)
        z_batch = np.random.uniform(-1.0, 1.0,
                                    size=[batch_size, n_z])
        x_in = z_batch
        y_out = np.ones(batch_size)
        d_model.trainable = False
        batch_g_loss = gan_model.train_on_batch(x_in, y_out)
        epoch_d_loss += batch_d_loss
        epoch_g_loss += batch_g_loss
    if epoch % n_epochs_print == 0:
        average_d_loss = epoch_d_loss / n_batches
        average_g_loss = epoch_g_loss / n_batches
        print('epoch: {0:04d} d_loss = {1:0.6f} g_loss = {2:0.6f}'
              .format(epoch, average_d_loss, average_g_loss))
        # predict images using generator model trained
        x_pred = g_model.predict(z_test)
        display_images(x_pred.reshape(-1, pixel_size, pixel_size))

We printed the results every 50 epochs, up to 350 epochs. The model slowly learns to generate good-quality images of handwritten digits from the random noise. There are so many variations of GANs that it would take another book to cover all the different kinds; however, the implementation techniques are almost similar to what we have shown here.

Deep Convolutional GAN with TensorFlow and Keras

In DCGAN, both the discriminator and generator are implemented using a deep convolutional network:
In this example, we decided to implement the generator as the following network:

Generator:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
g_in (Dense)                 (None, 3200)              822400
_________________________________________________________________
g_in_act (Activation)        (None, 3200)              0
_________________________________________________________________
g_in_reshape (Reshape)       (None, 5, 5, 128)         0
_________________________________________________________________
g_0_up2d (UpSampling2D)      (None, 10, 10, 128)       0
_________________________________________________________________
g_0_conv2d (Conv2D)          (None, 10, 10, 64)        204864
_________________________________________________________________
g_0_act (Activation)         (None, 10, 10, 64)        0
_________________________________________________________________
g_1_up2d (UpSampling2D)      (None, 20, 20, 64)        0
_________________________________________________________________
g_1_conv2d (Conv2D)          (None, 20, 20, 32)        51232
_________________________________________________________________
g_1_act (Activation)         (None, 20, 20, 32)        0
_________________________________________________________________
g_2_up2d (UpSampling2D)      (None, 40, 40, 32)        0
_________________________________________________________________
g_2_conv2d (Conv2D)          (None, 40, 40, 16)        12816
_________________________________________________________________
g_2_act (Activation)         (None, 40, 40, 16)        0
_________________________________________________________________
g_out_flatten (Flatten)      (None, 25600)             0
_________________________________________________________________
g_out (Dense)                (None, 784)               20071184
=================================================================
Total params: 21,162,496
Trainable params: 21,162,496
Non-trainable params: 0

The generator is a stronger network, having three convolutional layers followed by tanh activation.
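The book presents this architecture via its summary rather than code, so as a guide, here is one plausible Keras definition that reproduces the layer names, output shapes, and parameter counts above. The 5 x 5 kernel size is inferred from the parameter counts, and the tanh activations follow the description in the text; treat this as a sketch under those assumptions, not the book's exact code:

from keras.models import Sequential
from keras.layers import Dense, Activation, Reshape, Flatten
from keras.layers import UpSampling2D, Conv2D

n_z = 256   # noise vector size, per the GAN summary shown below
n_x = 784   # 28 x 28 output pixels

g_model = Sequential()
g_model.add(Dense(3200, input_shape=(n_z,), name='g_in'))
g_model.add(Activation('tanh', name='g_in_act'))
g_model.add(Reshape((5, 5, 128), name='g_in_reshape'))
# three upsample + convolution blocks: 5x5 -> 10x10 -> 20x20 -> 40x40
for i, filters in enumerate([64, 32, 16]):
    g_model.add(UpSampling2D(size=(2, 2), name='g_{}_up2d'.format(i)))
    g_model.add(Conv2D(filters, kernel_size=(5, 5), padding='same',
                       name='g_{}_conv2d'.format(i)))
    g_model.add(Activation('tanh', name='g_{}_act'.format(i)))
g_model.add(Flatten(name='g_out_flatten'))
g_model.add(Dense(n_x, activation='tanh', name='g_out'))
g_model.summary()   # should match the summary printed above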
We define the discriminator network as follows:

Discriminator:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
d_0_reshape (Reshape)        (None, 28, 28, 1)         0
_________________________________________________________________
d_0_conv2d (Conv2D)          (None, 28, 28, 64)        1664
_________________________________________________________________
d_0_act (Activation)         (None, 28, 28, 64)        0
_________________________________________________________________
d_0_maxpool (MaxPooling2D)   (None, 14, 14, 64)        0
_________________________________________________________________
d_out_flatten (Flatten)      (None, 12544)             0
_________________________________________________________________
d_out (Dense)                (None, 1)                 12545
=================================================================
Total params: 14,209
Trainable params: 14,209
Non-trainable params: 0
_________________________________________________________________

The GAN network is composed of the discriminator and generator as demonstrated previously:

GAN:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
z_in (InputLayer)            (None, 256)               0
_________________________________________________________________
g (Sequential)               (None, 784)               21162496
_________________________________________________________________
d (Sequential)               (None, 1)                 14209
=================================================================
Total params: 21,176,705
Trainable params: 21,162,496
Non-trainable params: 14,209
_________________________________________________________________

When we run this model for 400 epochs, the DCGAN is able to generate high-quality digits starting from epoch 100 itself. DCGANs have been used for style transfer, generation of images and titles, and for image algebra, namely taking parts of one image and adding them to parts of another image. We built a simple GAN in TensorFlow and Keras and applied it to generate images from the MNIST dataset. We also built a DCGAN where the generator and discriminator consisted of convolutional networks. Do check out the book Mastering TensorFlow 1.x to explore advanced features of TensorFlow 1.x and obtain in-depth knowledge of TensorFlow for solving artificial intelligence problems. 5 reasons to learn Generative Adversarial Networks (GANs) in 2018 Implementing a simple Generative Adversarial Network (GANs) Getting to know Generative Models and their types
How to optimize MySQL 8 servers and clients

Amey Varangaonkar
28 May 2018
11 min read
Our article focuses on optimization for MySQL 8 database servers and clients. We start with optimizing the server, followed by optimizing MySQL 8 client-side entities. The material is most relevant to database administrators who need to ensure performance and scalability across multiple servers, but it will also help developers prepare scripts (including setting up the database) and help users who run MySQL for development and testing to maximize their productivity. [box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator's Guide, written by Chintan Mehta, Ankit Bhavsar, Hetal Oza and Subhash Shah. In this book, the authors have presented hands-on techniques for tackling the common and not-so-common issues that come up in the different administration-related tasks in MySQL 8.[/box]

Optimizing disk I/O

There are quite a few ways to configure storage devices to devote more and faster storage hardware to the database server. A major performance bottleneck is disk seeking (finding the correct place on the disk to read or write content). When the amount of data grows large enough to make caching impossible, the problem with disk seeks becomes apparent. We need at least one disk seek operation to read, and several disk seek operations to write, in large databases where data access is done more or less randomly. We should regulate or minimize the disk seek times using appropriate disks. In order to resolve the disk seek performance issue, we can increase the number of available disk spindles, symlink files to different disks, or stripe disks. The following are the details:

Using symbolic links: When using symbolic links, we can create Unix symbolic links for index and data files. The symlink points from default locations in the data directory to another disk in the case of MyISAM tables. These links may also be striped. This improves the seek and read times. The assumption is that the disk is not used concurrently for other purposes. Symbolic links are not supported for InnoDB tables; however, we can place InnoDB data and log files on different physical disks.

Striping: In striping, we have many disks. We put the first block on the first disk, the second block on the second disk, and so on; the Nth block goes on the (N % number_of_disks)th disk. If the stripe size is perfectly aligned, the normal data size will be less than the stripe size, which helps to improve performance. Striping is dependent on the stripe size and the operating system. In an ideal case, we would benchmark the application with different stripe sizes. The speed difference while striping depends on the parameters we have used, such as the stripe size, and on the number of disks. We have to choose whether we want to optimize for random access or sequential access.

To gain reliability, we may decide to set up with striping and mirroring (RAID 0+1). RAID stands for Redundant Array of Independent Drives. This approach needs 2 x N drives to hold N drives of data. With good volume management software, we can manage this setup efficiently. There is another approach as well: depending on how critical the type of data is, we may vary the RAID level. For example, we can store really important data, such as host information and logs, on a RAID 0+1 or RAID N disk, whereas we can store semi-important data on a RAID 0 disk. In the case of RAID, parity bits are used to ensure the integrity of the data stored on each drive, so RAID N becomes a problem if we have too many write operations to be performed; the time required to update the parity bits in this case is high.

If it is not important to record when a file was last accessed, we can mount the file system with the -o noatime option. This option skips the last-access updates on the file system, which reduces the disk seek time. We can also make the file system update asynchronously; depending on whether the file system supports it, we can set the -o async option.

Using Network File System (NFS) with MySQL

While using a Network File System (NFS), varying issues may occur, depending on the operating system and the NFS version. The following are the details:

Data inconsistency is one issue with an NFS system. It may occur because of messages received out of order or lost network traffic. We can use TCP with the hard and intr mount options to avoid these issues.

MySQL data and log files may get locked and become unavailable for use if placed on NFS drives. If multiple instances of MySQL access the same data directory, it may result in locking issues. Improper shutdown of MySQL or a power outage are other causes of file system locking issues. The latest version of NFS supports advisory and lease-based locking, which helps in addressing the locking issues. Still, it is not recommended to share a data directory among multiple MySQL instances.

Maximum file size limitations must be understood to avoid any issues. With NFS 2, only the lower 2 GB of a file is accessible by clients. NFS 3 clients support larger files. The maximum file size depends on the local file system of the NFS server.

Optimizing the use of memory

In order to improve the performance of database operations, MySQL allocates buffers and caches memory. As a default, the MySQL server starts on a virtual machine (VM) with 512 MB of RAM. We can modify the default configuration for MySQL to run on limited memory systems. The following list describes the ways to optimize MySQL memory:

The memory area which holds cached InnoDB data for tables, indexes, and other auxiliary buffers is known as the InnoDB buffer pool. The buffer pool is divided into pages, and the pages hold multiple rows. The buffer pool is implemented as a linked list of pages for efficient cache management, and rarely used data is removed from the cache using an algorithm. Buffer pool size is an important factor for system performance. The innodb_buffer_pool_size system variable defines the buffer pool size. InnoDB allocates the entire buffer pool size at server startup. 50 to 75 percent of system memory is recommended for the buffer pool size.

With MyISAM, all threads share the key buffer. The key_buffer_size system variable defines the size of the key buffer. The index file is opened once for each MyISAM table opened by the server. For each concurrent thread that accesses the table, the data file is opened once. A table structure, column structures for each column, and a 3 x N sized buffer are allocated for each concurrent thread. The MyISAM storage engine maintains an extra row buffer for internal use.

The optimizer estimates the reading of multiple rows by scanning. The storage engine interface enables the optimizer to provide information about the recorded buffer size, and the size of the buffer can vary depending on the size of the estimate. In order to take advantage of row pre-fetching, InnoDB uses a variable-size buffering capability. This reduces the overhead of latching and B-tree navigation.
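To make the buffer sizing guidance above concrete, here is a minimal sketch of how these variables might be set in a my.cnf (or my.ini) configuration file. The values are illustrative assumptions for a dedicated server with 8 GB of RAM, not recommendations for any particular workload:

[mysqld]
# InnoDB buffer pool: 50 to 75 percent of system RAM on a dedicated server
innodb_buffer_pool_size = 6G
# MyISAM key buffer (only significant if MyISAM tables are in use)
key_buffer_size = 256M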
Memory mapping can be enabled for all MyISAM tables by setting the myisam_use_mmap system variable to 1. The size of an in-memory temporary table can be defined by the tmp_table_size system variable, and the maximum size of a heap table can be defined using the max_heap_table_size system variable. If an in-memory table becomes too large, MySQL automatically converts it from in-memory to on-disk; the storage engine for an on-disk temporary table is defined by the internal_tmp_disk_storage_engine system variable.

MySQL comes with the MySQL performance schema, a feature to monitor MySQL execution at low levels. The performance schema dynamically allocates memory by scaling its memory use to the actual server load, instead of allocating memory upon server startup. The memory, once allocated, is not freed until the server is restarted.

Thread-specific space is required for each thread that the server uses to manage client connections. The stack size is governed by the thread_stack system variable. The connection buffer and the result buffer are each governed by the net_buffer_length system variable: both start at net_buffer_length bytes, but are enlarged up to max_allowed_packet bytes as needed. All threads share the same base memory. All join clauses are executed in a single pass, and most joins can be executed without a temporary table. Temporary tables are memory-based hash tables; temporary tables that contain BLOB data and tables with large row lengths are stored on disk. A read buffer is allocated for each request that performs a sequential scan on a table, and its size is determined by the read_buffer_size system variable.

MySQL closes all tables that are not in use at once when the FLUSH TABLES statement or mysqladmin flush-tables command is executed. It marks all in-use tables to be closed when the current thread execution finishes, which frees in-use memory. FLUSH TABLES returns only after all tables have been closed.

It is possible to monitor the MySQL performance schema and sys schema for memory usage. Before we can execute commands for this, we have to enable memory instruments on the MySQL performance schema. This can be done by updating the ENABLED column of the performance schema setup_instruments table. The following is the query to view the available memory instruments in MySQL:

mysql> SELECT * FROM performance_schema.setup_instruments
       WHERE NAME LIKE '%memory%';

This query will return hundreds of memory instruments. We can narrow it down by specifying a code area. The following is an example that limits results to InnoDB memory instruments:

mysql> SELECT * FROM performance_schema.setup_instruments
       WHERE NAME LIKE '%memory/innodb%';

The following is the configuration to enable memory instruments:

performance-schema-instrument='memory/%=COUNTED'

The following is an example of querying memory instrument data in the memory_summary_global_by_event_name table in the performance schema:

mysql> SELECT * FROM performance_schema.memory_summary_global_by_event_name
       WHERE EVENT_NAME LIKE 'memory/innodb/buf_buf_pool'\G

EVENT_NAME: memory/innodb/buf_buf_pool
COUNT_ALLOC: 1
COUNT_FREE: 0
SUM_NUMBER_OF_BYTES_ALLOC: 137428992
SUM_NUMBER_OF_BYTES_FREE: 0
LOW_COUNT_USED: 0
CURRENT_COUNT_USED: 1
HIGH_COUNT_USED: 1
LOW_NUMBER_OF_BYTES_USED: 0
CURRENT_NUMBER_OF_BYTES_USED: 137428992
HIGH_NUMBER_OF_BYTES_USED: 137428992

It summarizes data by EVENT_NAME.

The following is an example of querying the sys schema to aggregate currently allocated memory by code area:

mysql> SELECT SUBSTRING_INDEX(event_name, '/', 2) AS code_area,
       sys.format_bytes(SUM(current_alloc)) AS current_alloc
       FROM sys.x$memory_global_by_current_bytes
       GROUP BY SUBSTRING_INDEX(event_name, '/', 2)
       ORDER BY SUM(current_alloc) DESC;

Performance benchmarking

We must consider the following factors when measuring performance:

While measuring the speed of a single operation or a set of operations, it is important to simulate a heavy database workload for benchmarking.

Test results may differ in different environments.

Depending on the workload, certain MySQL features may not help with performance.

MySQL 8 supports measuring the performance of individual statements. If we want to measure the speed of any SQL expression or function, the BENCHMARK() function is used. The following is the syntax of the function:

BENCHMARK(loop_count, expression)

The output of the BENCHMARK function is always zero; the speed can be measured from the elapsed time printed by MySQL in the output. The following is an example:

mysql> select benchmark(1000000, 1+1);

From the preceding example, we can find that the time taken to calculate 1+1 a million times is about 0.15 seconds. Other aspects involved in optimizing MySQL servers and clients include optimizing locking operations, examining thread information, and more. To know more about these techniques, you may check out the book MySQL 8 Administrator's Guide. SQL Server recovery models to effectively backup and restore your database Get SQL Server user management right 4 Encryption options for your SQL Server
How to use M functions within Microsoft Power BI for querying data

Amarabha Banerjee
21 May 2018
10 min read
Microsoft Power BI Desktop contains a rich set of data source connectors and transformation capabilities that support the integration and enhancement of source data. These features are all driven by a powerful functional language and query engine, M, which leverages source system resources when possible and can greatly extend the scope and robustness of the data retrieval process beyond the possibilities of the standard query editor interface alone. As with almost all BI projects, the design and development of the data access and retrieval process has great implications for the analytical value, scalability, and sustainability of the overall Power BI solution. [box type="note" align="" class="" width=""]Our article is an excerpt from the book Microsoft Power BI Cookbook, written by Brett Powell. This book shows how to leverage Microsoft Power BI and the development tools to create better data driven analytics and visualizations.[/box]

In this article, we dive into Power BI Desktop's Get Data experience and go through the process of establishing and managing data source connections and queries. Examples are provided of using the Query Editor interface and the M language directly to construct and refine queries to meet common data transformation and cleansing needs. In practice, and as per the examples, a combination of both tools is recommended to aid the query development process.

Viewing and analyzing M functions

Every time you click on a button to connect to any of Power BI Desktop's supported data sources or apply any transformation to a data source object, such as changing a column's data type, one or multiple M expressions are created reflecting your choices. These M expressions are automatically written to dedicated M documents and, if saved, are stored within the Power BI Desktop file as Queries. M is a functional programming language like F#, and it's important that Power BI developers become familiar with analyzing, and later writing and enhancing, the M code that supports their queries.

Getting ready

Build a query through the user interface that connects to the AdventureWorksDW2016CTP3 SQL Server database on the ATLAS server and retrieves the DimGeography table, filtered by United States for English:

Click on Get Data from the Home tab of the ribbon, select SQL Server from the list of database sources, and provide the server and database names. For the Data Connectivity mode, select Import. A navigation window will appear, with the different objects and schemas of the database. Select the DimGeography table from the Navigation window and click on Edit.

In the Query Editor window, select the EnglishCountryRegionName column and then filter on United States from its dropdown.

Figure 2: Filtering for United States only in the Query Editor

At this point, a preview of the filtered table is exposed in the Query Editor and the Query Settings pane displays the previous steps.

Figure 3: The Query Settings pane in the Query Editor

How to do it

Formula Bar

With the Formula Bar visible in the Query Editor, click on the Source step under Applied Steps in the Query Settings pane.
You should see the following formula expression:

Figure 4: The SQL.Database() function created for the Source step

Click on the Navigation step to expose the following expression:

Figure 5: The metadata record created for the Navigation step

The navigation expression (2) references the source expression (1). The Formula Bar in the Query Editor displays individual query steps, which are technically individual M expressions. It's convenient, and very often essential, to view and edit all the expressions in a centralized window, and for this there's the Advanced Editor. M is a functional language, and it can be useful to think of query evaluation in M as similar to Excel spreadsheet formulas, in which multiple formulas can reference each other. The M engine can determine which expressions are required by the final expression to return, and evaluate only those expressions.

When configuring Power BI development tools, the display setting for both the Query Settings pane and the Formula bar should be enabled under GLOBAL | Query Editor options.

Figure 6: Global layout options for the Query Editor

Alternatively, on a per-file basis, you can control these settings and others from the View tab of the Query Editor toolbar.

Figure 7: Property settings of the View tab in the Query Editor

Advanced Editor window

Given its importance to the query development process, the Advanced Editor dialog is exposed on both the Home and View tabs of the Query Editor. It's recommended to use the Query Editor when getting started with a new query and when learning the M language. After several steps have been applied, use the Advanced Editor to review and optionally enhance or customize the M query. As M is a rich, functional programming language, there are many M functions and optional parameters not exposed via the Query Editor; going beyond the limits of the Query Editor enables more robust data retrieval and integration processes.

Figure 8: The Home tab of the Query Editor

Click on Advanced Editor from either the View or Home tabs (Figure 8 and Figure 9, respectively). All M function expressions and any comments are exposed.

Figure 9: The Advanced Editor view of the DimGeography query

When developing retrieval processes for Power BI models, consider these common ETL questions:

How are our queries impacting the source systems?

Can we make our retrieval queries more resilient to changes in source data, such that they avoid failure?

Is our retrieval process efficient and simple to follow and support, or are there unnecessary steps and queries?

Are our retrieval queries delivering sufficient performance to the BI application?

Is our process flexible, such that we can quickly apply changes to data sources and logic?

M queries are not intended as a substitute for the workloads typically handled by enterprise ETL tools such as SSIS or Informatica. However, just as BI professionals would carefully review the logic and test the performance of SQL stored procedures and ETL packages supporting their cubes and reports environment, they should also review the M queries created to support Power BI models and reports.

How it works

Two of the top performance and scalability features of M's engine are Query Folding and Lazy Evaluation. If possible, the M queries developed in Power BI Desktop are converted (folded) into SQL statements and passed to source systems for processing. M can also reduce the required resources for a given query by ignoring any unnecessary or redundant steps (variables). M is a case-sensitive language.
This includes referencing variables in M expressions (RenameColumns versus Renamecolumns) as well as the values in M queries. For example, the values "Apple" and "apple" are considered unique values in an M query; the Table.Distinct() function will not remove rows for one of the values. Variable names in M expressions cannot have spaces without a hash sign and double quotes. Per Figure 10, when the Query Editor graphical interface is used to create M queries, this syntax is applied automatically, along with a name describing the M transformation applied. Applying short, descriptive variable names (with no spaces) improves the readability of M queries.

Query folding

The query from this recipe was "folded" into the following SQL statement and sent to the ATLAS server for processing:

Figure 10: The SQL statement generated from the DimGeography M query

Right-click on the Filtered Rows step and select View Native Query to access the Native Query window shown in Figure 11:

Figure 11: View Native Query in Query Settings

Finding and revising queries that are not being folded to source systems is a top technique for enhancing large Power BI datasets. See the Pushing Query Processing Back to Source Systems recipe of Chapter 11, Enhancing and Optimizing Existing Power BI Solutions, for an example of this process.

M query structure

The great majority of queries created for Power BI will follow the let...in structure as per this recipe, as they contain multiple steps with dependencies among them. Individual expressions are separated by commas. The expression referred to following the in keyword is the expression returned by the query. The individual step expressions are technically "variables", and if the identifiers for these variables (the names of the query steps) contain spaces, then the step is placed in quotes and prefixed with a # sign, as per the Filtered Rows step in Figure 10.

Lazy evaluation

The M engine also has powerful "lazy evaluation" logic for ignoring any redundant or unnecessary variables, as well as short-circuiting evaluation (computation) once a result is determinate, such as when one side (operand) of an OR logical operator is computed as True. The order of evaluation of the expressions is determined at runtime; it doesn't have to be sequential from top to bottom. In the following example, a step for retrieving Canada was added and the step for the United States was ignored. Since the CanadaOnly variable satisfies the overall let expression of the query, only the Canada query is issued to the server, as if the United States row were commented out or didn't exist.

Figure 12: Revised query that ignores the Filtered Rows step to evaluate Canada only

View Native Query (Figure 12) is not available given this revision, but a SQL Profiler trace against the source database server (and a refresh of the M query) confirms that CanadaOnly was the only SQL query passed to the source database.
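The book shows the revised query only as a screenshot (Figure 12). As a rough sketch of what such a query might look like in the Advanced Editor, assuming the server, database, and step names from this recipe, the let...in structure and the ignored USOnly step could read as follows; this is an illustration, not the book's exact code:

let
    Source = Sql.Database("ATLAS", "AdventureWorksDW2016CTP3"),
    DimGeography = Source{[Schema = "dbo", Item = "DimGeography"]}[Data],
    // this step is never referenced below, so lazy evaluation ignores it
    USOnly = Table.SelectRows(DimGeography,
        each [EnglishCountryRegionName] = "United States"),
    CanadaOnly = Table.SelectRows(DimGeography,
        each [EnglishCountryRegionName] = "Canada")
in
    CanadaOnly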
Figure 13: Capturing the SQL statement passed to the server via a SQL Server Profiler trace

There's more

Partial query folding

A query can be "partially folded", in which a SQL statement is created resolving only part of an overall query. The results of this SQL statement are returned to Power BI Desktop (or the on-premises data gateway), and the remaining logic is computed using M's in-memory engine with local resources. M queries can be designed to maximize the use of source system resources by using standard expressions supported by query folding early in the query process; minimizing the use of local or on-premises data gateway resources is a top consideration.

Limitations of query folding

No folding will take place once a native SQL query has been passed to the source system, for example, by passing a SQL query directly through the Get Data dialog. The following query, specified in the Get Data dialog, is included in the Source step:

Figure 14: Providing a user-defined native SQL query

Any transformations applied after this native query will use local system resources. Therefore, the general implication for query development with native or user-defined SQL queries is that if they're used, try to include all required transformations (that is, joins and derived columns), or use them to utilize an important feature of the source database not being used by the folded query, such as an index. Also note the following:

Not all data sources support query folding, such as text and Excel files.

Not all transformations available in the Query Editor or via M functions directly are supported by some data sources.

The privacy levels defined for the data sources will also impact whether folding is used or not.

SQL statements are not parsed before they're sent to the source system.

The Table.Buffer() function can be used to avoid query folding: the table output of this function is loaded into local memory, and transformations against it will remain local.

We have discussed effective techniques for accessing and retrieving data using Microsoft Power BI. Do check out this book Microsoft Power BI Cookbook for more information on using Microsoft Power BI for data analysis and visualization. Expert Interview: Unlocking the secrets of Microsoft Power BI Tutorial: Building a Microsoft Power BI Data Model Expert Insights: Ride the third wave of BI with Microsoft Power BI
Getting started with Google Data Studio: An intuitive tool for visualizing BigQuery Data

Sugandha Lahoti
16 May 2018
8 min read
Google Data Studio is one of the most popular tools for visualizing data. It can be used to pull data directly out of Google's suite of marketing tools, including Google Analytics, Google AdWords, and Google Search Console. It also supports connectors for database tools such as PostgreSQL and BigQuery, and it can be accessed at datastudio.google.com. In this article, we will learn to visualize BigQuery data with Google Data Studio. [box type="note" align="" class="" width=""]This article is an excerpt from the book, Learning Google BigQuery, written by Thirukkumaran Haridass and Eric Brown. This book will serve as a comprehensive guide to mastering BigQuery, and utilizing it to get useful insights from your Big Data.[/box]

The following steps explain how to get started in Google Data Studio and access BigQuery data from Data Studio:

Setting up an account: Account setup is extremely easy for Data Studio. Any user with a Google account is eligible to use all Data Studio features for free.

Accessing BigQuery data: Once logged in, the next step is to connect to BigQuery. This can be done by clicking on the DATA SOURCES button on the left-hand-side navigation. You'll be prompted to create a data source by clicking on the large plus sign to the bottom-right of the screen. On the right-hand-side navigation, you'll get a list of all of the connectors available to you. Select BigQuery. At this point, you'll be prompted to select from your projects, shared projects, a custom query, or public datasets. Since you are querying the Google Analytics BigQuery Export test data, select Custom Query. Select the project you would like to use. In the Enter Custom Query prompt, add this query and click on the Connect button on the top right:

SELECT trafficsource.medium as Medium,
  COUNT(visitId) as Visits
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
GROUP BY Medium

This query will pull the count of sessions by traffic source medium for the Google Analytics account that has been exported. The next screen shows the schema of the data source you have created. Here, you can make changes to each field of your data, such as changing text fields to date fields or creating calculated metrics. Click on Create Report, then click on Add to Report. At this point, you will land on your report dashboard. Here, you can begin to create charts using the data you've just pulled from BigQuery. Icons for all the chart types available are shown near the top of the page. Hover over the chart types and click on the chart labeled Bar Chart; then, in the grid, hold your right-click button to draw a rectangle. A bar chart should appear, with the Traffic Source Medium and Visit data from the query you ran. A properties prompt should also show on the right-hand side of the page. Here, a number of properties can be selected for your chart, including the dimension, metric, and many style settings.

Once you've completed your first chart, more charts can be added to a single page to show other metrics if needed. For many situations, a single bar graph will answer the question at hand. Some situations may require more exploration, such as when an analyst wants to know whether the visit metric influences other metrics, such as the number of transactions. A scatterplot with visits on the x axis and transactions on the y axis can be used to easily visualize this relationship.
Making a scatterplot in Data Studio

The following steps show how to make a scatterplot in Data Studio with the data from BigQuery. Update the original query by adding the transaction metric:

In the edit screen of your report, click on the bar chart to bring up the chart options on the right-hand-side navigation. Click on the pencil icon next to the data source titled BigQuery to edit the data source. Click on the left-hand-side arrow icon titled Edit Connection.

In the dialog titled Enter Custom Query, add this query:

SELECT trafficsource.medium as Medium,
  COUNT(visitId) as Visits,
  SUM(totals.transactions) AS Transactions
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
GROUP BY Medium

Click on the button titled Reconnect in order to reprocess the query. A prompt should emerge, asking whether you'd like to add a new field titled Transactions. Click on Apply, then click on Done. Once you return to the report edit screen, click on the Scatter Chart button and use your mouse to draw a square in the report space. The report should autoselect the two metrics you've created. Click on the chart to bring up the chart edit screen on the right-hand-side navigation, then click on the Style tab. Click on the dropdown under the Trendline option and select Linear to add a linear trend line, also known as a linear regression line. The graph will default to blue, so use the pencil icon on the right to select red as the line color.

Making a map in Data Studio

Data Studio includes a map chart type that can be used to create simple maps. In order to create maps, a map dimension will need to be included in your data, along with a metric. Here, we will use the Google BigQuery public dataset for Medicare data. You'll need to create a new data source:

Accessing BigQuery data: Once logged in, the next step is to connect to BigQuery. This can be done by clicking on the DATA SOURCES button on the left-hand-side navigation. You'll be prompted to create a data source by clicking on the large plus sign to the bottom-right of the screen. On the right-hand-side navigation, you'll get a list of all of the connectors available to you. Select BigQuery. At this point, you'll be prompted to select from your projects, shared projects, a custom query, or public datasets. Since you will write your own SQL against the Medicare public dataset, select Custom Query. Select the project you would like to use. In the Enter Custom Query prompt, add this query and click on the Connect button on the top right:

SELECT CONCAT(provider_city, ", ", provider_state) city,
  AVG(average_estimated_submitted_charges) avg_sub_charges
FROM `bigquery-public-data.medicare.outpatient_charges_2014`
WHERE apc = '0267 - Level III Diagnostic and Screening Ultrasound'
GROUP BY 1
ORDER BY 2 DESC

This query will pull the average of submitted charges for diagnostic ultrasounds by city in the United States. This is the most submitted charge in the 2014 Medicare data. The next screen shows the schema of the data source you have created. Here, you can make changes to each field of your data, such as changing text fields to date fields or creating calculated metrics. Click on Create Report, then click on Add to Report. At this point, you will land on your report dashboard. Here, you can begin to create charts using the data you've just pulled from BigQuery. Icons for all the chart types available are shown near the top of the page.
Hover over the chart types and click on the chart labeled Map Chart; then, in the grid, hold your right-click button to draw a rectangle. Click on the chart to bring up the Dimension Picker on the right-hand-side navigation, and click on Create New Dimension. Right-click on the City dimension and select the Geo type and City subtype. Here, we can also choose other subtypes (Latitude, Longitude, Metro, Country, and so on). Data Studio will plot the top 500 rows of data (in this case, the top 500 cities in the results set). Hovering over each city brings up detailed data.

Data Studio can also be used to roll up geographic data. In this case, we'll roll city data up to state data. From the edit screen, click on the map to bring up the Dimension Picker and click on Create New Dimension in the right-hand-side navigation. Right-click on the City dimension and select the Geo type and Region subtype. Google uses the term Region to signify states. Once completed, the map will be rolled up to the state level instead of the city level. This functionality is very handy when data has not been rolled up prior to being inserted into BigQuery.

Other features of Data Studio

Filtering: Filtering can be added to your visualizations based on dimensions or metrics, as long as the data is available in the data source.

Data joins: Data from multiple sources can be joined to create new, calculated metrics.

Turnkey integrations with many Google Marketing Suite tools, such as AdWords and Search Console.

We explored various features of Google Data Studio and learnt to use them for visualizing BigQuery data. To know about other third-party tools for reporting and visualization, such as R and Tableau, check out the book Learning Google BigQuery. Getting Started with Data Storytelling What is Seaborn and why should you use it for data visualization? Pandas is an effective tool to explore and analyze data - Interview Insights