How-To Tutorials - Data

1204 Articles

Machine learning ethics: what you need to know and what you can do

Richard Gall
23 Sep 2019
10 min read
Ethics is, without a doubt, one of the most important topics to emerge in machine learning and artificial intelligence over the last year. While the reasons for this are complex, it nevertheless underlines that the area has reached technological maturity. After all, if artificial intelligence systems weren't having a real, demonstrable impact on wider society, why would anyone be worried about their ethical implications? It's easy to dismiss the debate around machine learning and artificial intelligence as abstract and irrelevant to engineers' and developers' immediate practical concerns. However, this is wrong. Ethics needs to be seen as an important practical consideration for anyone using and building machine learning systems. If we fail to do so, the consequences could be serious.

The last 12 months have been packed with stories of artificial intelligence not only showing theoretical bias, but also causing discriminatory outcomes in the real world. Amazon scrapped its AI tool for hiring last October because it showed significant bias against female job applicants. Even more recently, last month it emerged that algorithms built to detect hate speech online have in-built biases against black people. Although these might seem like edge cases, it's vital that everyone in the industry takes responsibility. This isn't something we can leave up to regulation or other organizations - the people who can really effect change are the developers and engineers on the ground.

It's true that many machine learning and artificial intelligence systems will be operating in ways where ethics isn't really an issue - and that's fine. But by focusing on machine learning ethics, and thinking carefully about the impact of your work, you will ultimately end up building better systems that are more robust and have better outcomes. So with that in mind, let's look at the practical ways to start thinking about ethics in machine learning and artificial intelligence.

Machine learning ethics and bias

The first step towards thinking seriously about ethics in machine learning is to think about bias. Once you are aware of how bias can creep into machine learning systems, and how that can have ethical implications, it becomes much easier to identify issues and make changes - or, even better, stop them before they arise. Bias isn't strictly an ethical issue. It could be a performance issue that's affecting the effectiveness of your system. But in the conversation around AI and machine learning ethics, it's the most practical way of starting to think seriously about the issue.

Types of machine learning and algorithmic bias

Although there are a range of different types of bias, the best place to begin is with two top-level concepts. You may have read lists of numerous different biases, but for the purpose of talking about ethics there are two important things to think about.

Pre-existing and data set biases

Pre-existing biases are embedded in the data on which we choose to train algorithms. While it's true that just about every data set will be 'biased' in some way (data is a representation, after all - there will always be something 'missing'), the point here is that we need to be aware of the extent of the bias and the potential algorithmic consequences. You might have heard terms like 'sampling bias', 'exclusion bias' and 'prejudice bias' - these aren't radically different. They all result from pre-existing biases about how a data set looks or what it represents.
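A quick way to start probing a data set for pre-existing bias is simply to look at how groups are represented in it and how outcomes differ between them. The following is a minimal sketch, not from the original article, using pandas with an entirely hypothetical data set and column names; in practice you would load your own training data and protected attributes.

import pandas as pd

# Hypothetical training data: a protected attribute and a binary outcome label.
df = pd.DataFrame({
    "gender": ["F", "M", "M", "M", "F", "M", "F", "M"],
    "hired":  [0,   1,   1,   0,   0,   1,   1,   0],
})

# How well is each group represented in the training data?
print(df["gender"].value_counts(normalize=True))

# Does the positive outcome rate differ sharply between groups?
print(df.groupby("gender")["hired"].mean())

Large gaps in either number don't prove the system is unfair, but they are exactly the kind of signal that should trigger a closer look before training.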
Technical and contextual biases

Technical machine learning bias is about how an algorithm is programmed. It refers to the problems that arise when an algorithm is built to operate in a specific way. Essentially, it occurs when the programmed elements of an algorithm fail to properly account for the context in which it is being used. A good example is the plagiarism checker Turnitin - this used an algorithm that was trained to identify strings of text, which meant it would target non-native English speakers over English-speaking ones, who were able to make changes to avoid detection. Although there are, as I've said, many different biases in the field of machine learning, by thinking about the data on which your algorithm is trained and the context in which the system is working, you will be in a much better place to think about the ethical implications of your work. Equally, you will also be building better systems that don't cause unforeseen issues.

Read next: How to learn data science: from data mining to machine learning

The importance of context in machine learning

The most important thing for anyone working in machine learning and artificial intelligence is context. Put another way, you need to have a clear sense of why you are trying to do something and what the possible implications could be. If this is unclear, think about it this way: when you use an algorithm, you're essentially automating away decision making. That's a good thing when you want to make lots of decisions at a huge scale. But the one thing you lose when turning decision making into a mathematical formula is context. The decisions an algorithm makes lack context because it is programmed to react in a very specific way. This means contextual awareness is your problem. That's part of the bargain of using an algorithm.

Context in data collection

Let's look at what thinking about context means when it comes to your data set.

Step 1: what are you trying to achieve?

Essentially, the first thing you'll want to consider is what you're trying to achieve. Do you want to train an algorithm to recognise faces? Do you want it to understand language in some way?

Step 2: why are you doing this?

What's the point of doing what you're doing? Sometimes this might have a straightforward answer, but be cautious if the answer comes too easily. Making something work more efficiently or faster isn't really a satisfactory reason. What's the point of making something more efficient? This is often where you'll start to see ethical issues emerge more clearly. Sometimes they're not easily resolved. You might not even be in a position to resolve them yourself (if you're employed by a company, after all, you're quite literally contracted to perform a specific task). But even if you do feel like there's little room to maneuver, it's important to ensure that these discussions actually take place and that you consider the impact of an algorithm. That will make it easier for you to put safeguarding steps in place.

Step 3: understanding the data set

Think about how your data set fits alongside the what and the why. Is there anything missing? How was the data collected? Could it be biased or skewed in some way? Indeed, it might not even matter. But if it does, it's essential that you pay close attention to the data you're using. It's worth recording any potential limitations or issues, so if a problem arises at a later stage in your machine learning project, the causes are documented and visible to others.
The context of algorithm implementation

The other aspect of thinking about context is to think carefully about how your machine learning or artificial intelligence system is being implemented. Is it working how you thought it would? Is it showing any signs of bias? Many articles about the limitations of artificial intelligence and machine learning ethics cite the case of Microsoft's Tay. Tay was a chatbot that 'learned' from its interactions with users on Twitter. Built with considerable naivety, Tay was turned racist by Twitter users in a matter of days. Users 'spoke' to Tay using racist language, and because Tay learned through interactions with Twitter users, the chatbot quickly became a reflection of the language and attitudes of those around it. This is a good example of how the algorithm's designers didn't consider how the real-world implementation of the algorithm would have a negative consequence. Despite, you'd think, the best of intentions, the developers didn't have the foresight to consider the reality of the world into which they were releasing their algorithmic progeny.

Read next: Data science vs. machine learning: understanding the difference and what it means today

Algorithmic impact assessments

It's true that ethics isn't always going to be an urgent issue for engineers. But in certain domains, it's going to be crucial, particularly in public services and other aspects of government, like justice. Maybe there should be a debate about whether artificial intelligence and machine learning should be used in those contexts at all. But if we can't have that debate, at the very least we can have tools that help us to think about the ethical implications of the machine learning systems we build. This is where Algorithmic Impact Assessments come in. The idea was developed by the AI Now Institute and outlined in a paper published last year, and was recently implemented by the Canadian government. There's no one way to do an algorithmic impact assessment - the Canadian government uses a questionnaire "designed to help you assess and mitigate the risks associated with deploying an automated decision system." This essentially provides a framework for those using and building algorithms to understand the scope of their project and to identify any potential issues or problems that could arise.

Tools for assessing bias and supporting ethical engineering

However, although algorithmic impact assessments can provide you with a solid conceptual grounding for thinking about the ethical implications of artificial intelligence and machine learning systems, there are also a number of tools that can help you better understand the ways in which algorithms could be perpetuating biases or prejudices. One of these is FairML, "an end-to-end toolbox for auditing predictive models by quantifying the relative significance of the model's inputs" - helping engineers to identify the extent to which algorithmic inputs could cause harm or bias - while another is LIME (Local Interpretable Model-Agnostic Explanations). LIME is not dissimilar to FairML. It aims to understand why an algorithm makes the decisions it does by 'perturbing' inputs and seeing how this affects its outputs. There's also Deon, which is a lot like a more lightweight, developer-friendly version of an algorithmic impact assessment. It's a command line tool that allows you to add an ethics checklist to your projects. All these tools underline some of the most important elements in the fight for machine learning ethics.
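To make the LIME idea concrete, here is a minimal sketch of explaining a single prediction of a tabular classifier. It is not from the original article: it assumes the lime and scikit-learn packages are installed, and the data, model, feature names, and class names are all illustrative placeholders.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# Hypothetical training data: 200 rows, 4 numeric features, binary label.
rng = np.random.RandomState(0)
X_train = rng.rand(200, 4)
y_train = rng.randint(0, 2, 200)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=["age", "income", "tenure", "score"],  # illustrative names
    class_names=["reject", "accept"],
    discretize_continuous=True,
)

# Perturb one instance and fit a local surrogate to see which features drove its prediction.
explanation = explainer.explain_instance(X_train[0], model.predict_proba, num_features=4)
print(explanation.as_list())

Each tuple in the output pairs a feature (or a discretized range of it) with its estimated contribution to that particular prediction, which is exactly the kind of evidence you need when asking whether a model is leaning on an attribute it shouldn't.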
FairML and LIME are both attempting to make interpretability easier, while Deon is making it possible for engineers to bring a holistic and critical approach directly into their day-to-day work. It aims to promote transparency and improve communication between engineers and others.

The future of artificial intelligence and machine learning depends on developers taking responsibility

Machine learning and artificial intelligence are hitting maturity. They're technologies that are now, after decades of incubation in computer science departments and military intelligence organizations, transforming and having an impact in a truly impressive range of domains. With this maturity comes more responsibility. Ethical questions arise as machine learning effects change everywhere, spilling out into everything from marketing to justice systems. If we can't get machine learning ethics right, then we'll never properly leverage the benefits of artificial intelligence and machine learning. People won't trust it and legislation will start to severely curb what it can do. It's only by taking responsibility for its effects and consequences that we can be sure it will not only have a transformative impact on the world, but also one that's safe and for the benefit of everyone.


Server Configuration Tuning in PostgreSQL

Packt
18 Oct 2010
10 min read
The main tunable settings for PostgreSQL are in a plain text file named postgresql.conf that's located at the base of the database directory structure. This will often be where $PGDATA is set to on UNIX-like systems, making the file $PGDATA/postgresql.conf on those platforms. This article by Gregory Smith, author of PostgreSQL 9.0 High Performance, mirrors the general format of the official documentation's look at these parameters at http://www.postgresql.org/docs/current/static/runtime-config.html. However, it is more focused on guidelines for setting the most important values, from the perspective of someone interested in performance tuning, rather than describing the meaning of every parameter. This should be considered a supplement to, rather than a complete replacement for, the extensive material in the manual. It covers versions 8.1 through 9.0.

Logging

General logging setup is important but it is somewhat outside the scope of this article. You may need to set parameters such as log_destination, log_directory, and log_filename to save your log files in a way compatible with the system administration requirements of your environment. These will all be set to reasonable defaults to get started with on most systems. On UNIX-like systems, it's common for some of the database logging to be set in the script that starts and stops the server, rather than directly in the postgresql.conf file. If you instead use the pg_ctl command to manually start the server, you may discover that logging ends up on your screen instead. You'll need to look at the script that starts the server normally (commonly /etc/init.d/postgresql) to determine what it does, if you want to duplicate that behavior. In most cases, you just need to add -l logfilename to the pg_ctl command line to redirect its output to the standard location.

log_line_prefix

The default log_line_prefix is empty, which is not what you want. A good starting value here is the following:

log_line_prefix='%t:%r:%u@%d:[%p]: '

This will put the following into every log line:

%t: Timestamp
%r: Remote host the connection is from
%u: Database user name
%d: Database the connection is to
%p: Process ID of the connection

It may not be obvious what you'd want all of these values for initially, particularly the process ID. Once you've tried to chase down a few performance issues, the need for saving these values will be more obvious, and you'll be glad to already have this data logged. Another approach worth considering is setting log_line_prefix such that the resulting logs will be compatible with the pgFouine program. That is a reasonable, general-purpose logging prefix, and many sites end up needing to do some sort of query analysis eventually.

log_statement

The options for this setting are as follows:

none: Do not log any statement-level information.
ddl: Log only Data Definition Language (DDL) statements such as CREATE and DROP.
This can normally be left on even in production, and is handy to catch major changes introduced accidentally or intentionally by administrators.
mod: Log any statement that modifies a value, which is essentially everything except for simple SELECT statements. If your workload is mostly SELECT based with relatively few data changes, this may be practical to leave enabled all the time.
all: Log every statement. This is generally impractical to leave on in production due to the overhead of the logging. However, if your server is powerful enough relative to its workload, it may be practical to keep it on all the time.

Statement logging is a powerful technique for finding performance issues. Analyzing the information saved by log_statement and related sources for statement-level detail can reveal the true source of many types of performance issues. You will need to combine this with appropriate analysis tools.

log_min_duration_statement

Once you have some idea of how long a typical query statement should take to execute, this setting allows you to log only the ones that exceed some threshold you set. The value is in milliseconds, so you might set:

log_min_duration_statement=1000

And then you'll only see statements that take longer than one second to run. This can be extremely handy for finding out the source of "outlier" statements that take much longer than most to execute. If you are running 8.4 or later, you might instead prefer to use the auto_explain module (http://www.postgresql.org/docs/8.4/static/auto-explain.html) instead of this feature. This will allow you to actually see why the queries that are running slowly are doing so by viewing their associated EXPLAIN plans.

Vacuuming and statistics

PostgreSQL databases require two primary forms of regular maintenance as data is added, updated, and deleted. VACUUM cleans up after old transactions, including removing information that is no longer visible and returning freed space to where it can be re-used. The more often you UPDATE and DELETE information from the database, the more likely you'll need a regular vacuum cleaning regime. However, even static tables with data that never changes once inserted still need occasional care here. ANALYZE looks at tables in the database and collects statistics about them - information like estimates of how many rows they have and how many distinct values are in there. Many aspects of query planning depend on this statistics data being accurate.

autovacuum

As both these tasks are critical to database performance over the long term, starting in PostgreSQL 8.1 there is an autovacuum daemon available that will run in the background to handle these tasks for you. Its action is triggered by the number of changes to the database exceeding a threshold it calculates based on the existing table size. The autovacuum parameter is turned on by default starting in PostgreSQL 8.3, and the default settings are generally aggressive enough to work out of the box for smaller databases with little manual tuning. Generally you just need to be careful that the amount of data in the free space map doesn't exceed max_fsm_pages, and even that requirement is automated away from being a concern as of 8.4.
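To pull the settings discussed so far into one place, an illustrative postgresql.conf fragment might look like the following. The values are examples to adapt to your own workload rather than recommendations, and nothing here comes from the original article beyond the individual parameters it discusses.

# Logging
log_line_prefix = '%t:%r:%u@%d:[%p]: '
log_statement = 'ddl'
log_min_duration_statement = 1000   # only log statements slower than one second

# Routine maintenance
autovacuum = on                     # the default from 8.3 onwards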
Enabling autovacuum on older versions

If you have autovacuum available but it's not turned on by default, which will be the case with PostgreSQL 8.1 and 8.2, there are a few related parameters that must also be enabled for it to work, as covered in http://www.postgresql.org/docs/8.1/interactive/maintenance.html or http://www.postgresql.org/docs/8.2/interactive/routine-vacuuming.html. The normal trio to enable in the postgresql.conf file in these versions is:

stats_start_collector=true
stats_row_level=true
autovacuum=on

Note that as warned in the documentation, it's also wise to consider adjusting superuser_reserved_connections to allow for the autovacuum processes in these earlier versions. The autovacuum you'll get in 8.1 and 8.2 is not going to be as efficient as what comes in 8.3 and later. You can expect it to take some fine tuning to get the right balance of enough maintenance without too much overhead, and because there's only a single worker it's easier for it to fall behind on a busy server. This topic isn't covered at length here. It's generally a better idea to put time into planning an upgrade to a PostgreSQL version with a newer autovacuum than to try and tweak an old one extensively, particularly since there are many other performance issues that cannot be resolved easily in the older versions, too.

maintenance_work_mem

A few operations in the database server need working memory for larger operations than just regular sorting. VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY can all allocate up to maintenance_work_mem worth of memory instead. As it's unlikely that many sessions will be doing one of these operations at once, it's possible to set this value much higher than the standard per-client work_mem setting. Note that as many as autovacuum_max_workers sessions (defaulting to 3 starting in version 8.3) can each allocate this much memory, so consider those sessions (perhaps along with a session or two doing a CREATE INDEX) when setting this value. Assuming you haven't increased the number of autovacuum workers, a typical high setting for this value on a modern server would be five percent of the total RAM, so that even five such processes wouldn't exceed a quarter of available memory. This works out to approximately 50 MB of maintenance_work_mem per GB of server RAM.

default_statistics_target

PostgreSQL makes its decisions about how queries execute based on statistics collected about each table in your database. This information is collected by analyzing the tables, either with the ANALYZE statement or via autovacuum doing that step. In either case, the amount of information collected during the analyze step is set by default_statistics_target. Increasing this value makes analysis take longer, and as analysis via autovacuum happens regularly, this turns into increased background overhead for database maintenance. But if there aren't enough statistics about a table, you can get bad plans for queries against it. The default value for this setting used to be very low (that is, 10), but was increased to 100 in PostgreSQL 8.4. Using that larger value was popular in earlier versions, too, for generally improved query behavior. Indexes using the LIKE operator tended to work much better with values greater than 100 rather than below it, due to a hard-coded change at that threshold. Note that increasing this value does result in a net slowdown on your system if you're not ever running queries where the additional statistics result in a change to a better query plan.
This is one reason why some simple benchmarks show PostgreSQL 8.4 as slightly slower than 8.3 at default parameters for each, and in some cases you might return an 8.4 install to a smaller setting. Extremely large settings for default_statistics_target are discouraged due to the large overhead they incur. If there is just a particular column in a table that you know needs better statistics, you can use ALTER TABLE ... SET STATISTICS on that column to adjust this setting just for it. This works better than increasing the system-wide default and making every table pay for that requirement. Typically, the columns that really require a lot more statistics to work properly will need a setting near the maximum of 1000 (increased to 10,000 in later versions) to get a serious behavior change, which is far higher than you'd want to collect data for on every table in the database.
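As a sketch of that per-column approach, the statements below raise the statistics target for a single column and then re-analyze the table; the table and column names are hypothetical, and the value of 1000 is just the kind of high setting discussed above.

-- Raise the per-column statistics target instead of the system-wide default.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
-- Re-collect statistics so the planner can use the new target right away.
ANALYZE orders;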


How to perform Audio-Video-Image Scraping with Python

Amarabha Banerjee
08 Mar 2018
9 min read
Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step-by-step tutorials on how to leverage Python programming techniques for ethical web scraping.

A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content locally (or in a service like S3) and to do it correctly, we need to know what the type of the media is, and it isn't enough to trust the file extension in the URL. Hence, we will learn how to download and correctly represent the media type based on information from the web server. Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques of how to generate thumbnails and make website page screenshots. Many times these are used on a new website as thumbnail links to the scraped media which is stored locally. Finally, there is often a need to transcode media, such as converting non-MP4 videos to MP4, or changing the bit rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It's a simple step from there to also transcode video with ffmpeg.

Downloading media content from the web

Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content.

Getting ready

There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file.

How to do it

Here is how we proceed with the recipe. The URLUtility class can download content from a URL. The code in the recipe's file is the following:

import const
from util.urls import URLUtility

util = URLUtility(const.ApodEclipseImage())
print(len(util.data))

When running this you will see the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
171014

The example reads 171014 bytes of data.

How it works

The URL is defined as a constant const.ApodEclipseImage() in the const module:

def ApodEclipseImage():
    return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg"

The constructor of the URLUtility class has the following implementation:

def __init__(self, url, readNow=True):
    """Construct the object, parse the URL, and download now if specified"""
    self._url = url
    self._response = None
    self._parsed = urlparse(url)
    if readNow:
        self.read()

The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method:

def read(self):
    self._response = urllib.request.urlopen(self._url)
    self._data = self._response.read()

This function uses urlopen to get a response object, and then reads the stream and stores it as a property of the object. That data can then be retrieved using the data property:

@property
def data(self):
    self.ensure_response()
    return self._data

The code then simply reports on the length of that data, with the value of 171014.
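If you don't have the book's const and util modules on your path, a minimal standalone equivalent of the download step, using only the standard library and the same NASA APOD URL, would look something like this (a sketch, not the book's code):

import urllib.request

url = "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg"
print("Reading URL:", url)
response = urllib.request.urlopen(url)   # fetch the image over HTTPS
data = response.read()                   # read the whole body into memory
print("Read {} bytes".format(len(data)))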
There's more

This class will be used for other tasks such as determining content types, filenames, and extensions for those files. We will examine parsing of URLs for filenames next.

Parsing a URL with urllib to get the filename

When downloading content from a URL, we often want to save it in a file. Often it is good enough to save the content in a file with a name found in the URL. But the URL consists of a number of fragments, so how can we find the actual filename from the URL, especially when there are often many parameters after the file name?

Getting ready

We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py.

How to do it

Execute the recipe's file with your Python interpreter. It will run the following code:

util = URLUtility(const.ApodEclipseImage())
print(util.filename_without_ext)

This results in the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The filename is: BT5643s

How it works

In the constructor for URLUtility, there is a call to urllib.parse.urlparse. The following demonstrates using the function interactively:

>>> parsed = urlparse(const.ApodEclipseImage())
>>> parsed
ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='')

The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension:

@property
def filename_without_ext(self):
    filename = os.path.splitext(os.path.basename(self._parsed.path))[0]
    return filename

The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splitext() then separates the filename and the extension, and the function returns the first element of that tuple/list (the filename).

There's more

It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content that we received actually matches the implied type from the extension. It is more accurate to determine this using headers returned by the web server. That's our next recipe.

Determining the type of content for a URL

When performing a GET request for content from a web server, the web server will return a number of headers, one of which identifies the type of the content from the perspective of the web server. In this recipe we learn to use that to determine what the web server considers the type of the content.

Getting ready

We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py.

How to do it

We proceed as follows. Execute the script for the recipe. It contains the following code:

util = URLUtility(const.ApodEclipseImage())
print("The content type is: " + util.contenttype)

With the following result:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The content type is: image/jpeg

How it works

The .contenttype property is implemented as follows:

@property
def contenttype(self):
    self.ensure_response()
    return self._response.headers['content-type']

The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. The call to the ensure_response() method simply ensures that the .read() function has been executed.

There's more

The headers in a response contain a wealth of information.
If we look more closely at the headers property of the response, we can see the following headers are returned:

>>> response = urllib.request.urlopen(const.ApodEclipseImage())
>>> for header in response.headers: print(header)
Date
Server
Last-Modified
ETag
Accept-Ranges
Content-Length
Connection
Content-Type
Strict-Transport-Security

And we can see the values for each of these headers:

>>> for header in response.headers: print(header + " ==> " + response.headers[header])
Date ==> Tue, 26 Sep 2017 19:31:41 GMT
Server ==> WebServer/1.0
Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT
ETag ==> "547bb44-29c06-5581275ce2b86"
Accept-Ranges ==> bytes
Content-Length ==> 171014
Connection ==> close
Content-Type ==> image/jpeg
Strict-Transport-Security ==> max-age=31536000; includeSubDomains

Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist.

Determining the file extension from a content type

It is good practice to use the content-type header to determine the type of content, and to determine the extension to use for storing the content as a file.

Getting ready

We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py.

How to do it

Proceed by running the recipe's script. An extension for the media type can be found using the extension properties:

util = URLUtility(const.ApodEclipseImage())
print("Filename from content-type: " + util.extension_from_contenttype)
print("Filename from url: " + util.extension_from_url)

This results in the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Filename from content-type: .jpg
Filename from url: .jpg

This reports both the extension determined from the file type, and also from the URL. These can be different, but in this case they are the same.

How it works

The following is the implementation of the .extension_from_contenttype property:

@property
def extension_from_contenttype(self):
    self.ensure_response()
    map = const.ContentTypeToExtensions()
    if self.contenttype in map:
        return map[self.contenttype]
    return None

The first line ensures that we have read the response from the URL. The function then uses a Python dictionary, defined in the const module, which maps content types to extensions:

def ContentTypeToExtensions():
    return {
        "image/jpeg": ".jpg",
        "image/jpg": ".jpg",
        "image/png": ".png"
    }

If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url:

@property
def extension_from_url(self):
    ext = os.path.splitext(os.path.basename(self._parsed.path))[1]
    return ext

This uses the same technique as the .filename_without_ext property to parse the URL, but instead returns the [1] element, which represents the extension instead of the base filename.

To summarize, we discussed how effectively we can scrape audio, video, and image content from the web using Python. If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.


Managing a Hadoop Cluster

Packt
30 Aug 2013
13 min read
From the perspective of functionality, a Hadoop cluster is composed of an HDFS cluster and a MapReduce cluster. The HDFS cluster provides the default filesystem for Hadoop. It has one or more NameNodes to keep track of the filesystem metadata, while actual data blocks are stored on distributed slave nodes managed by DataNodes. Similarly, a MapReduce cluster has one JobTracker daemon on the master node and a number of TaskTrackers on the slave nodes. The JobTracker manages the life cycle of MapReduce jobs. It splits jobs into smaller tasks and schedules the tasks to run by the TaskTrackers. A TaskTracker executes tasks assigned by the JobTracker in parallel by forking one or a number of JVM processes. As a Hadoop cluster administrator, you will be responsible for managing both the HDFS cluster and the MapReduce cluster. In general, system administrators should maintain the health and availability of the cluster. More specifically, for an HDFS cluster it means the management of the NameNodes and DataNodes, and the management of the JobTrackers and TaskTrackers for MapReduce. Other administrative tasks include the management of Hadoop jobs, for example, configuring the job scheduling policy with schedulers.

Managing the HDFS cluster

The health of HDFS is critical for a Hadoop-based Big Data platform. HDFS problems can negatively affect the efficiency of the cluster. Even worse, they can make the cluster not function properly. For example, DataNode unavailability caused by network segmentation can lead to some under-replicated data blocks. When this happens, HDFS will automatically replicate those data blocks, which will bring a lot of overhead to the cluster and cause the cluster to be too unstable to be available for use. In this recipe, we will show commands to manage an HDFS cluster.

Getting ready

Before getting started, we assume that our Hadoop cluster has been properly configured and all the daemons are running without any problems. Log in to the master node from the administrator machine with the following command:

ssh hduser@master

How to do it...

Use the following steps to check the status of an HDFS cluster with hadoop fsck. Check the status of the root filesystem with the following command:

hadoop fsck /

We will get an output similar to the following:

FSCK started by hduser from /10.147.166.55 for path / at Thu Feb 28 17:14:11 EST 2013
.. /user/hduser/.staging/job_201302281211_0002/job.jar: Under replicated blk_-665238265064328579_1016. Target Replicas is 10 but found 5 replica(s).
.................................Status: HEALTHY
Total size: 14420321969 B
Total dirs: 22
Total files: 35
Total blocks (validated): 241 (avg. block size 59835360 B)
Minimally replicated blocks: 241 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 2 (0.8298755 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0248964
Corrupt blocks: 0
Missing replicas: 10 (2.0491803 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu Feb 28 17:14:11 EST 2013 in 28 milliseconds

The filesystem under path '/' is HEALTHY

The output shows that some percentage of data blocks is under-replicated. But because HDFS can automatically make duplicates of those data blocks, the HDFS filesystem and the '/' directory are both HEALTHY.
Check the status of all the files on HDFS with the following command:

hadoop fsck / -files

We will get an output similar to the following:

FSCK started by hduser from /10.147.166.55 for path / at Thu Feb 28 17:40:35 EST 2013
/ <dir>
/home <dir>
/home/hduser <dir>
/home/hduser/hadoop <dir>
/home/hduser/hadoop/tmp <dir>
/home/hduser/hadoop/tmp/mapred <dir>
/home/hduser/hadoop/tmp/mapred/system <dir>
/home/hduser/hadoop/tmp/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK
/user <dir>
/user/hduser <dir>
/user/hduser/randtext <dir>
/user/hduser/randtext/_SUCCESS 0 bytes, 0 block(s): OK
/user/hduser/randtext/_logs <dir>
/user/hduser/randtext/_logs/history <dir>
/user/hduser/randtext/_logs/history/job_201302281451_0002_1362090421087_hduser_random-text-writer 23995 bytes, 1 block(s): OK
/user/hduser/randtext/_logs/history/job_201302281451_0002_conf.xml 22878 bytes, 1 block(s): OK
/user/hduser/randtext/part-00001 1102231864 bytes, 17 block(s): OK
Status: HEALTHY

Hadoop will scan and list all the files in the cluster. This command scans all files on HDFS and prints the size and status of each.

Check the locations of file blocks with the following command:

hadoop fsck / -files -locations

The output of this command will contain the following information: the first line tells us that file part-00000 has 17 blocks in total and each block has 2 replications (the replication factor has been set to 2). The following lines list the location of each block on the DataNodes. For example, block blk_6733127705602961004_1127 has been replicated on hosts 10.145.231.46 and 10.145.223.184. The number 50010 is the port number of the DataNode.

Check the locations of file blocks containing rack information with the following command:

hadoop fsck / -files -blocks -racks

Delete corrupted files with the following command:

hadoop fsck -delete

Move corrupted files to /lost+found with the following command:

hadoop fsck -move

Use the following steps to check the status of an HDFS cluster with hadoop dfsadmin. Report the status of each slave node with the following command:

hadoop dfsadmin -report

The output will be similar to the following:

Configured Capacity: 422797230080 (393.76 GB)
Present Capacity: 399233617920 (371.82 GB)
DFS Remaining: 388122796032 (361.47 GB)
DFS Used: 11110821888 (10.35 GB)
DFS Used%: 2.78%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 5 (5 total, 0 dead)
Name: 10.145.223.184:50010
Decommission Status : Normal
Configured Capacity: 84559446016 (78.75 GB)
DFS Used: 2328719360 (2.17 GB)
Non DFS Used: 4728565760 (4.4 GB)
DFS Remaining: 77502160896 (72.18 GB)
DFS Used%: 2.75%
DFS Remaining%: 91.65%
Last contact: Thu Feb 28 20:30:11 EST 2013
...

The first section of the output shows the summary of the HDFS cluster, including the configured capacity, present capacity, remaining capacity, used space, number of under-replicated data blocks, number of data blocks with corrupted replicas, and number of missing blocks. The following sections of the output show the status of each HDFS slave node, including the name (ip:port) of the DataNode machine, commission status, configured capacity, HDFS and non-HDFS used space amount, HDFS remaining space, and the time that the slave node last contacted the master.
Refresh all the DataNodes using the following command:

hadoop dfsadmin -refreshNodes

Check the status of the safe mode using the following command:

hadoop dfsadmin -safemode get

We will be able to get the following output:

Safe mode is OFF

The output tells us that the NameNode is not in safe mode. In this case, the filesystem is both readable and writable. If the NameNode is in safe mode, the filesystem will be read-only (write protected).

Manually put the NameNode into safe mode using the following command:

hadoop dfsadmin -safemode enter

This command is useful for system maintenance.

Make the NameNode leave safe mode using the following command:

hadoop dfsadmin -safemode leave

If the NameNode has been in safe mode for a long time or it has been put into safe mode manually, we need to use this command to let the NameNode leave this mode.

Wait until the NameNode leaves safe mode using the following command:

hadoop dfsadmin -safemode wait

This command is useful when we want to wait until HDFS finishes data block replication, or wait until a newly commissioned DataNode is ready for service.

Save the metadata of the HDFS filesystem with the following command:

hadoop dfsadmin -metasave meta.log

The meta.log file will be created under the directory $HADOOP_HOME/logs. Its content will be similar to the following:

21 files and directories, 88 blocks = 109 total
Live Datanodes: 5
Dead Datanodes: 0
Metasave: Blocks waiting for replication: 0
Metasave: Blocks being replicated: 0
Metasave: Blocks 0 waiting deletion from 0 datanodes.
Metasave: Number of datanodes: 5
10.145.223.184:50010 IN 84559446016(78.75 GB) 2328719360(2.17 GB) 2.75% 77502132224(72.18 GB) Thu Feb 28 21:43:52 EST 2013
10.152.166.137:50010 IN 84559446016(78.75 GB) 2357415936(2.2 GB) 2.79% 77492854784(72.17 GB) Thu Feb 28 21:43:52 EST 2013
10.145.231.46:50010 IN 84559446016(78.75 GB) 2048004096(1.91 GB) 2.42% 77802893312(72.46 GB) Thu Feb 28 21:43:54 EST 2013
10.152.161.43:50010 IN 84559446016(78.75 GB) 2250854400(2.1 GB) 2.66% 77600096256(72.27 GB) Thu Feb 28 21:43:52 EST 2013
10.152.175.122:50010 IN 84559446016(78.75 GB) 2125828096(1.98 GB) 2.51% 77724323840(72.39 GB) Thu Feb 28 21:43:53 EST 2013
...

How it works...

The HDFS filesystem will be write protected when the NameNode enters safe mode. When an HDFS cluster is started, it will enter safe mode first. The NameNode will check the replication factor for each data block. If the replica count of a data block is smaller than the configured value, which is 3 by default, the data block will be marked as under-replicated. Finally, an under-replication factor, which is the percentage of under-replicated data blocks, will be calculated. If the percentage number is larger than the threshold value, the NameNode will stay in safe mode until enough new replicas are created for the under-replicated data blocks so as to make the under-replication factor lower than the threshold.
We can get the usage of the fsck command using:

hadoop fsck

The usage information will be similar to the following:

Usage: DFSck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
<path>          start checking from this path
-move           move corrupted files to /lost+found
-delete         delete corrupted files
-files          print out files being checked
-openforwrite   print out files opened for write
-blocks         print out block report
-locations      print out locations for every block
-racks          print out network topology for data-node locations
By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually tagged CORRUPT or HEALTHY depending on their block allocation status.

We can get the usage of the dfsadmin command using:

hadoop dfsadmin

The output will be similar to the following:

Usage: java DFSAdmin
[-report]
[-safemode enter | leave | get | wait]
[-saveNamespace]
[-refreshNodes]
[-finalizeUpgrade]
[-upgradeProgress status | details | force]
[-metasave filename]
[-refreshServiceAcl]
[-refreshUserToGroupsMappings]
[-refreshSuperUserGroupsConfiguration]
[-setQuota <quota> <dirname>...<dirname>]
[-clrQuota <dirname>...<dirname>]
[-setSpaceQuota <quota> <dirname>...<dirname>]
[-clrSpaceQuota <dirname>...<dirname>]
[-setBalancerBandwidth <bandwidth in bytes per second>]
[-help [cmd]]

There's more…

Besides using the command line, we can use the web UI to check the status of an HDFS cluster. For example, we can get the status information of HDFS by opening the link http://master:50070/dfshealth.jsp. We will get a web page that shows the summary of the HDFS cluster, such as the configured capacity and remaining space. By clicking on the Live Nodes link, we can check the status of each DataNode, and by clicking on the link of each node, we can browse the directory of the HDFS filesystem. For example, the page for the file /user/hduser/randtext shows that it has been split into five partitions, and we can browse the content of each partition by clicking on the part-0000x link.

Configuring SecondaryNameNode

The Hadoop NameNode is a single point of failure. By configuring SecondaryNameNode, the filesystem image and edit log files can be backed up periodically. And in case of NameNode failure, the backup files can be used to recover the NameNode. In this recipe, we will outline steps to configure SecondaryNameNode.

Getting ready

We assume that Hadoop has been configured correctly. Log in to the master node from the cluster administration machine using the following command:

ssh hduser@master

How to do it...

Perform the following steps to configure SecondaryNameNode. Stop the cluster using the following command:

stop-all.sh

Add or change the following in the file $HADOOP_HOME/conf/hdfs-site.xml:

<property>
  <name>fs.checkpoint.dir</name>
  <value>/hadoop/dfs/namesecondary</value>
</property>

If this property is not set explicitly, the default checkpoint directory will be ${hadoop.tmp.dir}/dfs/namesecondary.
Start the cluster using the following command:

start-all.sh

The tree structure of the NameNode data directory will be similar to the following:

${dfs.name.dir}/
├── current
│   ├── edits
│   ├── fsimage
│   ├── fstime
│   └── VERSION
├── image
│   └── fsimage
├── in_use.lock
└── previous.checkpoint
    ├── edits
    ├── fsimage
    ├── fstime
    └── VERSION

And the tree structure of the SecondaryNameNode data directory will be similar to the following:

${fs.checkpoint.dir}/
├── current
│   ├── edits
│   ├── fsimage
│   ├── fstime
│   └── VERSION
├── image
│   └── fsimage
└── in_use.lock

There's more...

To increase redundancy, we can configure the NameNode to write filesystem metadata to multiple locations. For example, we can add an NFS shared directory for backup by changing the following property in the file $HADOOP_HOME/conf/hdfs-site.xml:

<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/dfs/name,/nfs/name</value>
</property>

Managing the MapReduce cluster

A typical MapReduce cluster is composed of one master node that runs the JobTracker and a number of slave nodes that run TaskTrackers. The task of managing a MapReduce cluster includes maintaining the health as well as the membership between TaskTrackers and the JobTracker. In this recipe, we will outline commands to manage a MapReduce cluster.

Getting ready

We assume that the Hadoop cluster has been properly configured and is running. Log in to the master node from the cluster administration machine using the following command:

ssh hduser@master

How to do it...

Perform the following steps to manage a MapReduce cluster. List all the active TaskTrackers using the following command:

hadoop job -list-active-trackers

This command can help us check the registration status of the TaskTrackers in the cluster.

Check the status of the JobTracker safe mode using the following command:

hadoop mradmin -safemode get

We will get the following output:

Safe mode is OFF

The output tells us that the JobTracker is not in safe mode. We can submit jobs to the cluster. If the JobTracker is in safe mode, no jobs can be submitted to the cluster.

Manually let the JobTracker enter safe mode using the following command:

hadoop mradmin -safemode enter

This command is handy when we want to maintain the cluster.

Let the JobTracker leave safe mode using the following command:

hadoop mradmin -safemode leave

When maintenance tasks are done, you need to run this command. If we want to wait for safe mode to exit, the following command can be used:

hadoop mradmin -safemode wait

Reload the MapReduce queue configuration using the following command:

hadoop mradmin -refreshQueues

Reload active TaskTrackers using the following command:

hadoop mradmin -refreshNodes

How it works...

Get the usage of the mradmin command using the following:

hadoop mradmin

The usage information will be similar to the following:

Usage: java MRAdmin
[-refreshServiceAcl]
[-refreshQueues]
[-refreshUserToGroupsMappings]
[-refreshSuperUserGroupsConfiguration]
[-refreshNodes]
[-safemode <enter | leave | get | wait>]
[-help [cmd]]
...

The meaning of the command options is listed in the following table:

Option | Description
-refreshServiceAcl | Force JobTracker to reload the service ACL.
-refreshQueues | Force JobTracker to reload queue configurations.
-refreshUserToGroupsMappings | Force JobTracker to reload user group mappings.
-refreshSuperUserGroupsConfiguration | Force JobTracker to reload super user group mappings.
-refreshNodes | Force JobTracker to refresh the JobTracker hosts.
-help [cmd] | Show the help info for a command or all commands.
Summary

In this article, we learned about managing the HDFS cluster, configuring SecondaryNameNode, and managing the MapReduce cluster. A Hadoop cluster administrator is responsible for managing both the HDFS cluster and the MapReduce cluster, and must know how to manage these in order to maintain the health and availability of the cluster. More specifically, for an HDFS cluster this means the management of the NameNodes and DataNodes, and for MapReduce the management of the JobTracker and TaskTrackers, all of which is covered in this article.


Create an RNN based Python machine translation system [Tutorial]

Sunith Shetty
20 Jul 2018
22 min read
Machine translation is a process which uses neural network techniques to automatically translate text from one language to another, with no human intervention required. In today's machine learning tutorial, we will understand the architecture and learn how to train and build your own machine translation system. This project will help us automatically translate German to produce English sentences. This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects.

Walkthrough of the architecture

A machine translation system receives as input an arbitrary string in one language and produces, as output, a string with the same meaning but in another language. Google Translate is one example (but many other major IT companies have their own). There, users are able to translate to and from more than 100 languages. Using the webpage is easy: on the left just put the sentence you want to translate (for example, Hello World), select its language (in the example, it's English), and select the language you want it to be translated to. Here's an example where we translate the sentence Hello World to French.

Is it easy? At a glance, we may think it's a simple dictionary substitution. Words are chunked, the translation is looked up in the specific English-to-French dictionary, and each word is substituted with its translation. Unfortunately, that's not the case. In the example, the English sentence has two words, while the French one has three. More generically, think about phrasal verbs (turn up, turn off, turn on, turn down), Saxon genitive, grammatical gender, tenses, conditional sentences... they don't always have a direct translation, and the correct one should follow the context of the sentence.

That's why, for doing machine translation, we need some artificial intelligence tools. Specifically, as for many other natural language processing (NLP) tasks, we'll be using recurrent neural networks (RNNs). The main feature they have is that they work on sequences: given an input sequence, they produce an output sequence. The objective of this article is to create the correct training pipeline for having a sentence as the input sequence, and its translation as the output one. Remember also the no free lunch theorem: this process isn't easy, and more solutions can be created with the same result. Here, in this article, we will propose a simple but powerful one.

First of all, we start with the corpora: it's maybe the hardest thing to find, since it should contain a high-fidelity translation of many sentences from one language to another. Fortunately, NLTK, a well-known Python package for NLP, contains the corpora Comtrans. Comtrans is an acronym for combination approach to machine translation, and contains an aligned corpus for three languages: German, French, and English. In this project, we will use these corpora for a few reasons, as follows:

It's easy to download and import in Python.
No preprocessing is needed to read it from disk / from the internet. NLTK already handles that part.
It's small enough to be used on many laptops (a few tens of thousands of sentences).
It's freely available on the internet.

For more information about the Comtrans project, go to http://www.fask.uni-mainz.de/user/rapp/comtrans/. More specifically, we will try to create a machine translation system to translate German to English.
We picked these two languages at random among the ones available in the Comtrans corpora: feel free to flip them, or use the French corpora instead. The pipeline of our project is generic enough to handle any combination. Let's now investigate how the corpora is organized by typing some commands:

from nltk.corpus import comtrans
print(comtrans.aligned_sents('alignment-de-en.txt')[0])

The output is as follows:

<AlignedSent: 'Wiederaufnahme der S...' -> 'Resumption of the se...'>

The pairs of sentences are available using the function aligned_sents. The filename contains the from and to languages. In this case, as for the following part of the project, we will translate German (de) to English (en). The returned object is an instance of the class nltk.translate.api.AlignedSent. By looking at the documentation, the first language is accessible with the attribute words, while the second language is accessible with the attribute mots. So, to extract the German sentence and its English translation separately, we should run:

print(comtrans.aligned_sents()[0].words)
print(comtrans.aligned_sents()[0].mots)

The preceding code outputs:

['Wiederaufnahme', 'der', 'Sitzungsperiode']
['Resumption', 'of', 'the', 'session']

How nice! The sentences are already tokenized, and they look like sequences. In fact, they will be the input and (hopefully) the output of the RNN which will provide the service of machine translation from German to English for our project. Furthermore, if you want to understand the dynamics of the language, Comtrans makes available the alignment of the words in the translation:

print(comtrans.aligned_sents()[0].alignment)

The preceding code outputs:

0-0 1-1 1-2 2-3

The first word in German is translated to the first word in English (Wiederaufnahme to Resumption), the second to the second and third (der to both of and the), and the third (at index 2) is translated to the fourth (Sitzungsperiode to session).

Pre-processing of the corpora

The first step is to retrieve the corpora. We've already seen how to do this, but let's now formalize it in a function. To make it generic enough, let's enclose these functions in a file named corpora_tools.py. Let's do some imports that we will use later on:

import pickle
import re
from collections import Counter
from nltk.corpus import comtrans

Now, let's create the function to retrieve the corpora:

def retrieve_corpora(translated_sentences_l1_l2='alignment-de-en.txt'):
    print("Retrieving corpora: {}".format(translated_sentences_l1_l2))
    als = comtrans.aligned_sents(translated_sentences_l1_l2)
    sentences_l1 = [sent.words for sent in als]
    sentences_l2 = [sent.mots for sent in als]
    return sentences_l1, sentences_l2

This function has one argument: the file containing the aligned sentences from the NLTK Comtrans corpora. It returns two lists of sentences (actually, they're lists of tokens), one for the source language (in our case, German), the other for the destination language (in our case, English). On a separate Python REPL, we can test this function:

sen_l1, sen_l2 = retrieve_corpora()
print("# A sentence in the two languages DE & EN")
print("DE:", sen_l1[0])
print("EN:", sen_l2[0])
print("# Corpora length (i.e. number of sentences)")
print(len(sen_l1))
assert len(sen_l1) == len(sen_l2)

The preceding code creates the following output:

Retrieving corpora: alignment-de-en.txt
# A sentence in the two languages DE & EN
DE: ['Wiederaufnahme', 'der', 'Sitzungsperiode']
EN: ['Resumption', 'of', 'the', 'session']
# Corpora length (i.e. number of sentences)
33334
number of sentences) 33334 We also printed the number of sentences in each corpus (more than 33,000) and asserted that the number of sentences in the source and the destination languages is the same. In the following step, we want to clean up the tokens. Specifically, we want to tokenize punctuation and lowercase the tokens. To do so, we can create a new function in corpora_tools.py. We will use the re module to perform the further splitting tokenization (note that the double quote inside the character class has to be escaped): def clean_sentence(sentence): regex_splitter = re.compile("([!?.,:;$\"')( ])") clean_words = [re.split(regex_splitter, word.lower()) for word in sentence] return [w for words in clean_words for w in words if words if w] Again, in the REPL, let's test the function: clean_sen_l1 = [clean_sentence(s) for s in sen_l1] clean_sen_l2 = [clean_sentence(s) for s in sen_l2] print("# Same sentence as before, but chunked and cleaned") print("DE:", clean_sen_l1[0]) print("EN:", clean_sen_l2[0]) The preceding code outputs the same sentence as before, but chunked and cleaned: DE: ['wiederaufnahme', 'der', 'sitzungsperiode'] EN: ['resumption', 'of', 'the', 'session'] Nice! The next step for this project is filtering out the sentences that are too long to be processed. Since our goal is to perform the processing on a local machine, we should limit ourselves to sentences up to N tokens. In this case, we set N=20, in order to be able to train the learner within 24 hours. If you have a powerful machine, feel free to increase that limit. To make the function generic enough, there's also a lower bound, with a default value set to 0, that is, an empty token set. The logic of the function is very simple: if the number of tokens for a sentence or its translation is greater than N, then the sentence (in both languages) is removed: def filter_sentence_length(sentences_l1, sentences_l2, min_len=0, max_len=20): filtered_sentences_l1 = [] filtered_sentences_l2 = [] for i in range(len(sentences_l1)): if min_len <= len(sentences_l1[i]) <= max_len and min_len <= len(sentences_l2[i]) <= max_len: filtered_sentences_l1.append(sentences_l1[i]) filtered_sentences_l2.append(sentences_l2[i]) return filtered_sentences_l1, filtered_sentences_l2 Again, let's see in the REPL how many sentences survived this filter. Remember, we started with more than 33,000: filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2) print("# Filtered Corpora length (i.e. number of sentences)") print(len(filt_clean_sen_l1)) assert len(filt_clean_sen_l1) == len(filt_clean_sen_l2) The preceding code prints the following output: # Filtered Corpora length (i.e. number of sentences) 14788 Almost 15,000 sentences survived, that is, about half of the corpus. Now, we finally move from text to numbers (which is what the algorithms actually work with). To do so, we shall create a dictionary of the words for each language. The dictionary should be big enough to contain most of the words, though we can discard some if the language has words with low occurrence. This is a common practice even with tf-idf (term frequency within a document, multiplied by the inverse of the document frequency, that is, in how many documents that token appears), where very rare words are discarded to speed up the computation and make the solution more scalable and generic.
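To get a feel for how aggressive the dictionary cut will be, it helps to look at the token frequency distribution first. The following sketch reuses the filtered corpora created above (the filt_clean_sen_l1 and filt_clean_sen_l2 lists) and simply counts how many distinct tokens there are and how many of them appear only once; it is an exploratory helper, not part of the book's pipeline:

from collections import Counter

def vocabulary_report(sentences, label):
    # Flatten the list of token lists and count each token's frequency.
    counts = Counter(token for sentence in sentences for token in sentence)
    singletons = sum(1 for c in counts.values() if c == 1)
    print("{}: {} distinct tokens, {} of them appear only once".format(label, len(counts), singletons))

vocabulary_report(filt_clean_sen_l1, "DE")
vocabulary_report(filt_clean_sen_l2, "EN")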
We need here four special symbols in both dictionaries: One symbol for padding (we'll see later why we need it) One symbol for dividing the two sentences One symbol to indicate where the sentence stops One symbol to indicate unknown words (like the very rare ones) For doing so, let's create a new file named data_utils.py containing the following lines of code: _PAD = "_PAD" _GO = "_GO" _EOS = "_EOS" _UNK = "_UNK" _START_VOCAB = [_PAD, _GO, _EOS, _UNK] PAD_ID = 0 GO_ID = 1 EOS_ID = 2 UNK_ID = 3 OP_DICT_IDS = [PAD_ID, GO_ID, EOS_ID, UNK_ID] Then, back to the corpora_tools.py file, let's add the following function: import data_utils def create_indexed_dictionary(sentences, dict_size=10000, storage_path=None): count_words = Counter() dict_words = {} opt_dict_size = len(data_utils.OP_DICT_IDS) for sen in sentences: for word in sen: count_words[word] += 1 dict_words[data_utils._PAD] = data_utils.PAD_ID dict_words[data_utils._GO] = data_utils.GO_ID dict_words[data_utils._EOS] = data_utils.EOS_ID dict_words[data_utils._UNK] = data_utils.UNK_ID for idx, item in enumerate(count_words.most_common(dict_size)): dict_words[item[0]] = idx + opt_dict_size if storage_path: pickle.dump(dict_words, open(storage_path, "wb")) return dict_words This function takes as arguments the number of entries in the dictionary and the path of where to store the dictionary. Remember, the dictionary is created while training the algorithms: during the testing phase it's loaded, and the association token/symbol should be the same one as used in the training. If the number of unique tokens is greater than the value set, only the most popular ones are selected. At the end, the dictionary contains the association between a token and its ID for each language. After building the dictionary, we should look up the tokens and substitute them with their token ID. For that, we need another function: def sentences_to_indexes(sentences, indexed_dictionary): indexed_sentences = [] not_found_counter = 0 for sent in sentences: idx_sent = [] for word in sent: try: idx_sent.append(indexed_dictionary[word]) except KeyError: idx_sent.append(data_utils.UNK_ID) not_found_counter += 1 indexed_sentences.append(idx_sent) print('[sentences_to_indexes] Did not find {} words'.format(not_found_counter)) return indexed_sentences This step is very simple; the token is substituted with its ID. If the token is not in the dictionary, the ID of the unknown token is used. Let's see in the REPL how our sentences look after these steps: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=15000, storage_path="/tmp/l1_dict.p") dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=10000, storage_path="/tmp/l2_dict.p") idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) print("# Same sentences as before, with their dictionary ID") print("DE:", list(zip(filt_clean_sen_l1[0], idx_sentences_l1[0]))) This code prints the token and its ID for both the sentences. What's used in the RNN will be just the second element of each tuple, that is, the integer ID: # Same sentences as before, with their dictionary ID DE: [('wiederaufnahme', 1616), ('der', 7), ('sitzungsperiode', 618)] EN: [('resumption', 1779), ('of', 8), ('the', 5), ('session', 549)] Please also note how frequent tokens, such as the and of in English, and der in German, have a low ID. That's because the IDs are sorted by popularity (see the body of the function create_indexed_dictionary). 
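At testing and decoding time we will also need to go the other way, from IDs back to tokens, so it is handy to keep an inverted dictionary around. The helper below is a small sketch of that reverse lookup; it reuses the dict_l2 and idx_sentences_l2 objects created above and is an addition for illustration, not part of the original listing:

def invert_dictionary(indexed_dictionary):
    # Swap the token -> ID mapping into an ID -> token mapping.
    return {idx: token for token, idx in indexed_dictionary.items()}

def indexes_to_sentence(indexes, inverted_dictionary):
    # Turn a sequence of IDs back into tokens, keeping special symbols visible.
    return [inverted_dictionary.get(idx, "_UNK") for idx in indexes]

inv_dict_l2 = invert_dictionary(dict_l2)
print(indexes_to_sentence(idx_sentences_l2[0], inv_dict_l2))
# Expected: ['resumption', 'of', 'the', 'session']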
Even though we did the filtering to limit the maximum size of the sentences, we should create a function to extract the maximum size. For the lucky owners of very powerful machines, which didn't do any filtering, that's the moment to see how long the longest sentence in the RNN will be. That's simply the function: def extract_max_length(corpora): return max([len(sentence) for sentence in corpora]) Let's apply the following to our sentences: max_length_l1 = extract_max_length(idx_sentences_l1) max_length_l2 = extract_max_length(idx_sentences_l2) print("# Max sentence sizes:") print("DE:", max_length_l1) print("EN:", max_length_l2) As expected, the output is: # Max sentence sizes: DE: 20 EN: 20 The final preprocessing step is padding. We need all the sequences to be the same length, therefore we should pad the shorter ones. Also, we need to insert the correct tokens to instruct the RNN where the string begins and ends. Basically, this step should: Pad the input sequences, for all being 20 symbols long Pad the output sequence, to be 20 symbols long Insert an _GO at the beginning of the output sequence and an _EOS at the end to position the start and the end of the translation This is done by this function (insert it in the corpora_tools.py): def prepare_sentences(sentences_l1, sentences_l2, len_l1, len_l2): assert len(sentences_l1) == len(sentences_l2) data_set = [] for i in range(len(sentences_l1)): padding_l1 = len_l1 - len(sentences_l1[i]) pad_sentence_l1 = ([data_utils.PAD_ID]*padding_l1) + sentences_l1[i] padding_l2 = len_l2 - len(sentences_l2[i]) pad_sentence_l2 = [data_utils.GO_ID] + sentences_l2[i] + [data_utils.EOS_ID] + ([data_utils.PAD_ID] * padding_l2) data_set.append([pad_sentence_l1, pad_sentence_l2]) return data_set To test it, let's prepare the dataset and print the first sentence: data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) print("# Prepared minibatch with paddings and extra stuff") print("DE:", data_set[0][0]) print("EN:", data_set[0][1]) print("# The sentence pass from X to Y tokens") print("DE:", len(idx_sentences_l1[0]), "->", len(data_set[0][0])) print("EN:", len(idx_sentences_l2[0]), "->", len(data_set[0][1])) The preceding code outputs the following: # Prepared minibatch with paddings and extra stuff DE: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1616, 7, 618] EN: [1, 1779, 8, 5, 549, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # The sentence pass from X to Y tokens DE: 3 -> 20 EN: 4 -> 22 As you can see, the input and the output are padded with zeros to have a constant length (in the dictionary, they correspond to _PAD, see data_utils.py), and the output contains the markers 1 and 2 just before the start and the end of the sentence. As proven effective in the literature, we're going to pad the input sentences at the start and the output sentences at the end. After this operation, all the input sentences are 20 items long, and the output sentences 22. Training the machine translator So far, we've seen the steps to preprocess the corpora, but not the model used. The model is actually already available on the TensorFlow Models repository, freely downloadable from https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/seq2seq_model.py. The piece of code is licensed with Apache 2.0. We really thank the authors for having open sourced such a great model. Copyright 2015 The TensorFlow Authors. All Rights Reserved. 
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. We will see the usage of the model throughout this section. First, let's create a new file named train_translator.py and put in some imports and some constants. We will save the dictionary in the /tmp/ directory, as well as the model and its checkpoints: import time import math import sys import pickle import glob import os import tensorflow as tf from seq2seq_model import Seq2SeqModel from corpora_tools import * path_l1_dict = "/tmp/l1_dict.p" path_l2_dict = "/tmp/l2_dict.p" model_dir = "/tmp/translate" model_checkpoints = model_dir + "/translate.ckpt" Now, let's use all the tools created in the previous section within a function that, given a Boolean flag, returns the corpora. More specifically, if the argument is False, it builds the dictionary from scratch (and saves it); otherwise, it uses the dictionary available at the path: def build_dataset(use_stored_dictionary=False): sen_l1, sen_l2 = retrieve_corpora() clean_sen_l1 = [clean_sentence(s) for s in sen_l1] clean_sen_l2 = [clean_sentence(s) for s in sen_l2] filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2) if not use_stored_dictionary: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=15000, storage_path=path_l1_dict) dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=10000, storage_path=path_l2_dict) else: dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2_length = len(dict_l2) idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) max_length_l1 = extract_max_length(idx_sentences_l1) max_length_l2 = extract_max_length(idx_sentences_l2) data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) return (filt_clean_sen_l1, filt_clean_sen_l2), data_set, (max_length_l1, max_length_l2), (dict_l1_length, dict_l2_length) This function returns the cleaned sentences, the dataset, the maximum length of the sentences, and the lengths of the dictionaries. Also, we need a function to clean up the model directory. Every time we run the training routine we need to clean it up, since we don't provide any garbage collection for old checkpoints.
We can do this with a very simple function: def cleanup_checkpoints(model_dir, model_checkpoints): for f in glob.glob(model_checkpoints + "*"): os.remove(f) try: os.mkdir(model_dir) except FileExistsError: pass Finally, let's create the model in a reusable fashion: def get_seq2seq_model(session, forward_only, dict_lengths, max_sentence_lengths, model_dir): model = Seq2SeqModel( source_vocab_size=dict_lengths[0], target_vocab_size=dict_lengths[1], buckets=[max_sentence_lengths], size=256, num_layers=2, max_gradient_norm=5.0, batch_size=64, learning_rate=0.5, learning_rate_decay_factor=0.99, forward_only=forward_only, dtype=tf.float16) ckpt = tf.train.get_checkpoint_state(model_dir) if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path): print("Reading model parameters from {}".format(ckpt.model_checkpoint_path)) model.saver.restore(session, ckpt.model_checkpoint_path) else: print("Created model with fresh parameters.") session.run(tf.global_variables_initializer()) return model This function calls the constructor of the model, passing the following parameters: The source vocabulary size (German, in our example) The target vocabulary size (English, in our example) The buckets (in our example is just one, since we padded all the sequences to a single size) The long short-term memory (LSTM) internal units size The number of stacked LSTM layers The maximum norm of the gradient (for gradient clipping) The mini-batch size (that is, how many observations for each training step) The learning rate The learning rate decay factor The direction of the model The type of data (in our example, we will use flat16, that is, float using 2 bytes) To make the training faster and obtain a model with good performance, we have already set the values in the code; feel free to change them and see how it performs. The final if/else in the function retrieves the model, from its checkpoint, if the model already exists. In fact, this function will be used in the decoder too to retrieve and model on the test set. Finally, we have reached the function to train the machine translator. Here it is: def train(): with tf.Session() as sess: model = get_seq2seq_model(sess, False, dict_lengths, max_sentence_lengths, model_dir) # This is the training loop. step_time, loss = 0.0, 0.0 current_step = 0 bucket = 0 steps_per_checkpoint = 100 max_steps = 20000 while current_step < max_steps: start_time = time.time() encoder_inputs, decoder_inputs, target_weights = model.get_batch([data_set], bucket) _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket, False) step_time += (time.time() - start_time) / steps_per_checkpoint loss += step_loss / steps_per_checkpoint current_step += 1 if current_step % steps_per_checkpoint == 0: perplexity = math.exp(float(loss)) if loss < 300 else float("inf") print ("global step {} learning rate {} step-time {} perplexity {}".format( model.global_step.eval(), model.learning_rate.eval(), step_time, perplexity)) sess.run(model.learning_rate_decay_op) model.saver.save(sess, model_checkpoints, global_step=model.global_step) step_time, loss = 0.0, 0.0 encoder_inputs, decoder_inputs, target_weights = model.get_batch([data_set], bucket) _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket, True) eval_ppx = math.exp(float(eval_loss)) if eval_loss < 300 else float("inf") print(" eval: perplexity {}".format(eval_ppx)) sys.stdout.flush() The function starts by creating the model. 
Also, it sets some constants on the steps per checkpoints and the maximum number of steps. Specifically, in the code, we will save a model every 100 steps and we will perform no more than 20,000 steps. If it still takes too long, feel free to kill the program: every checkpoint contains a trained model, and the decoder will use the most updated one. At this point, we enter the while loop. For each step, we ask the model to get a minibatch of data (of size 64, as set previously). The method get_batch returns the inputs (that is, the source sequence), the outputs (that is, the destination sequence), and the weights of the model. With the method step, we run one step of the training. One piece of information returned is the loss for the current minibatch of data. That's all the training! To report the performance and store the model every 100 steps, we print the average perplexity of the model (the lower, the better) on the 100 previous steps, and we save the checkpoint. The perplexity is a metric connected to the uncertainty of the predictions: the more confident we're about the tokens, the lower will be the perplexity of the output sentence. Also, we reset the counters and we extract the same metric from a single minibatch of the test set (in this case, it's a random minibatch of the dataset), and performances of it are printed too. Then, the training process restarts again. As an improvement, every 100 steps we also reduce the learning rate by a factor. In this case, we multiply it by 0.99. This helps the convergence and the stability of the training. We now have to connect all the functions together. In order to create a script that can be called by the command line but is also used by other scripts to import functions, we can create a main, as follows: if __name__ == "__main__": _, data_set, max_sentence_lengths, dict_lengths = build_dataset(False) cleanup_checkpoints(model_dir, model_checkpoints) train() In the console, you can now train your machine translator system with a very simple command: $> python train_translator.py On an average laptop, without an NVIDIA GPU, it takes more than a day to reach a perplexity below 10 (12+ hours). This is the output: Retrieving corpora: alignment-de-en.txt [sentences_to_indexes] Did not find 1097 words [sentences_to_indexes] Did not find 0 words Created model with fresh parameters. global step 100 learning rate 0.5 step-time 4.3573073434829713 perplexity 526.6638556683066 eval: perplexity 159.2240770935855 [...] global step 10500 learning rate 0.180419921875 step-time 4.35106209993362414 perplexity 2.0458043055629487 eval: perplexity 1.8646006006241982 [...] In this article, we've seen how to create a machine translation system based on an RNN. We've seen how to organize the corpus, and how to train it. To know more about how to test and translate the model, do checkout this book TensorFlow Deep Learning Projects. Google’s translation tool is now offline – and more powerful than ever thanks to AI Anatomy of an automated machine learning algorithm (AutoML) FAE (Fast Adaptation Engine): iOlite’s tool to write Smart Contracts using machine translation
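A closing note on the metric reported during training: the perplexity printed every 100 steps is just the exponential of the average cross-entropy loss, capped to avoid floating-point overflow for very large losses. The tiny helper below makes that relationship explicit; the 300 threshold mirrors the one used in the training loop, and the sample values are only illustrative:

import math

def safe_perplexity(avg_loss, cap=300.0):
    # perplexity = exp(cross-entropy); above the cap the exponential would overflow
    # a float, so we report infinity instead, exactly as the training loop does.
    return math.exp(float(avg_loss)) if avg_loss < cap else float("inf")

print(safe_perplexity(0.71))   # ~2.03, a fairly confident model
print(safe_perplexity(6.27))   # ~528, close to the perplexity seen at step 100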

article-image-create-conversational-assistant-chatbot-using-python
Savia Lobo
21 Feb 2018
5 min read
Save for later

How to create a conversational assistant or chatbot using Python

Savia Lobo
21 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Natural Language Processing with Python Cookbook written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti. This book includes unique recipes to teach various aspects of performing Natural Language Processing with NLTK—the leading Python platform for the task.[/box] Today we will learn to create a conversational assistant or chatbot using Python programming language. Conversational assistants or chatbots are not very new. One of the foremost of this kind is ELIZA, which was created in the early 1960s and is worth exploring. In order to successfully build a conversational engine, it should take care of the following things: 1. Understand the target audience 2. Understand the natural language in which communication happens.  3. Understand the intent of the user 4. Come up with responses that can answer the user and give further clues NLTK has a module, nltk.chat, which simplifies building these engines by providing a generic framework. Let's see the available engines in NLTK: Engines Modules Eliza nltk.chat.eliza Python module Iesha nltk.chat.iesha Python module Rude nltk.chat.rudep ython module Suntsu Suntsu nltk.chat.suntsu module Zen nltk.chat.zen module In order to interact with these engines we can just load these modules in our Python program and invoke the demo() function. This recipe will show us how to use built-in engines and also write our own simple conversational engine using the framework provided by the nltk.chat module. Getting ready You should have Python installed, along with the nltk library. Having an understanding of regular expressions also helps. How to do it...    Open atom editor (or your favorite programming editor).    Create a new file called Conversational.py.    Type the following source code:    Save the file.    Run the program using the Python interpreter.    You will see the following output: How it works... Let's try to understand what we are trying to achieve here. import nltk This instruction imports the nltk library into the current program. def builtinEngines(whichOne): This instruction defines a new function called builtinEngines that takes a string parameter, whichOne: if whichOne == 'eliza': nltk.chat.eliza.demo() elif whichOne == 'iesha': nltk.chat.iesha.demo() elif whichOne == 'rude': nltk.chat.rude.demo() elif whichOne == 'suntsu': nltk.chat.suntsu.demo() elif whichOne == 'zen': nltk.chat.zen.demo() else: print("unknown built-in chat engine {}".format(whichOne)) These if, elif, else instructions are typical branching instructions that decide which chat engine's demo() function is to be invoked depending on the argument that is present in the whichOne variable. When the user passes an unknown engine name, it displays a message to the user that it's not aware of this engine. It's a good practice to handle all known and unknown cases also; it makes our programs more robust in handling unknown situations def myEngine():. This instruction defines a new function called myEngine(); this function does not take any parameters. chatpairs = ( (r"(.*?)Stock price(.*)", ("Today stock price is 100", "I am unable to find out the stock price.")), (r"(.*?)not well(.*)", ("Oh, take care. May be you should visit a doctor", "Did you take some medicine ?")), (r"(.*?)raining(.*)", ("Its monsoon season, what more do you expect ?", "Yes, its good for farmers")), (r"How(.*?)health(.*)", ("I am always healthy.", "I am a program, super healthy!")), (r".*", ("I am good. 
How are you today ?", "What brings you here ?")) ) This is a single instruction where we are defining a nested tuple data structure and assigning it to chatpairs. Let's pay close attention to the data structure: We are defining a tuple of tuples Each subtuple consists of two elements: The first member is a regular expression (this is the user's question in regex format) The second member of the tuple is another tuple (these are the candidate answers) def chat(): print("!"*80) print(" >> my Engine << ") print("Talk to the program using normal english") print("="*80) print("Enter 'quit' when done") chatbot = nltk.chat.util.Chat(chatpairs, nltk.chat.util.reflections) chatbot.converse() We are defining a subfunction called chat() inside the myEngine() function. This is permitted in Python. This chat() function displays some information to the user on the screen and instantiates the built-in nltk.chat.util.Chat() class with the chatpairs variable. It passes nltk.chat.util.reflections as the second argument. Finally we call the converse() function on the chatbot object that's created from the Chat() class. chat() This instruction calls the chat() function, which shows a prompt on the screen and accepts the user's requests. It shows responses according to the regular expressions that we built before: if __name__ == '__main__': for engine in ['eliza', 'iesha', 'rude', 'suntsu', 'zen']: print("=== demo of {} ===".format(engine)) builtinEngines(engine) print() myEngine() These instructions will be called when the program is invoked as a standalone program (not using import). They do these two things: Invoke the built-in engines one after another (so that we can experience them) Once all five built-in engines have been exercised, they call our myEngine(), where our custom engine comes into play We have learned to create a chatbot of our own using one of the easiest programming languages, Python. To know more about how to efficiently use NLTK and implement text classification, identify parts of speech, tag words, etc., check out Natural Language Processing with Python Cookbook.
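The original listing for Conversational.py is not reproduced in this excerpt, so here is a minimal reconstruction assembled from the walkthrough above. Treat it as a sketch: the structure and the names (builtinEngines, myEngine, chatpairs) follow the explanation, but the exact wording of the prompts in the original book may differ.

import nltk

def builtinEngines(whichOne):
    # Dispatch to one of the demo chat engines that ship with NLTK.
    if whichOne == 'eliza':
        nltk.chat.eliza.demo()
    elif whichOne == 'iesha':
        nltk.chat.iesha.demo()
    elif whichOne == 'rude':
        nltk.chat.rude.demo()
    elif whichOne == 'suntsu':
        nltk.chat.suntsu.demo()
    elif whichOne == 'zen':
        nltk.chat.zen.demo()
    else:
        print("unknown built-in chat engine {}".format(whichOne))

def myEngine():
    # (regex question, tuple of candidate answers) pairs for our own engine.
    chatpairs = (
        (r"(.*?)Stock price(.*)",
         ("Today stock price is 100", "I am unable to find out the stock price.")),
        (r"(.*?)not well(.*)",
         ("Oh, take care. May be you should visit a doctor", "Did you take some medicine ?")),
        (r"(.*?)raining(.*)",
         ("Its monsoon season, what more do you expect ?", "Yes, its good for farmers")),
        (r"How(.*?)health(.*)",
         ("I am always healthy.", "I am a program, super healthy!")),
        (r".*",
         ("I am good. How are you today ?", "What brings you here ?")),
    )

    def chat():
        print("!" * 80)
        print(" >> my Engine << ")
        print("Talk to the program using normal english")
        print("=" * 80)
        print("Enter 'quit' when done")
        chatbot = nltk.chat.util.Chat(chatpairs, nltk.chat.util.reflections)
        chatbot.converse()

    chat()

if __name__ == '__main__':
    for engine in ['eliza', 'iesha', 'rude', 'suntsu', 'zen']:
        print("=== demo of {} ===".format(engine))
        builtinEngines(engine)
        print()
    myEngine()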
article-image-crud-create-read-update-delete-operations-elasticsearch
Pravin Dhandre
19 Feb 2018
5 min read
Save for later

CRUD (Create Read, Update and Delete) Operations with Elasticsearch

Pravin Dhandre
19 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book is for beginners who want to start performing distributed search analytics and visualization using core functionalities of Elasticsearch, Kibana and Logstash.[/box] In this tutorial, we will look at how to perform basic CRUD operations using Elasticsearch. Elasticsearch has a very well designed REST API, and the CRUD operations are targeted at documents. To understand how to perform CRUD operations, we will cover the following APIs. These APIs fall under the category of Document APIs that deal with documents: Index API Get API Update API Delete API Index API In Elasticsearch terminology, adding (or creating) a document into a type within an index of Elasticsearch is called an indexing operation. Essentially, it involves adding the document to the index by parsing all fields within the document and building the inverted index. This is why this operation is known as an indexing operation. There are two ways we can index a document: Indexing a document by providing an ID Indexing a document without providing an ID Indexing a document by providing an ID We have already seen this version of the indexing operation. The user can provide the ID of the document using the PUT method. The format of this request is PUT /<index>/<type>/<id>, with the JSON document as the body of the request: PUT /catalog/product/1 { "sku": "SP000001", "title": "Elasticsearch for Hadoop", "description": "Elasticsearch for Hadoop", "author": "Vishal Shukla", "ISBN": "1785288997", "price": 26.99 } Indexing a document without providing an ID If you don't want to control the ID generation for the documents, you can use the POST method. The format of this request is POST /<index>/<type>, with the JSON document as the body of the request: POST /catalog/product { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } The ID in this case will be generated by Elasticsearch. It is a hash string, as highlighted in the response: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true } As per pure REST conventions, POST is used for creating a new resource and PUT is used for updating an existing resource. Here, the usage of PUT is equivalent to saying I know the ID that I want to assign, so use this ID while indexing this document. Get API The Get API is useful for retrieving a document when you already know the ID of the document. It is essentially a get by primary key operation: GET /catalog/product/AVrASKqgaBGmnAMj1SBe The format of this request is GET /<index>/<type>/<id>. The response would be as Expected: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "found": true, "_source": { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } } Update API The Update API is useful for updating the existing document by ID. The format of an update request is POST <index>/<type>/<id>/_update with a JSON request as the body: POST /catalog/product/1/_update { "doc": { "price": "28.99" } } The properties specified under the "doc" element are merged into the existing document. 
The previous version of this document with ID 1 had price of 26.99. This update operation just updates the price and leaves the other fields of the document unchanged. This type of update means "doc" is specified and used as a partial document to merge with an existing document; there are other types of updates supported. The response of the update request is as follows: { "_index": "catalog", "_type": "product", "_id": "1", "_version": 2, "result": "updated", "_shards": { "total": 2, "successful": 1, "failed": 0 } } Internally, Elasticsearch maintains the version of each document. Whenever a document is updated, the version number is incremented. The partial update that we have seen above will work only if the document existed beforehand. If the document with the given id did not exist, Elasticsearch will return an error saying that document is missing. Let us understand how do we do an upsert operation using the Update API. The term upsert loosely means update or insert, i.e. update the document if it exists otherwise insert new document. The parameter doc_as_upsert checks if the document with the given id already exists and merges the provided doc with the existing document. If the document with the given id doesn't exist, it inserts a new document with the given document contents. The following example uses doc_as_upsert to merge into the document with id 3 or insert a new document if it doesn't exist. POST /catalog/product/3/_update { "doc": { "author": "Albert Paro", "title": "Elasticsearch 5.0 Cookbook", "description": "Elasticsearch 5.0 Cookbook Third Edition", "price": "54.99" }, "doc_as_upsert": true } We can update the value of a field based on the existing value of that field or another field in the document. The following update uses an inline script to increase the price by two for a specific product: POST /catalog/product/AVrASKqgaBGmnAMj1SBe/_update { "script": { "inline": "ctx._source.price += params.increment", "lang": "painless", "params": { "increment": 2 } } } Scripting support allows for the reading of the existing value, incrementing the value by a variable, and storing it back in a single operation. The inline script used here is Elasticsearch's own painless scripting language. The syntax for incrementing an existing variable is similar to most other programming languages. Delete API The Delete API lets you delete a document by ID:  DELETE /catalog/product/AVrASKqgaBGmnAMj1SBe  The response of the delete operations is as follows: { "found": true, "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 4, "result": "deleted", "_shards": { "total": 2, "successful": 1, "failed": 0 } } This is how basic CRUD operations are performed with Elasticsearch using simple document APIs from any data source in any format securely and reliably. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0  and start building end-to-end real-time data processing solutions for your enterprise analytics applications.
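As an addendum, all of the calls above are plain HTTP requests, so they can be driven from any client, not only Kibana's Dev Tools or curl. Here is a hedged Python sketch using the requests library against a local Elasticsearch node on port 9200; the index, type, and document bodies mirror the examples above, and the exact response fields may differ slightly between Elasticsearch versions:

import requests

BASE = "http://localhost:9200"

# Index a document with an explicit ID (PUT), mirroring the first example.
doc = {"sku": "SP000001", "title": "Elasticsearch for Hadoop", "price": 26.99}
print(requests.put(BASE + "/catalog/product/1", json=doc).json())

# Get it back by ID.
print(requests.get(BASE + "/catalog/product/1").json())

# Partial update: merge a new price into the existing document.
print(requests.post(BASE + "/catalog/product/1/_update", json={"doc": {"price": "28.99"}}).json())

# Delete it.
print(requests.delete(BASE + "/catalog/product/1").json())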

article-image-configuring-mysql-linked-server-sql-server-2008
Packt Editorial Staff
16 Aug 2009
7 min read
Save for later

Configuring a MySQL linked server on SQL Server 2008

Packt Editorial Staff
16 Aug 2009
7 min read
Linking servers provides an elegant solution when you are faced with running queries against databases on distributed servers or looking at your distributed assets on disparate databases. This article by Dr. Jay Krishnaswamy explains how to set up a MySQL linked server on SQL Server 2008 Enterprise. Configuring a linked MySQL server as well as querying a table on the MySQL linked server is described. The reader would benefit reviewing the first article on this series on MySQL Servers. Introduction MS SQL servers always provided remote access to servers through RPC mechanisms, but they had the limitation of being confined to invoking procedures on remote locations. A linked server (a virtual server) may be considered a more flexible way of achieving the same thing, with the added benefits of remote table access and distributed queries. Microsoft manages the link mechanism via OLE DB technology. Specifically, an OLE DB data source points to the specific database that can be accessed using OLEDB. In this article, we will be creating a MySQL linked server on SQL Server 2008 and querying a database [TestMove] table shown in the next listing. In reviewing the previous article it may be noticed that the Employees tables were moved to MySQL database TestMove. In running the commands from the mysql> prompt it is assumed that the MySQL Server has been started. Listing 1: employees table in TestMove mysql> show tables; +--------------------+ | Tables_in_testmove | +--------------------+ | employees | +--------------------+ 1 row in set (0.09 sec) mysql> Creating an ODBC DSN for MySQL In the previous article on MySQL Servers cited earlier, a DSN was created for moving data. Essentially the same DSN can be used. Herein follows a brief review of the DSN MySQL_Link created along the same lines as in the previously referenced article. The ODBC driver used for creating this ODBC DSN is the one installed on the machine when the MySQL Server was installed as shown. The final interactive window where you may test the connectivity is shown in the next figure. You may notice that the database Testmove has been named in the ODBC DSN. The name MySQL_LINK is the ODBC DSN. When you close the window after clicking the OK button, an ODBC DSN item will be added to the System DSN tab of the ODBC wizard as shown. Steps to create a MySQL linked server from Management Studio Right click the Linked Servers node to display a drop-down menu as shown in the next figure. Click on New Linked Server...item. This brings up the New Linked Server window as shown. The window is all empty except for a default Provider. The very first thing to do is to provide a name for this linked server. Herein it is LINKED_ MYSQL. We will be using one of the providers [Microsoft OLE DB Provider for ODBC] supported by SQL Server 2008. You may access the list of OLE DB Providers in the Providers sub-folder of the Linked Servers. Provide the other details as shown in the next figure. Make sure they are entered exactly as shown or according to how you have created the database on MySQL Server. Click on the Security list item under General to the left. This opens the 'Security' page of the New Linked Server wizard as shown. Change the login from the default "Be made without using a security context" to "Be made using this security context". Enter remote login. In this case, it is "root" for the remote login and the password is the one used during the ODBC DSN (also the password for server authentication) creation. 
Make no changes to the Server Options page. Click OK. This creates a linked server Linked_MySQL as shown expanded in the Linked Server's node as shown. You may need to right-click and refresh the Linked Servers' node to see the new linked server. As you can see in the figure, the 'User' tables are not displayed.     Running Queries and reviewing results Running system stored procedures can provide various levels of information and the database can be queried using the four-part syntax and the openquery() method. Information on the linked server It is easy to find how the linked server is configured using system stored procedure sp_linkedsrvlogin on the SQL Server 2008. Open a Query window from File | New | Query Current Connection to open the query window and type in the following statement. The next figure shows the statement as well as the returned values. SQL Server 2008 querying has the intellisense report and this must be put to good use. Exec sp_linkedsrvlogin This shows all servers both local and remote as shown in the next figure. Information about the tables on the remote server can also be accessed by running a stored procedure. Executing the stored procedure sp_tables_ex as shown in the next figure (which displays the statement and the result of executing the stored procedure) can be used to obtain table information. Querying the table on the database Data in the table on the linked server can be queried using the openquery() function. The syntax for this function shown next is very simple. openquery ('linked server', 'query') The next figure shows the result of running the openquery() function on the Linked_MySQL linked server. Although it should be possible to query the linked server using the four-part syntax as in: Select * from LINKED_MYSQL...employees The above statement returns an error. This is probably a limitation of a combination of MSDASQL and the ODBC driver which does not provide the schema information correctly(this is just the author's opinion). Are Remote Procedure Calls (RPC) allowed? The easiest way to test this is to send out a call by running the following query against the linked server. Execute('Select FirstName, LastName from employees') at Linked_MYSQL If the linked server is not configured for RPC, then the result you get by running the above query is as shown in the next figure. Turn on RPC Earlier on we skipped the Server Options page of the linked server. Back in the Management Studio right click linked server LINKED_MYSQL and from the drop-down choose to look at the properties at the bottom of the list. This brings up the LINKED_MYSQL properties window as shown. Click on Server Options. In the Server Options page change the values of RPC and RPCOUT to true, default for both being false. Now run the query that produced the error previously. The result is displayed in the next figure. You might have noted that only two columns were returned from the employees table. This was deliberate as trying to get all the column would produce an error due partly to the data types of data stored in the table and their compatibility with MSDASQL and the ODBC driver (Again, an author's opinion). Creating Linked Server using TSQL While the linked server can be created using the built-in wizard of the Management Studio, it can also be created using TSQL statements as in the following listing (run both statements, the first one creates the linked server and the second the logins). 
Listing 2: Exec master.dbo.sp_addlinkedserver @server=N'MySQlbyCode', @srvproduct=N'MySQL', @provider=N'MSDASQL', @datasrc=N'MySQL_link' Exec master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'MySQlbyCode', @useself=N'FALSE', @locallogin=NULL, @rmtuser=N'root', @rmtpassword=N'<your password>' Summary The article described the steps involved in configuring a MySQL linked server on SQL Server 2008 using the built-in New Linked Server wizard as well as TSQL. Methods to query the linked server, as well as enabling RPC, were also described. If you have read this article you may be interested in the following: MySQL Data Transfer using SQL Server Integration Services (SSIS) Transferring Data from MS Access 2003 to SQL Server 2008 Exporting data from MS Access 2003 to MySQL
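As a final aside, once the linked server exists, the same pass-through query can be issued from any client application rather than only from Management Studio. Below is a hedged Python sketch using the pyodbc module; the ODBC driver name, server, and credentials are placeholders you would need to adapt to your environment:

import pyodbc

# Placeholder connection string: adjust the driver, server, and credentials.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=localhost;DATABASE=master;UID=sa;PWD=<your password>"
)
cursor = conn.cursor()

# Run the pass-through query against the MySQL linked server.
cursor.execute("SELECT * FROM OPENQUERY(LINKED_MYSQL, 'SELECT FirstName, LastName FROM employees')")
for row in cursor.fetchall():
    print(row.FirstName, row.LastName)

conn.close()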

article-image-creating-external-tables-oracle-10g11g-database
Packt Editorial Staff
07 Jun 2009
16 min read
Save for later

Creating External Tables in your Oracle 10g/11g Database

Packt Editorial Staff
07 Jun 2009
16 min read
In this two-part article by Hector R. Madrid, we will learn about the External Tables in Oracle 10g/11g Database. When working in data warehouse environments, the Extraction—Transformation—Loading (ETL) cycle frequently requires the user to load information from external sources in plain file format, or perform data transfers among Oracle database in a proprietary format. This requires the user to create control files to perform the load. As the format of the source data regularly doesn't fit the one required by the Data Warehouse, a common practice is to create stage tables that load data into the database and create several queries that perform the transformation from this point on, to take the data to its final destination. A better approach would be to perform this transformation 'on the fly' at load time. That is what External Tables are for. They are basically external files, managed either by means of the SQL*Loader or the data pump driver, which from the database perspective can be seen as if they were regular read-only tables. This format allows the user to think about the data source as if the data was already loaded into its stage table. This lets the user concentrate on the queries to perform the transformation, thus saving precious time during the load phase. The basics of an External Tables in Oracle10g/11g An External Table is basically a file that resides on the server side, as a regular flat file or as a data pump formatted file. The External Table is not a table itself; it is an external file with an Oracle format and its physical location. This feature first appeared back in Oracle 9i Release 1 and it was intended as a way of enhancing the ETL process by reading an external flat file as if it was a regular Oracle table. On its initial release it was only possible to create read-only External Tables, but, starting with 10g—it is possible to unload data to External Tables too. In previous 10g Releases, there was only the SQL*Loader driver could be used to read the External Table, but from 10g onwards it is now possible to load the table by means of the data pump driver. The kind of driver that will be used to read the External Table is defined at creation time. In the case of ORACLE_LOADER it is the same driver used by SQL*Loader. The flat files are loaded in the same way that a flat file is loaded to the database by means of the SQL*Loader utility, and the creation script can be created based on a SQL*Loader control file. In fact, most of the keywords that are valid for data loading are also valid to read an external flat file table. The main differences between SQL*Loader and External Tables are: When there are several input datafiles SQL*Loader will generate a bad file and a discard file for each datafile. The CONTINUEIF and CONCATENATE keywords are not supported by External Tables. The GRAPHIC, GRAPHIC EXTERNAL, and VARGRAPHIC are not supported for External Tables. LONG, nested tables, VARRAY, REF, primary key REF, and SID are not supported. For fields in External Tables the character set, decimal separator, date mask and other locale settings are determined by the database NLS settings. The use of the backslash character is allowed for SQL*Loader, but for External Tables this would raise an error. 
External Tables must use quotation marks instead.For example: SQL*Loader FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY "" External Tables TERMINATED BY ',' ENCLOSED BY "'" A second driver is available, the ORACLE_DATAPUMP access driver, which uses the Data Pump technology to read the table and unload data to an External Table. This driver allows the user to perform a logical backup that can later be read back to the database without actually loading the data. The ORACLE_DATAPUMP access driver utilizes a proprietary binary format for the external file, so it is not possible to view it as a flat file. Let's setup the environment Let's create the demonstration user, and prepare its environment to create an External Table. The example that will be developed first refers to the External Table using the ORACLE_LOADER driver. create user EXTTABDEMO identified by ORACLE default tablespace USERS; alter user exttabdemo quota unlimited on users; grant CREATE SESSION, CREATE TABLE, CREATE PROCEDURE, CREATE MATERIALIZED VIEW, ALTER SESSION, CREATE VIEW, CREATE ANY DIRECTORY to EXTTABDEMO; A simple formatted spool from this query will generate the required external table demonstration data. The original source table is the demonstration HR.EMPLOYEES table. select EMPLOYEE_ID ||','|| DEPARTMENT_ID ||','|| FIRST_NAME ||','|| LAST_NAME ||','|| PHONE_NUMBER ||','|| HIRE_DATE ||','|| JOB_ID ||','|| SALARY ||','|| COMMISSION_PCT ||','|| MANAGER_ID ||','|| EMAIL from HR.EMPLOYEES order by EMPLOYEE_ID The above query will produce the following sample data: The External Table directory is defined inside the database by means of a DIRECTORY object. This object is not validated at creation time, so the user must make sure the physical directory exists and the oracle OS user has read/write privileges on it. $ mkdir $HOME/external_table_dest SQL> CREATE DIRECTORY EXTTABDIR AS '/home/oracle/external_table_dest'; The above example was developed in a Linux environment, on Windows platforms the paths will need to be changed to correctly reflect how Oracle has been set up. Now, the first External Table can be created. A basic External Table Here is the source code of the External Table creation. The create table command syntax is just like any other regular table creation (A), (B), up to the point where the ORGANIZATION EXTERNAL (C) keyword appears, this is the point where the actual External Table definition starts. In this case the External Table is accessed by the ORACLE_LOADER driver (D). Next, the external flat file is defined, and here it is declared the Oracle DIRECTORY (E) where the flat file resides. The ACCESS PARAMETERS(F) section specifies how to access the flat file and it declares whether the file is a fixed or variable size record, and which other SQL*Loader loading options are declared. The LOCATION (H) keyword defines the name of the external data file. It must be pointed out that as this is an External Table managed by the SQL_LOADER driver the ACCESS_PARAMETERS section must be defined, in the case of External Tables based on the DATAPUMP_DRIVER this section is not required. The columns are defined only by name (G), not by type. This is permitted from the SQL*Loader perspective, and allows for dynamic column definition. This column schema definition is more flexible, but it has a drawback—data formats such as those in DATEcolumns must match the database date format, otherwise the row will be rejected. There are ways to define the date format working around this requirement. 
Assuming the date column changes from its original default format mask "DD-MON-RR" to "DD-MM-RR", then the column definition must change from a simple CHAR column to a DATE column with a format mask. Original format: "HIRE_DATE" CHAR(255) Changed format: "HIRE_DATE" DATE "DD-MM-RR" When working with an External Table, the access parameters are not validated at creation time, so if there are malformed rows, or if there are improperly defined access parameters, an error is shown, similar to the one below. ERROR at line 1: ORA-29913: error in executing ODCIEXTTABLEFETCH callout ORA-30653: reject limit reached ORA-06512: at "SYS.ORACLE_LOADER", line 52 Once the data is created and all required OS privileges have been properly validated, the data can be seen from inside the database, just as if it were a regular Oracle table. This table is read only, so if the user attempts to perform any DML operation against it, it will result in this error: SQL> delete ext_employees; delete ext_employees * ERROR at line 1: ORA-30657: operation not supported on external organized table As the error message clearly states, this kind of table is only useful for read-only operations. This kind of table doesn't support most of the operations available for regular tables, such as index creation and statistics gathering, and these types of operations will cause an ORA-30657 error too. The only access method available for External Tables is Full Table Scan, so there is no way to perform a selective data retrieval operation. External Tables cannot be recovered; they are just metadata definitions stored in the dictionary tables. The actual data resides in external files, and there is no way to protect them with the regular backup database routines, so it is the user's sole responsibility to provide proper backup and data management procedures. At the database level the only kind of protection the External Table receives is at the metadata level, as it is an object stored as a definition at the database dictionary level. As the data resides in the external data file, if by any means it were to be corrupted, altered, or somehow modified, there would be no way to get back the original data. If the external data file is lost, then this may go unnoticed until the next SELECT operation takes place. The metadata for an External Table is recorded in the {USER | ALL | DBA}_TABLES views, and as this table doesn't actually require physical database storage, all storage-related columns appear as null, as well as the columns that relate to the statistical information. The table is further described by the {USER | ALL | DBA}_EXTERNAL_TABLES views, where information such as the kind of access driver, the reject_limit, and the access_parameters, amongst others, is shown. SQL> DESC USER_EXTERNAL_TABLES Name Null? Type ------------------------------- -------- -------------- TABLE_NAME NOT NULL VARCHAR2(30) TYPE_OWNER CHAR(3) TYPE_NAME NOT NULL VARCHAR2(30) DEFAULT_DIRECTORY_OWNER CHAR(3) DEFAULT_DIRECTORY_NAME NOT NULL VARCHAR2(30) REJECT_LIMIT VARCHAR2(40) ACCESS_TYPE VARCHAR2(7) ACCESS_PARAMETERS VARCHAR2(4000) PROPERTY VARCHAR2(10) This is the first basic External Table, and as previously shown, its creation is pretty simple.
It allows external data to be easily accessed from inside the database, allowing the user to see the external data just as if it was already loaded inside a regular stage table. Creating External Table metadata, the easy way To further illustrate the tight relationship between SQL*Loader and External Tables, the SQL*Loader tool may be used to generate a script that creates an External Table according to a pre-existing control file. SQL*Loader has a command line option named EXTERNAL_TABLE, this can hold one of three different parameters {NOT_USED | GENERATE_ONLY | EXECUTE}. If nothing is set, it defaults to the NOT_USED option. This keyword is used to generate the script to create an External Table, and the options mean: NOT_USED: This is the default option, and it means that no External Tables are to be used in this load. GENERATE_ONLY: If this option is specified, then SQL*Loader will only read the definitions from the control file and generate the required commands, so the user can record them for later execution, or tailor them to fit his/her particular needs. EXECUTE: This not only generates the External Table scripts, but also executes them. If the user requires a sequence, then the EXECUTE option will not only create the table, but it will also create the required sequence, deleting it once the data load is finished. This option performs the data load process against the specified target regular by means of an External Table, it creates both the directory and the External Table, and inserts the data using a SELECT AS INSERTwith the APPEND hint. Let's use the GENERATE_ONLY option to generate the External Table creation scripts: $ sqlldr exttabdemo/oracle employees external_table=GENERATE_ONLY By default the log file is located in a file whose extension is .log and its name equals that of the control file. 
By opening it we see, among the whole log processing information, this set of DDL commands: CREATE TABLE "SYS_SQLLDR_X_EXT_EMPLOYEES" ( "EMPLOYEE_ID" NUMBER(6), "FIRST_NAME" VARCHAR2(20), "LAST_NAME" VARCHAR2(25), "EMAIL" VARCHAR2(25), "PHONE_NUMBER" VARCHAR2(20), "HIRE_DATE" DATE, "JOB_ID" VARCHAR2(10), "SALARY" NUMBER(8,2), "COMMISSION_PCT" NUMBER(2,2), "MANAGER_ID" NUMBER(6), "DEPARTMENT_ID" NUMBER(4) ) ORGANIZATION external ( TYPE oracle_loader DEFAULT DIRECTORY EXTTABDIR ACCESS PARAMETERS ( RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII BADFILE 'EXTTABDIR':'employees.bad' LOGFILE 'employees.log_xt' READSIZE 1048576 FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' LDRTRIM REJECT ROWS WITH ALL NULL FIELDS ( "EMPLOYEE_ID" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"', "FIRST_NAME" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"', "LAST_NAME" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"', "EMAIL" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"', "PHONE_NUMBER" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"', "HIRE_DATE" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"', "JOB_ID" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"', "SALARY" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"', "COMMISSION_PCT" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"', "MANAGER_ID" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"', "DEPARTMENT_ID" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' ) ) location ( 'employees.txt' ) ) The more complete version is shown, some differences with the basic script are: All the column definitions are set to CHAR(255) with the delimiter character defined for each column If the current working directory is already registered as a regular DIRECTORY at the database level, SQL*Loader utilizes it, otherwise, it creates a new directory definition The script specifies where the bad files and log file are located It specifies that an all-null column record is rejected The more complete version is shown, some differences with the basic script are: All the column definitions are set to CHAR(255) with the delimiter character defined for each column If the current working directory is already registered as a regular DIRECTORY at the database level, SQL*Loader utilizes it, otherwise, it creates a new directory definition The script specifies where the bad files and log file are located It specifies that an all-null column record is rejected In the case of the EXECUTE keyword, the log file shows that not only are the scripts used to create the External Table, but also to execute the INSERT statement with the /*+ append */hint. The load is performed in direct path mode. All External Tables, when accessed, generate a log file. In the case of the ORACLE_LOADERdriver, this file is similar to the file generated by SQL*Loader. It has a different format in the case of ORACLE_DATAPUMP driver. The log file is generated in the same location where the external file resides, and its format is as follows: <EXTERNAL_TABLE_NAME>_<OraclePID>.log When an ORACLE_LOADER managed External Table has errors, it dumps the 'bad' rows to the *.bad file, just the same as if this was loaded by SQL*Loader. The ORACLE_DATAPUMP External Table generates a simpler log file, it only contains the time stamp when the External Table was accessed, and it creates a log file for each oracle process accessing the External Table. 
Unloading data to External Tables The driver used to unload data to an External Table is the ORACLE_DATAPUMP access driver. It dumps the contents of a table in a binary proprietary format file. This way you can exchange data with other 10g and higher databases in a preformatted way to meet the other database's requirements. Unloading data to an External Table doesn't make it updateable, the tables are still limited to being read only. Let's unload the EMPLOYEES table to an External Table: create table dp_employees organization external( type oracle_datapump default directory EXTTABDIR location ('dp_employees.dmp') ) as select * from employees; This creates a table named DP_EMPLOYEES, located at the specified EXTTABDIR directory and with a defined OS file name. In the next example, at a different database a new DP_EMPLOYEES table is created, this table uses the already unloaded data by the first database. This DP_EMPLOYEES External Table is created on the 11g database side. create table dp_employees( EMPLOYEE_ID NUMBER(6), FIRST_NAME VARCHAR2(20), LAST_NAME VARCHAR2(25), EMAIL VARCHAR2(25), PHONE_NUMBER VARCHAR2(20), HIRE_DATE DATE, JOB_ID VARCHAR2(10), SALARY NUMBER(8,2), COMMISSION_PCT NUMBER(2,2), MANAGER_ID NUMBER(6), DEPARTMENT_ID NUMBER(4) ) organization external ( type oracle_datapump default directory EXTTABDIR location ('dp_employees.dmp') ); This table can already read in the unloaded data from the first database. The second database is a regular 11g database. So this shows the inter-version upward compatibility between a 10g and an 11g database. SQL> select count(*) from dp_employees; COUNT(*) ---------- 107 Inter-version compatibility In, the previous example a 10g data pump generated an External Table that was transparently read by the 11g release. Let's create an 11g data pump External Table named DP_DEPARTMENTS: create table dp_departments organization external( type oracle_datapump default directory EXTTABDIR access parameters ( version '10.2.0' ) location ('dp_departments.dmp') ) as select * from departments Table created. SQL> select count(*) from dp_departments; COUNT(*) ---------- 27 In the previous example, it is important to point out that the VERSION keyword defines the compatibility format. access parameters ( version '10.2.0' ) If this clause is not specified then an incompatibility error will be displayed. SQL> select count(*) from dp_departments; select count(*) from dp_departments * ERROR at line 1: ORA-29913: error in executing ODCIEXTTABLEOPEN callout ORA-39142: incompatible version number 2.1 in dump file "/home/oracle/external_table_dest/dp_departments.dmp" ORA-06512: at "SYS.ORACLE_DATAPUMP", line 19 Now let's use the 10g version to read from it. SQL> select count(*) from dp_departments; COUNT(*) ---------- 27 The VERSION clause is interpreted the same way as the VERSION clause for the data pump export, it has three different values: COMPATIBLE: This states that the version of the metadata corresponds to the database compatibility level. LATEST: This corresponds to the database version. VERSION NUMBER: This refers to a specific oracle version that the file is compatible with. This value cannot be lower than 9.2.0 Summary As we can see, External Tables can serve not only as improvements to the ETL process, but also as a means to manage database environments, and a means of reducing the complexity level of data management from the user's point of view. 
In the next part, we will see how External Tables can be used for data transformation and the enhancements Oracle 11g has brought about in External Tables.

article-image-how-to-prevent-errors-while-using-utilities-for-loading-data-in-teradata
Pravin Dhandre
11 Jun 2018
9 min read
Save for later

How to prevent errors while using utilities for loading data in Teradata

In today's tutorial, we will help you overcome the errors that arise while loading, deleting, or updating large volumes of data using Teradata utilities. This article is an excerpt from Teradata Cookbook, co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati. The book provides recipes to simplify the daily tasks performed by database administrators (DBAs), along with efficient data warehousing solutions for the Teradata database system.

Resolving FastLoad error 2652
When data is being loaded via FastLoad, a table lock is placed on the target table. This means that the table is unavailable for any other operation. A lock on a table is only released when FastLoad encounters the END LOADING command, which terminates phase 2, the so-called application phase. FastLoad may get terminated prematurely due to any of the following reasons:
Load script results in failure (error code 8 or 12)
Load script is aborted by the admin or some other session
FastLoad fails due to a bad record or file
Forgetting to add the END LOADING statement in the script
If so, it keeps a lock on the table, which needs to be released manually. In this recipe, we will see the steps to release FastLoad locks.

Getting ready
Identify the table on which FastLoad ended prematurely and which is left in a locked state. You need valid credentials for the Teradata database. Execute the dummy FastLoad script from the same user, or from a user which has write access to the locked table. A user requires the following privileges/rights in order to execute FastLoad:
SELECT and INSERT (plus CREATE and DROP, or DELETE) access to the target or loading table
CREATE and DROP TABLE on the error tables
SELECT, INSERT, UPDATE, and DELETE privileges for the user PUBLIC on the restart log table (SYSADMIN.FASTLOG). There will be a row in the FASTLOG table for each FastLoad job that has not completed in the system.

How to do it...
Open a text editor and create the following script:

.LOGON 127.0.0.1/dbc, dbc;       /* valid system name and credentials for your system */
.DATABASE Database_Name;         /* database under which the locked table resides */
BEGIN LOADING locked_table       /* table which is getting the 2652 error */
  ERRORFILES errortable_name, uv_tablename;  /* same error table names as in the original script */
.END LOADING;                    /* end phase 2 and release the lock */
.LOGOFF;

Save it as dummy_fl.txt. Open the Windows Command Prompt and execute the script using the fastload command. This dummy script, with no INSERT statement, should release the lock on the target table. Execute a SELECT on the locked table to verify that the lock has been released.

How it works...
As FastLoad is designed to work only on empty tables, it becomes necessary that the loading of the table finishes in one go. If the load script errors out prematurely, without encountering the END LOADING command, it leaves a lock on the loading table. FastLoad locks can't be released via the HUT utility, as there is no technical lock on the table.
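A quick, hedged way to check for incomplete FastLoad jobs (each one has a row in the restart log table mentioned above) and to confirm that the lock is gone is shown below; Database_Name and locked_table are the same placeholders used in the dummy script:

/* List FastLoad jobs that have not completed; each holds a row in the restart log table */
SELECT * FROM sysadmin.fastlog;

/* A plain SELECT against the target table succeeds only once the 2652 lock is released */
SELECT COUNT(*) FROM Database_Name.locked_table;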
To execute FastLoad, the following are some requirements:
Log table: FastLoad puts its progress information in the FASTLOG table.
Empty table: FastLoad needs the table to be empty before inserting rows into it.
Two error tables: FastLoad requires two error tables to be created; you just need to name them, and no DDL is required.
The first error table records any translation or constraint violation errors, whereas the second error table captures errors related to duplicate values for Unique Primary Indexes (UPI). After the completion of FastLoad, you can analyze these error tables to see why records were rejected.

There's more...
If this does not fix the issue, you need to drop the target table and the error tables associated with it. Before proceeding with dropping tables, check with the administrator to abort any FastLoad sessions associated with this table.

Resolving MLOAD error 2571
MLOAD works in five phases, unlike FastLoad, which works in only two. MLOAD can fail in either phase three or phase four. The five phases of MLOAD are as follows:
Preliminary: Basic setup. Syntax checking, establishing sessions with the Teradata database, creation of error tables (two error tables per target table), and the creation of work tables and the log table are done in this phase.
DML transaction phase: The request is parsed by the PE and a step plan is generated. Steps and DML are then sent to the AMPs and stored in the appropriate work tables for each target table. The input data sent later will be stored in these work tables and applied to the target table afterwards.
Acquisition phase: Unsorted data is sent to the AMPs in 64K blocks. Rows are hashed by the primary index and sent to the appropriate AMPs. The utility places locks on the target tables in preparation for the application phase.
Application phase: Changes are applied to the target tables and NUSI subtables. The lock on the table is held throughout this phase.
Cleanup phase: If the error code of all the steps is 0, MLOAD completes successfully and releases all the locks on the specified table. In that case, all empty error tables, work tables, and the log table are dropped.

Getting ready
Identify the table which is affected by error 2571. Make sure no host utility is running on this table and that the load job is in a failed state for this table.

How to do it...
Check Viewpoint for any active utility job on this table. If you find an active job, let it complete. If there is a reason you need to release the lock, first abort all the sessions of the host utility from Viewpoint. Ask your administrator to do it. Execute the following command:

RELEASE MLOAD <databasename.tablename>;

If you get a "Not able to release MLOAD lock" error, execute the following command:

/* Release lock in application phase */
RELEASE MLOAD <databasename.tablename> IN APPLY;

Once the locks are released, you need to drop all the associated error tables, the log table, and the work tables. Re-execute MLOAD after correcting the error.

How it works...
The MLOAD utility places a lock in the table headers to alert other utilities that a MultiLoad is in session for this table. The locks include:
Acquisition lock: DML allows all; DDL allows DROP only
Application lock: DML allows SELECT with ACCESS only; DDL allows DROP only

There's more...
If the release lock statement still gives an error and does not release the lock on the table, you need to use SELECT with an ACCESS lock to copy the contents of the locked table to a new one, and then drop the locked tables. If you start receiving the error 7446 "MLoad table %ID cannot be released because NUSI exists", you need to drop all the NUSIs on the table and use ALTER TABLE to change the table to no fallback to accomplish the task.
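As noted in the steps above, once the locks are released the leftover utility tables from the failed job have to be dropped before it can be rerun. A hedged sketch follows; the WT_/ET_/UV_ prefixes and the log table name are illustrative defaults, so substitute the names actually used in your MLOAD script:

/* Names are illustrative; use the work, error, and log table names from your MLOAD script */
DROP TABLE databasename.WT_targettable;   /* work table                   */
DROP TABLE databasename.ET_targettable;   /* acquisition error table      */
DROP TABLE databasename.UV_targettable;   /* application (UV) error table */
DROP TABLE databasename.mload_logtable;   /* restart log table named in .LOGTABLE */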
Resolving failure 7547
This error is associated with the UPDATE statement, which could be plain SQL or part of an MLOAD job. Various times, while updating a set of rows in a table, the update fails with Failure 7547: Target row updated by multiple source rows. This error happens when you update the target with multiple rows from the source, which means there are duplicate values present in the source table.

Getting ready
Let's create sample volatile tables and insert values into them. After that, we will execute an UPDATE command which fails with error 7547.

Create a target table with the following DDL and insert values into it (column data types are assumed for the example):

/* TARGET TABLE */
create volatile table accounts
(
  cust_id   integer,
  cust_name varchar(20),
  sal       integer
) primary index (cust_id)
on commit preserve rows;

insert into accounts values (1,'will',2000);
insert into accounts values (2,'bekky',2800);
insert into accounts values (3,'himesh',4000);

Create a source table with the following DDL and insert values into it:

/* SOURCE TABLE */
create volatile table hr_payhike
(
  cust_id   integer,
  cust_name varchar(20),
  sal_hike  integer
) primary index (cust_id)
on commit preserve rows;

insert into hr_payhike values (1,'will',2030);
insert into hr_payhike values (1,'bekky',3800);
insert into hr_payhike values (3,'himesh',7000);

Execute the MLOAD script. The following snippet shows only the update part of the script (which will fail):

/* Snippet from MLOAD update */
UPDATE acc
FROM accounts acc, hr_payhike supd
SET sal = supd.sal_hike
WHERE acc.cust_id = supd.cust_id;

Failure: Target row updated by multiple source rows

How to do it...
Check for duplicate values in the source table using the following query:

/* Check for duplicate values in the source table */
SELECT cust_id, count(*)
FROM hr_payhike
GROUP BY 1
ORDER BY 2 DESC;

The output shows that CUST_ID = 1 has two rows, which is what causes the error. While updating the target table, the optimizer cannot determine from which source row it should update the target row: whose salary should be updated, Will's or Bekky's? To resolve the error, execute the following update query:

/* Update part of MLOAD */
UPDATE acc
FROM accounts acc,
(
  SELECT cust_id, cust_name, sal_hike
  FROM hr_payhike
  QUALIFY ROW_NUMBER() OVER (PARTITION BY cust_id
                             ORDER BY cust_name, sal_hike DESC) = 1
) supd
SET sal = supd.sal_hike
WHERE acc.cust_id = supd.cust_id;

Now the update will run without error.

How it works...
The failure happens when you update the target with multiple rows from the source. If you defined a primary index column for your target, and those columns are in the update query's join condition, this error can occur. To further resolve this, you can delete the duplicates from the source table itself and execute the original update without any modification. But if the source data can't be changed, then you need to change the update statement.

To summarize, we have successfully learned how to overcome, or prevent, errors while using utilities for loading data into the database. You can also check out the Teradata Cookbook for more than 100 recipes on enterprise data warehousing solutions.
article-image-sql-server-analysis-services-administering-and-monitoring-analysis-services
Packt
20 Dec 2013
19 min read
Save for later

SQL Server Analysis Services – Administering and Monitoring Analysis Services

(For more resources related to this topic, see here.) If your environment has only one or a handful of SSAS instances, they can be managed by the same database administrators managing SQL Server and other database platforms. In large enterprises, there could be hundreds of SSAS instances managed by dedicated SSAS administrators. Regardless of the environment, you should become familiar with the configuration options as well as troubleshooting methodologies. In large enterprises, you might also be required to automate these tasks using the Analysis Management Objects (AMO) code. Analysis Services is a great tool for building business intelligence solutions. However, much like any other software, it does have its fair share of challenges and limitations. Most frequently encountered enterprise business intelligence system goals include quick provision of relevant data to the business users and assuring excellent query performance. If your cubes serve a large, global community of users, you will quickly learn that SSAS is optimized to run a single query as fast as possible. Once users send a multitude of heavy queries in parallel, you can expect to see memory, CPU, and disk-related performance counters to quickly rise, with a corresponding increase in query execution duration which, in turn, worsens user experience. Although you could build aggregations to improve query performance, doing so will lengthen cube processing time, and thereby, delay the delivery of essential data to decision makers. It might also be tempting to consider using ROLAP storage mode in lieu of MOLAP so that processing times are shorter, but MOLAP queries usually outperform ROLAP due to heavy compression rates. Hence, figuring out the right storage mode and appropriate level of aggregations is a great balancing act. If you cannot afford using ROLAP, and query performance is paramount to successful cube implementation, you should consider scaling your solution. You have two options for scaling, given as follows: Scaling up: This option means purchasing servers with more memory, more CPU cores, and faster disk drives Scaling out: This option means purchasing several servers of approximately the same capacity and distributing the querying workload across multiple servers using a load balancing tool SSAS lends itself best to the second option—scaling out. Later in this article you will learn how to separate processing and querying activities and how to ensure that all servers in the querying pool have the same data. SSAS instance configuration options All Analysis Services configuration options are available in the msmdsrv.ini file found in the config folder under the SSAS installation directory. Instance administrators can also modify some, but not all configuration properties, using SQL Server Management Studio (SSMS). SSAS has a multitude of properties that are undocumented—this normally means that such properties haven't undergone thorough testing, even by the software's developers. Hence, if you don't know exactly what the configuration setting does, it's best to leave the setting at default value. Even if you want to test various properties on a sandbox server, make a copy of the configuration file prior to applying any changes. How to do it... To modify the SSAS instance settings using the configuration file, perform the following steps: Navigate to the config folder within your Analysis Services installation directory. By default, this will be C:\Program Files\Microsoft SQL Server\MSAS11.instance_name\OLAP\Config. 
Open the msmdsrv.ini file using Notepad or another text editor of your choice. The file is in the XML format, so every property is enclosed in opening and closing tags. Search for the property of interest, modify its value as desired, and save the changes. For example, in order to change the upper limit of the processing worker threads, you would look for the <ThreadPool><Process><MaxThreads> tag sequence and set the values as shown in the following excerpt from the configuration file: <Process>       <MinThreads>0</MinThreads>       <MaxThreads>250</MaxThreads>      <PriorityRatio>2</PriorityRatio>       <Concurrency>2</Concurrency>       <StackSizeKB>0</StackSizeKB>       <GroupAffinity/>     </Process> To change the configuration using SSMS, perform the following steps: Connect to the SSAS instance using the instance administrator account and choose Properties. If your account does not have sufficient permissions, you will get an error that only administrators can edit server properties. Change the desired properties by altering the Value column on the General page of the resulting dialog, as shown in the following screenshot: Advanced properties are hidden by default. You must check the Show Advanced (All) Properties box to see advanced properties. You will not see all the properties in SSMS even after checking this box. The only way to edit some properties is by editing msmdsrv.ini as previously discussed. Make a note of the Reset Default button in the bottom-right corner. This button comes in handy if you've forgotten what the configuration values were before you changed them and want to revert to the default settings. The default values are shown in the dialog box, which can provide guidance as to which properties have been altered. Some configuration settings require restarting the SSAS instance prior to being executed. If this is the case, the Restart column will have a value of Yes. Once you're happy with your changes, click on OK and restart the instance if necessary. You can restart SSAS using the Services.msc applet from the command line using the NET STOP / NET START commands, or directly in SSMS by choosing the Restart option after right-clicking on the instance. How it works... Discussing every SSAS property would make this article extremely lengthy; doing so is well beyond the scope of the book. Instead, in this section, I will summarize the most frequently used properties. Often, synchronization has to copy large partition datafiles and aggregation files. If the timeout value is exceeded, synchronization fails. Increase the value of the <Network><Listener><ServerSendTimeout> and <Network><Listener><ServerReceiveTimeout> properties to allow a longer time span for copying each file. By default, SSAS can use a lazy thread to rebuild missing indexes and aggregations after you process partition data. If the <OLAP><LazyProcessing><Enabled> property is set to 0, the lazy thread is not used for building missing indexes—you must use an explicit processing command instead. The <OLAP><LazyProcessing><MaxCPUUsage> property throttles the maximum CPU that could be used by the lazy thread. If efficient data delivery is your topmost priority, you can exploit the ProcessData option instead of ProcessFull. To build aggregations after the data is loaded, you must set the partition's ProcessingMode property to LazyAggregations. The SSAS formula engine is single threaded, so queries that perform heavy calculations will only use one CPU core, even on a multiCPU computer. 
The storage engine is multithreaded; hence, queries that read many partitions will require many CPU cycles. If you expect storage engine heavy queries, you should lower the CPU usage threshold for LazyAggregations. By default, Analysis Services records subcubes requested for every 10th query in the query log table. If you'd like to design aggregations based on query logs, you should change the <Log><QueryLog><QueryLogSampling> property value to 1 so that the SSAS logs subcube requests for every query. SSAS can use its own memory manager or the Windows memory manager. If your SSAS instance consistently becomes unresponsive, you could try using the Windows memory manager. Set <Memory><MemoryHeapType> to 2 and <Memory><HeapTypeForObjects> to 0. The Analysis Services memory manager values are 1 for both the properties. You must restart the SSAS service for the changes to these properties to take effect. The <Memory><PreAllocate> property specifies the percentage of total memory to be reserved at SSAS startup. SSAS normally allocates memory dynamically as it is required by queries and processing jobs. In some cases, you can achieve performance improvement by allocating a portion of the memory when the SSAS service starts. Setting this value will increase the time required to start the service. The memory will not be released back to the operating system until you stop the SSAS service. You must restart the SSAS service for changes to this property to take effect. The <Log><FlightRecorder><FileSizeMB>and <Log><FlightRecorder><LogDurationSec> properties control the size and age of the FlightRecorder trace file before it is recycled. You can supply your own trace definition file to include the trace events and columns you wish to monitor using the <Log><FlightRecorder><TraceDefinitionFile> property. If FlightRecorder collects useful trace events, it can be an invaluable troubleshooting tool. By default, the file is only allowed to grow to 10 MB or 60 minutes. Long processing jobs can take up much more space, and their duration could be much longer than 60 minutes. Hence, you should adjust the settings as necessary for your monitoring needs. You should also adjust the trace events and columns to be captured by FlightRecorder. You should consider adjusting the duration to cover three days (in case the issue you are researching happens over a weekend). The <Memory><LowMemoryLimit> property controls the point—amount of memory used by SSAS—at which the cleaner thread becomes actively engaged in reclaiming memory from existing jobs. Each SSAS command (query, processing, backup, synchronization, and so on) is associated with jobs that run on threads and use system resources. We can lower the value of this setting to run more jobs in parallel (though the performance of each job could suffer). Two properties control the maximum amount of memory that a SSAS instance could use. Once memory usage reaches the value specified by <Memory><TotalMemoryLimit>, the cleaner thread becomes particularly aggressive at reclaiming memory. The <Memory><HardMemoryLimit> property specifies the absolute memory limit—SSAS will not use memory above this limit. These properties are useful if you have SSAS and other applications installed on the same server computer. You should reserve some memory for other applications and the operating system as well. When HardMemoryLimit is reached, SSAS will disconnect the active sessions, advising that the operation was cancelled due to memory pressure. 
All memory settings are expressed in percentages if the values are less than or equal to 100. Values above 100 are interpreted as kilobytes. All memory configuration changes require restart of the SSAS service to take effect. In the prior releases of Analysis Services, you could only specify the minimum and maximum number of threads used for queries and processing jobs. With SSAS 2012, you can also specify the limits for the input/output job threads using the <ThreadPool><IOProcess> property. The <Process><IndexBuildThreshold> property governs the minimum number of rows within a partition for which SSAS will build indexes. The default value is 4096. SSAS decides which partitions it needs to scan for each query based on the partition index files. If the partition does not have indexes, it will be scanned for all the queries. Normally, SSAS can read small partitions without greatly affecting query performance. But if you have many small partitions, you should lower the threshold to ensure each partition has indexes. The <Process><BufferRecordLimit> and <Process><BufferMemoryLimit> properties specify the number of records for each memory buffer and the maximum percentage of memory that can be used by a memory buffer. Lower the value of these properties to process more partitions in parallel. You should monitor processing using the SQL Profiler to see if some partitions included in the processing batch are being processed while the others are in waiting. The <ExternalConnectionTimeout> and <ExternalCommandTimeout> properties control how long an SSAS command should wait for connecting to a relational database or how long SSAS should wait to execute the relational query before reporting timeout. Depending on the relational source, it might take longer than 60 seconds (that is, the default value) to connect. If you encounter processing errors without being able to connect to the relational source, you should increase the ExternalConnectionTimeout value. It could also take a long time to execute a query; by default, the processing query will timeout after one hour. Adjust the value as needed to prevent processing failures. The contents of the <AllowedBrowsingFolders> property define the drives and directories that are visible when creating databases, collecting backups, and so on. You can specify multiple items separated using the pipe (|) character. The <ForceCommitTimeout> property defines how long a processing job's commit operation should wait prior to cancelling any queries/jobs which may interfere with processing or synchronization. A long running query can block synchronization or processing from committing its transaction. You can adjust the value of this property from its default value of 30 seconds to ensure that processing and queries don't step on each other. The <Port> property specifies the port number for the SSAS instance. You can use the hostname followed by a colon (:) and a port number for connecting to the SSAS instance in lieu of the instance name. Be careful not to supply the port number used by another application; if you do so, the SSAS service won't start. The <ServerTimeout> property specifies the number of milliseconds after which a query will timeout. The default value is 1 hour, which could be too long for analytical queries. If the query runs for an hour, using up system resources, it could render the instance unusable by any other connection. You can also define a query timeout value in the client application's connection strings. 
Client setting overrides the server-level property. There's more... There are many other properties you can set to alter SSAS instance behavior. For additional information on configuration properties, please refer to product documentation at http://technet.microsoft.com/en-us/library/ms174556.aspx. Creating and dropping databases Only SSAS instance administrators are permitted to create, drop, restore, detach, attach, and synchronize databases. This recipe teaches administrators how to create and drop databases. Getting ready Launch SSMS and connect to your Analysis Services instance as an administrator. If you're not certain that you have administrative properties to the instance, right-click on the SSAS instance and choose Properties. If you can view the instance's properties, you are an administrator; otherwise, you will get an error indicating that only instance administrators can view and alter properties. How to do it... To create a database, perform the following steps: Right-click on the Databases folder and choose New Database. Doing so launches the New Database dialog shown in the following screenshot. Specify a descriptive name for the database, for example, Analysis_Services_Administration. Note that the database name can contain spaces. Each object has a name as well as an identifier. The identifier value is set to the object's original name and cannot be changed without dropping and recreating the database; hence, it is important to come up with a descriptive name from the very beginning. You cannot create more than one database with the same name on any SSAS instance. Specify the storage location for the database. By default, the database will be stored under the \OLAP\DATA folder of your SSAS installation directory. The only compelling reason to change the default is if your data drive is running out of disk space and cannot support the new database's storage requirements. Specify the impersonation setting for the database. You could also specify the impersonation property for each data source. Alternatively, each data source can inherit the DataSourceImpersonationInfo property from the database-level setting. You have four choices as follows: Specific user name (must be a domain user) and password: This is the most secure option but requires updating the password if the user changes the password Analysis Services service account Credentials of the current user: This option is specifically for data mining Default: This option is the same as using the service account option Specify an optional description for the database. As with majority of other SSMS dialogs, you can script the XMLA command you are about to execute by clicking on the Script button. To drop an existing database, perform the following steps: Expand the Databases folder on the SSAS instance, right-click on the database, and choose Delete. The Delete objects dialog allows you to ignore errors; however, it is not applicable to databases. You can script the XMLA command if you wish to review it first. An alternative way of scripting the DELETE command is to right-click on the database and navigate to Script database as | Delete To | New query window. Monitoring SSAS instance using Activity Viewer Unlike other database systems, Analysis Services has no system databases. However, administrators still need to check the activity on the server, ensure that cubes are available and can be queried, and there is no blocking. 
You can exploit a tool named Analysis Services Activity Viewer 2008 to monitor SSAS Versions 2008 and later, including SSAS 2012. This tool is owned and maintained by the SSAS community and can be downloaded from www.codeplex.com. Activity Viewer allows viewing active and dormant sessions, current XMLA and MDX queries, locks, as well as CPU and I/O usage by each connection. Additionally, you can define rules to raise alerts when a particular condition is met. How to do it... To monitor an SSAS instance using Activity Viewer, perform the following steps: Launch the application by double-clicking on ActivityViewer.exe. Click on the Add New Connection button on the Overview tab. Specify the hostname and instance name or the hostname and port number for the SSAS instance and then click on OK. For each SSAS instance you connect to, Activity Viewer adds a new tab. Click on the tab for your SSAS instance. Here, you will see several pages as shown in the following screenshot: Alerts: This page shows any sessions that met the condition found in the Rules page. Users: This page displays one row for each user as well as the number of sessions, total memory, CPU, and I/O usage. Active Sessions: This page displays each session that is actively running an MDX, Data Mining Extensions (DMX), or XMLA query. This page allows you to cancel a specific session by clicking on the Cancel Session button. Current Queries: This page displays the actual command's text, number of kilobytes read and written by the command, and the amount of  CPU time used by the command. This page allows you to cancel a specific query by clicking on the Cancel Query button. Dormant Sessions: This page displays sessions that have a connection to the SSAS instance but are not currently running any queries. You can also disconnect a dormant session by clicking on the Cancel Session button. CPU: This page allows you to review the CPU time used by the session as well as the last command executed on the session. I/O: This page displays the number of reads and writes as well as the kilobytes read and written by each session. Objects: This page shows the CPU time and number of reads affecting each dimension and partition. This page also shows the full path to the object's parent; this is useful if you have the same naming convention for partitions in multiple measure groups. Not only do you see the partition name, but also the full path to the partition's measure group. This page also shows the number of aggregation hits for each partition. If you find that a partition is frequently queried and requires many reads, you should consider building aggregations for it. Locks: This page displays the locks currently in place, whether already granted or waiting. Be sure to check the Lock Status column—the value of 0 indicates that the lock request is currently blocked. Rules: This page allows defining conditions that will result in an alert. For example, if the session is idle for over 30 minutes or if an MDX query takes over 30 minutes, you should get alerted. How it works... Activity Viewer monitors Analysis Services using Dynamic Management Views (DMV). In fact, capturing queries executed by Activity Viewer using SQL Server Profiler is a good way of familiarizing yourself with SSAS DMV's. 
For example, the Current Queries page checks the $system.DISCOVER_COMMANDS DMV for any actively executing commands by running the following query:

SELECT SESSION_SPID, COMMAND_CPU_TIME_MS, COMMAND_ELAPSED_TIME_MS,
  COMMAND_READ_KB, COMMAND_WRITE_KB, COMMAND_TEXT
FROM $system.DISCOVER_COMMANDS
WHERE COMMAND_ELAPSED_TIME_MS > 0
ORDER BY COMMAND_CPU_TIME_MS DESC

The Active Sessions page checks the $system.DISCOVER_SESSIONS DMV with the session status set to 1, using the following query:

SELECT SESSION_SPID, SESSION_USER_NAME, SESSION_START_TIME,
  SESSION_ELAPSED_TIME_MS, SESSION_CPU_TIME_MS, SESSION_ID
FROM $SYSTEM.DISCOVER_SESSIONS
WHERE SESSION_STATUS = 1
ORDER BY SESSION_USER_NAME DESC

The Dormant Sessions page runs a very similar query to that of the Active Sessions page, except that it checks for sessions with SESSION_STATUS = 0—sessions that are currently not running any queries. The result set is also limited to the top 10 sessions based on idle time measured in milliseconds. The Locks page examines all the columns of the $system.DISCOVER_LOCKS DMV to find all requested locks, as well as the lock creation time, lock type, and lock status. As you have already learned, a lock status of 0 indicates that the request is blocked, whereas a lock status of 1 means that the request has been granted. Analysis Services blocking can be caused by conflicting operations that attempt to query and modify objects. For example, a long running query can block a processing or synchronization job from completing, because processing will change the data values. Similarly, a command altering the database structure will block queries. The database administrator or instance administrator can explicitly issue the LOCK XMLA command as well as the BEGIN TRANSACTION command. Other operations request locks implicitly. The most frequently encountered Analysis Services lock types are:
2 (Read lock): acquired by processing to read metadata.
4 (Write lock): acquired by processing to write data after it is read from relational sources.
8 (Commit shared): acquired during the processing, restore, or synchronization commands.
16 (Commit exclusive): acquired when committing the processing, restore, or synchronization transaction, when existing files are replaced by new files.
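Following the same DMV pattern that Activity Viewer uses, a quick, hedged query for spotting blocked requests is to filter the locks DMV on the lock status described above; SELECT * is used here because Activity Viewer reads all of the columns anyway:

SELECT *
FROM $SYSTEM.DISCOVER_LOCKS
WHERE LOCK_STATUS = 0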

article-image-what-is-lstm
Richard Gall
11 Apr 2018
3 min read
Save for later

What is LSTM?

What does LSTM stand for?
LSTM stands for long short-term memory. It is a model, or architecture, that extends the memory of recurrent neural networks. Typically, recurrent neural networks have 'short-term memory' in that they use persistent previous information in the current neural network. Essentially, the previous information is used in the present task. That means we do not have a list of all of the previous information available for the neural node. Find out how LSTM works alongside recurrent neural networks. Watch this short video tutorial.

How does LSTM work?
LSTM introduces long-term memory into recurrent neural networks. It mitigates the vanishing gradient problem, which is where the neural network stops learning because the updates to the various weights within a given neural network become smaller and smaller. It does this by using a series of 'gates', contained in memory blocks which are connected through layers. There are three types of gates within a unit:
Input gate: Scales the input to the cell (write)
Output gate: Scales the output from the cell (read)
Forget gate: Scales the old cell value (reset)
Each gate is like a switch that controls the read/write, thus incorporating the long-term memory function into the model.

Applications of LSTM
There is a huge range of ways that LSTM can be used, including:
Handwriting recognition
Time series anomaly detection
Speech recognition
Learning grammar
Composing music

The difference between LSTM and GRU
There are many similarities between LSTM and GRU (Gated Recurrent Units). However, there are some important differences that are worth remembering:
A GRU has two gates, whereas an LSTM has three gates.
GRUs don't possess any internal memory that is different from the exposed hidden state. They don't have the output gate, which is present in LSTMs.
There is no second nonlinearity applied when computing the output in a GRU.
GRU, as a concept, is a little newer than LSTM. It is generally more efficient - it trains models at a quicker rate than LSTM. It is also easier to use, and any modifications you need to make to a model can be done fairly easily. However, LSTM should perform better than GRU where longer-term memory is required. Ultimately, comparing performance is going to depend on the data set you are using.
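For reference, the gate mechanism described above is commonly written as the following set of equations (one standard formulation; W and U are weight matrices, b are bias vectors, \sigma is the logistic sigmoid, and \odot denotes element-wise multiplication):

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)            \quad \text{(forget gate)}
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)            \quad \text{(input gate)}
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)            \quad \text{(output gate)}
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)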

article-image-3-ways-to-deploy-a-qt-and-opencv-application
Gebin George
02 Apr 2018
16 min read
Save for later

3 ways to deploy a QT and OpenCV application

[box type="note" align="" class="" width=""]This article is an excerpt from the book, Computer Vision with OpenCV 3 and Qt5 written by Amin Ahmadi Tazehkandi.  This book covers how to build, test, and deploy Qt and OpenCV apps, either dynamically or statically.[/box] Today, we will learn three different methods to deploy a QT + OpenCV application. It is extremely important to provide the end users with an application package that contains everything it needs to be able to run on the target platform. And demand very little or no effort at all from the users in terms of taking care of the required dependencies. Achieving this kind of works-out-of-the-box condition for an application relies mostly on the type of the linking (dynamic or static) that is used to create an application, and also the specifications of the target operating system. Deploying using static linking Deploying an application statically means that your application will run on its own and it eliminates having to take care of almost all of the needed dependencies, since they are already inside the executable itself. It is enough to simply make sure you select the Release mode while building your application, as seen in the following screenshot: When your application is built in the Release mode, you can simply pick up the produced executable file and ship it to your users. If you try to deploy your application to Windows users, you might face an error similar to the following when your application is executed: The reason for this error is that on Windows, even when building your Qt application statically, you still need to make sure that Visual C++ Redistributables exist on the target system. This is required for C++ applications that are built by using Microsoft Visual C++, and the version of the required redistributables correspond to the Microsoft Visual Studio installed on your computer. In our case, the official title of the installer for these libraries is Visual C++ Redistributables for Visual Studio 2015, and it can be downloaded from the following link: https:/ / www. microsoft. com/en- us/ download/ details. aspx? id= 48145. It is a common practice to include the redistributables installer inside the installer for our application and perform a silent installation of them if they are not already installed. This process happens with most of the applications you use on your Windows PCs, most of the time, without you even noticing it. We already quite briefly talked about the advantages (fewer files to deploy) and disadvantages (bigger executable size) of static linking. But when it is meant in the context of deployment, there are some more complexities that need to be considered. So, here is another (more complete) list of disadvantages, when using static linking to deploy your applications: The building takes more time and the executable size gets bigger and bigger. You can't mix static and shared (dynamic) Qt libraries, which means you can't use the power of plugins and extending your application without building everything from scratch. Static linking, in a sense, means hiding the libraries used to build an application. Unfortunately, this option is not offered with all libraries, and failing to comply with it can lead to licensing issues with your application. This complexity arises partly because of the fact that Qt Framework uses some third-party libraries that do not offer the same set of licensing options as Qt itself. 
Talking about licensing issues is not a discussion suitable for this book, so we'll suffice with mentioning that you must be careful when you plan to create commercial applications using static linking of Qt libraries. For a detailed list of licenses used by third-party libraries within Qt, you can always refer to the Licenses Used in Qt web page from the following link: http://doc.qt.io/qt-5/ licenses-used-in-qt.html Static linking, even with all of its disadvantages that we just mentioned, is still an option, and a good one in some cases, provided that you can comply with the licensing options of the Qt Framework. For instance, in Linux operating systems where creating an installer for our application requires some extra work and care, static linking can help extremely reduce the effort needed to deploy applications (merely a copy and paste). So, the final decision of whether to use static linking or not is mostly on you and how you plan to deploy your application. Making this important decision will be much easier by the end of this chapter, when you have an overview of the possible linking and deployment methods. Deploying using dynamic linking When you deploy an application built with Qt and OpenCV using shared libraries (or dynamic linking), you need to make sure that the executable of your application is able to reach the runtime libraries of Qt and OpenCV, in order to load and use them. This reachability or visibility of runtime libraries can have different meanings depending on the operating system. For instance, on Windows, you need to copy the runtime libraries to the same folder where your application executable resides, or put them in a folder that is appended to the PATH environment value. Qt Framework offers command-line tools to simplify the deployment of Qt applications on Windows and macOS. As mentioned before, the first thing you need to do is to make sure your application is built in the Release mode, and not Debug mode. Then, if you are on Windows, first copy the executable (let us assume it is called app.exe) from the build folder into a separate folder (which we will refer to as deploy_path) and execute the following commands using a command-line instance: cd deploy_path QT_PATHbinwindeployqt app.exe The windeployqt tool is a deployment helper tool that simplifies the process of copying the required Qt runtime libraries into the same folder as the application executable. Itsimply takes an executable as a parameter and after determining the modules used to create it, copies all required runtime libraries and any additional required dependencies, such as Qt plugins, translations, and so on. This takes care of all the required Qt runtime libraries, but we still need to take care of OpenCV runtime libraries. If you followed all of the steps in Chapter 1, Introduction to OpenCV and Qt, for building OpenCV libraries dynamically, then you only need to manually copy the opencv_world330.dll and opencv_ffmpeg330.dll files from OpenCV installation folder (inside the x86vc14bin folder) into the same folder where your application executable resides. We didn't really go into the benefits of turning on the BUILD_opencv_world option when we built OpenCV in the early chapters of the book; however, it should be clear now that this simplifies the deployment and usage of the OpenCV libraries, by requiring only a single entry for LIBS in the *.pro file and manually copying only a single file (not counting the ffmpeg library) when deploying OpenCV applications. 
It should be also noted that this method has the disadvantage of copying all OpenCV codes (in a single library) along your application even when you do not need or use all of its modules in a project. Also note that on Windows, as mentioned in the Deploying using static linking section, you still need to similarly provide the end users of your application with Microsoft Visual C++ Redistributables. On a macOS operating system, it is also possible to easily deploy applications written using Qt Framework. For this reason, you can use the macdeployqt command-line tool provided by Qt. Similar to windeployqt, which accepts a Windows executable and fills the same folder with the required libraries, macdeployqt accepts a macOS application bundle and makes it deployable by copying all of the required Qt runtimes as private frameworks inside the bundle itself. Here is an example: cd deploy_path QT_PATH/bin/macdeployqt my_app_bundle Optionally, you can also provide an additional -dmg parameter, which leads to the creation of a macOS *.dmg (disk image) file. As for the deployment of OpenCV libraries when dynamic linking is used, you can create an installer using Qt Installer Framework (which we will learn about in the next section), a third-party provider, or a script that makes sure the required runtime libraries are copied to their required folders. This is because of the fact that simply copying your runtime libraries (whether it is OpenCV or anything else) to the same folder as the application executable does not help with making them visible to an application on macOS. The same also applies to the Linux operating system, where unfortunately even a tool for deploying Qt runtime libraries does not exist (at least for the moment), so we also need to take care of Qt libraries in addition to OpenCV libraries, either by using a trusted third-party provider (which you can search for online) or by using the cross-platform installer provided by Qt itself, combined with some scripting to make sure everything is in place when our application is executed. Deploy using Qt Installer Framework Qt Installer Framework allows you to create cross-platform installers of your Qt applications for Windows, macOS, and Linux operating systems. It allows for creating standard installer wizards where the user is taken through consecutive dialogs that provide all the necessary information, and finally display the progress for when the application is being installed and so on, similar to most of installations you have probably faced, and especially the installation of Qt Framework itself. Qt Installer Framework is based on Qt Framework itself but is provided as a different package and does not require Qt SDK (Qt Framework, Qt Creator, and so on) to be present on a computer. It is also possible to use Qt Installer Framework in order to create installer packages for any application, not just Qt applications. In this section, we are going to learn how to create a basic installer using Qt Installer Framework, which takes care of installing your application on a target computer and copying all the necessary dependencies. The result will be a single executable installer file that you can put on a web server to be downloaded or provide it in a USB stick or CD, or any other media type. This example project will help you get started with working your way around the many great capabilities of Qt Installer Framework by yourself. You can use the following link to download and install the Qt Installer Framework. 
Make sure to simply download the latest version when you use this link, or any other source for downloading it. At the moment, the latest version is 3.0.2: https://download.qt.io/official_releases/qt-installer-framework After you have downloaded and installed Qt Installer Framework, you can start creating the required files that Qt Installer Framework needs in order to create an installer. You can do this by simply browsing to the Qt Installer Framework, and from the examples folder copying the tutorial folder, which is also a template in case you want to quickly rename and re-edit all of the files and create your installer quickly. We will go the other way and create them manually; first because we want to understand the structure of the required files and folders for the Qt Installer Framework, and second, because it is still quite easy and simple. Here are the required steps for creating an installer: Assuming that you have already finished developing your Qt and OpenCV application, you can start by creating a new folder that will contain the installer files. Let's assume this folder is called deploy. Create an XML file inside the deploy folder and name it config.xml. This XML file must contain the following: <?xml version="1.0" encoding="UTF-8"?> <Installer> <Name>Your application</Name> <Version>1.0.0</Version> <Title>Your application Installer</Title> <Publisher>Your vendor</Publisher> <StartMenuDir>Super App</StartMenuDir> <TargetDir>@HomeDir@/InstallationDirectory</TargetDir> </Installer> Make sure to replace the required XML fields in the preceding code with information relevant to your application and then save and close this file: Now, create a folder named packages inside the deploy folder. This folder will contain the individual packages that you want the user to be able to install, or make them mandatory or optional so that the user can review and decide what will be installed. In the case of simpler Windows applications that are written using Qt and OpenCV, usually it is enough to have just a single package that includes the required files to run your application, and even do silent installation of Microsoft Visual C++ Redistributables. But for more complex cases, and especially when you want to have more control over individual installable elements of your application, you can also go for two or more packages, or even sub-packages. This is done by using domain-like folder names for each package. Each package folder can have a name like com.vendor.product, where vendor and product are replaced by the developer name or company and the application. A subpackage (or sub-component) of a package can be identified by adding. subproduct to the name of the parent package. For instance, you can have the following folders inside the packages folder: com.vendor.product com.vendor.product.subproduct1 com.vendor.product.subproduct2 com.vendor.product.subproduct1.subsubproduct1 … This can go on for as many products (packages) and sub-products (sub-packages) as we like. For our example case, let's create a single folder that contains our executable, since it describes it all and you can create additional packages by simply adding them to the packages folder. Let's name it something like com.amin.qtcvapp. Now, follow these required steps: Now, create two folders inside the new package folder that we created, the com.amin.qtcvapp folder. Rename them to data and meta. These two folders must exist inside all packages. Copy your application files inside the data folder. 
This folder will be extracted into the target folder exactly as it is (we will talk about setting the target folder of a package in the later steps). In case you are planning to create more than one package, then make sure to separate their data correctly and in a way that it makes sense. Of course, you won't be faced with any errors if you fail to do so, but the users of your application will probably be confused, for instance by skipping a package that should be installed at all times and ending up with an installed application that does not work. Now, switch to the meta folder and create the following two files inside that folder, and fill them with the codes provided for each one of them. The package.xml file should contain the following. There's no need to mention that you must fill the fields inside the XML with values relevant to your package: <?xml version="1.0" encoding="UTF-8"?> <Package> <DisplayName>The component</DisplayName> <Description>Install this component.</Description> <Version>1.0.0</Version> <ReleaseDate>1984-09-16</ReleaseDate> <Default>script</Default> <Script>installscript.qs</Script> </Package> The script in the previous XML file, which is probably the most important part of the creation of an installer, refers to a Qt Installer Script (*.qs file), which is named installerscript.qs and can be used to further customize the package, its target folder, and so on. So, let us create a file with the same name (installscript.qs) inside the meta folder, and use the following code inside it: function Component() { // initializations go here } Component.prototype.isDefault = function() { // select (true) or unselect (false) the component by default return true; } Component.prototype.createOperations = function() { try { // call the base create operations function component.createOperations(); } catch (e) { console.log(e); } } This is the most basic component script, which customizes our package (well, it only performs the default actions) and it can optionally be extended to change the target folder, create shortcuts in the Start menu or desktop (on Windows), and so on. It is a good idea to keep an eye on the Qt Installer Framework documentation and learn about its scripting to be able to create more powerful installers that can put all of the required dependencies of your app in place, and automatically. You can also browse through all of the examples inside the examples folder of the Qt Installer Framework and learn how to deal with different deployment cases. For instance, you can try to create individual packages for Qt and OpenCV dependencies and allow the users to deselect them, in case they already have the Qt runtime libraries on their computer. The last step is to use the binarycreator tool to create our single and standalone installer. Simply run the following command by using a Command Prompt (or Terminal) instance: binarycreator -p packages -c config.xml myinstaller The binarycreator is located inside the Qt Installer Framework bin folder. It requires two parameters that we have already prepared. -p must be followed by our packages folder and -c must be followed by the configuration file (or config.xml) file. After executing this command, you will get myinstaller (on Windows, you can append *.exe to it), which you can execute to install your application. This single file should contain all of the required files needed to run your application, and the rest is taken care of. 
You only need to provide a download link to this file, or provide it on a CD to your users. The following are the dialogs you will face in this default and most basic installer, which contains most of the usual dialogs you would expect when installing an application: If you go to the installation folder, you will notice that it contains a few more files than you put inside the data folder of your package. Those files are required by the installer to handle modifications and uninstall your application. For instance, the users of your application can easily uninstall your application by executing the maintenance tool executable, which would produce another simple and user-friendly dialog to handle the uninstall process: We saw how to deploy a QT + OpenCV applications using static linking, dynamic linking, and QT installer. If you found our post useful, do check out this book Computer Vision with OpenCV 3 and Qt5  to accentuate your OpenCV applications by developing them with Qt.  
article-image-use-m-functions-within-power-bi-querying-data
Amarabha Banerjee
21 May 2018
10 min read
Save for later

How to use M functions within Microsoft Power BI for querying data

Microsoft Power BI Desktop contains a rich set of data source connectors and transformation capabilities that support the integration and enhancement of source data. These features are all driven by a powerful functional language and query engine, M, which leverages source system resources when possible and can greatly extend the scope and robustness of the data retrieval process beyond the possibilities of the standard query editor interface alone. As with almost all BI projects, the design and development of the data access and retrieval process has great implications for the analytical value, scalability, and sustainability of the overall Power BI solution. [box type="note" align="" class="" width=""]Our article is an excerpt from the book Microsoft Power BI Cookbook, written by Brett Powell. This book shows how to leverage  Microsoft Power BI and the development tools to create better data driven analytics and visualizations. [/box] In this article, we dive into Power BI Desktop's Get Data experience and go through the process of establishing and managing data source connections and queries. Examples are provided of using the Query Editor interface and the M language directly to construct and refine queries to meet common data transformation and cleansing needs. In practice and as per the examples, a combination of both tools is recommended to aid the query development process. Viewing and analyzing M functions Every time you click on a button to connect to any of Power BI Desktop's supported data sources or apply any transformation to a data source object, such as changing a column's data type, one or multiple M expressions are created reflecting your choices. These M expressions are automatically written to dedicated M documents and, if saved, are stored within the Power BI Desktop file as Queries. M is a functional programming language like F#, and it's important that Power BI developers become familiar with analyzing and later writing and enhancing the M code that supports their queries. Getting ready Build a query through the user interface that connects to the AdventureWorksDW2016CTP3 SQL Server database on the ATLAS server and retrieves the DimGeography table, filtered by United States for English. Click on Get Data from the Home tab of the ribbon, select SQL Server from the list of database sources, and provide the server and database names. For the Data Connectivity mode, select Import. A navigation window will appear, with the different objects and schemas of the database. Select the DimGeography table from the Navigation window and click on Edit. In the Query Editor window, select the EnglishCountryRegionName column and then filter on United States from its dropdown. Figure 2: Filtering for United States only in the Query Editor At this point, a preview of the filtered table is exposed in the Query Editor and the Query Settings pane displays the previous steps. Figure 3: The Query Settings pane in the Query Editor How to do it Formula Bar With the Formula Bar visible in the Query Editor, click on the Source step under Applied Steps in the Query Settings pane. 
You should see the following formula expression:

Figure 4: The Sql.Database() function created for the Source step

Click on the Navigation step to expose the following expression:

Figure 5: The metadata record created for the Navigation step

The navigation expression (2) references the source expression (1). The Formula Bar in the Query Editor displays individual query steps, which are technically individual M expressions. It's convenient and very often essential to view and edit all the expressions in a centralized window, and for this, there's the Advanced Editor.

M is a functional language, and it can be useful to think of query evaluation in M as similar to Excel spreadsheet formulas, in which multiple formulas can reference each other. The M engine can determine which expressions are required by the final expression to return and evaluate only those expressions.

When configuring Power BI development tools, the display settings for both the Query Settings pane and the Formula Bar should be enabled under GLOBAL | Query Editor options.

Figure 6: Global layout options for the Query Editor

Alternatively, on a per file basis, you can control these settings and others from the View tab of the Query Editor toolbar.

Figure 7: Property settings of the View tab in the Query Editor

Advanced Editor window

Given its importance to the query development process, the Advanced Editor dialog is exposed on both the Home and View tabs of the Query Editor. It's recommended to use the Query Editor when getting started with a new query and when learning the M language. After several steps have been applied, use the Advanced Editor to review and optionally enhance or customize the M query. As a rich, functional programming language, there are many M functions and optional parameters not exposed via the Query Editor; going beyond the limits of the Query Editor enables more robust data retrieval and integration processes.

Figure 8: The Home tab of the Query Editor

Click on Advanced Editor from either the Home or View tab (Figure 8). All M function expressions and any comments are exposed.

Figure 9: The Advanced Editor view of the DimGeography query

When developing retrieval processes for Power BI models, consider these common ETL questions:

- How are our queries impacting the source systems?
- Can we make our retrieval queries more resilient to changes in source data such that they avoid failure?
- Is our retrieval process efficient and simple to follow and support, or are there unnecessary steps and queries?
- Are our retrieval queries delivering sufficient performance to the BI application?
- Is our process flexible such that we can quickly apply changes to data sources and logic?

M queries are not intended as a substitute for the workloads typically handled by enterprise ETL tools such as SSIS or Informatica. However, just as BI professionals would carefully review the logic and test the performance of SQL stored procedures and ETL packages supporting their various cubes and reporting environments, they should also review the M queries created to support Power BI models and reports.
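For reference, the DimGeography query from this recipe looks roughly like the following when opened in the Advanced Editor. This is a sketch rather than the literal generated code – the navigation step name and the exact filter expression Power BI writes for your source can differ slightly:

    let
        Source = Sql.Database("ATLAS", "AdventureWorksDW2016CTP3"),
        // Navigation step: picks the DimGeography table out of the database record
        dbo_DimGeography = Source{[Schema = "dbo", Item = "DimGeography"]}[Data],
        // Filtered Rows step: the dropdown filter applied in the Query Editor
        #"Filtered Rows" = Table.SelectRows(dbo_DimGeography,
            each [EnglishCountryRegionName] = "United States")
    in
        #"Filtered Rows"

Each variable on the left corresponds to an entry under Applied Steps in the Query Settings pane.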
How it works

Two of the top performance and scalability features of M's engine are query folding and lazy evaluation. If possible, the M queries developed in Power BI Desktop are converted (folded) into SQL statements and passed to source systems for processing. M can also reduce the required resources for a given query by ignoring any unnecessary or redundant steps (variables).

M is a case-sensitive language. This includes referencing variables in M expressions (RenameColumns versus Renamecolumns) as well as the values in M queries. For example, the values "Apple" and "apple" are considered unique values in an M query; the Table.Distinct() function will not remove rows for one of the values.

Variable names in M expressions cannot contain spaces unless they are wrapped in double quotes and prefixed with a hash sign. When the Query Editor graphical interface is used to create M queries, this syntax is applied automatically, along with a name describing the M transformation applied. Applying short, descriptive variable names (with no spaces) improves the readability of M queries.

Query folding

The query from this recipe was "folded" into the following SQL statement and sent to the ATLAS server for processing.

Figure 10: The SQL statement generated from the DimGeography M query

Right-click on the Filtered Rows step and select View Native Query to access the Native Query window shown in Figure 11.

Figure 11: View Native Query in Query Settings

Finding and revising queries that are not being folded to source systems is a top technique for enhancing large Power BI datasets. See the Pushing Query Processing Back to Source Systems recipe of Chapter 11, Enhancing and Optimizing Existing Power BI Solutions, for an example of this process.

M query structure

The great majority of queries created for Power BI will follow the let...in structure, as per this recipe, as they contain multiple steps with dependencies among them. Individual expressions are separated by commas, and the expression following the in keyword is the expression returned by the query. The individual step expressions are technically "variables", and if the identifiers for these variables (the names of the query steps) contain spaces, then the step is placed in double quotes and prefixed with a # sign, as per the Filtered Rows step shown earlier.

Lazy evaluation

The M engine also has powerful "lazy evaluation" logic for ignoring any redundant or unnecessary variables, as well as short-circuiting evaluation (computation) once a result is determinate, such as when one side (operand) of an OR logical operator is computed as True. The order of evaluation of the expressions is determined at runtime; it doesn't have to be sequential from top to bottom.

In the following example, a step for retrieving Canada was added and the step for the United States was ignored. Since the CanadaOnly variable satisfies the overall let expression of the query, only the Canada query is issued to the server, as if the United States step were commented out or didn't exist.

Figure 12: Revised query that ignores the Filtered Rows step to evaluate Canada only

View Native Query is not available given this revision, but a SQL Profiler trace against the source database server (and a refresh of the M query) confirms that CanadaOnly was the only SQL query passed to the source database.

Figure 13: Capturing the SQL statement passed to the server via SQL Server Profiler trace
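The revised query behind Figure 12 looks roughly like the following – again a sketch building on the previous one, with illustrative step names:

    let
        Source = Sql.Database("ATLAS", "AdventureWorksDW2016CTP3"),
        dbo_DimGeography = Source{[Schema = "dbo", Item = "DimGeography"]}[Data],
        // This step is never referenced by the expression after "in", so lazy
        // evaluation skips it and no United States query reaches the server
        USOnly = Table.SelectRows(dbo_DimGeography,
            each [EnglishCountryRegionName] = "United States"),
        CanadaOnly = Table.SelectRows(dbo_DimGeography,
            each [EnglishCountryRegionName] = "Canada")
    in
        CanadaOnly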
There's more

Partial query folding

- A query can be "partially folded", in which a SQL statement is created resolving only part of an overall query.
- The results of this SQL statement would be returned to Power BI Desktop (or the on-premises data gateway) and the remaining logic would be computed using M's in-memory engine with local resources.
- M queries can be designed to maximize the use of source system resources by using standard expressions supported by query folding early in the query process.
- Minimizing the use of local or on-premises data gateway resources is a top consideration.

Limitations of query folding

No folding will take place once a native SQL query has been passed to the source system, for example, by passing a SQL query directly through the Get Data dialog. The following query, specified in the Get Data dialog, is included in the Source step:

Figure 14: Providing a user defined native SQL query

Any transformations applied after this native query will use local system resources. Therefore, the general implication for query development with native or user-defined SQL queries is that, if they're used, try to include all required transformations (that is, joins and derived columns), or use them to take advantage of an important feature of the source database that the folded query would not use, such as an index.

Other limitations to be aware of:

- Not all data sources support query folding, such as text and Excel files.
- Not all transformations available in the Query Editor or via M functions directly are supported by some data sources.
- The privacy levels defined for the data sources will also impact whether folding is used or not.
- SQL statements are not parsed before they're sent to the source system.
- The Table.Buffer() function can be used to avoid query folding. The table output of this function is loaded into local memory, and transformations against it will remain local.
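To make the last two limitations concrete, the following sketch combines them: a user-defined native query is supplied through the optional Query field of Sql.Database(), and Table.Buffer() then loads the result into local memory, so the filter that follows is evaluated by the local M engine rather than folded to SQL Server. The column list in the SELECT statement is illustrative:

    let
        // A native SQL query: nothing after this step can be folded back to the server
        Source = Sql.Database("ATLAS", "AdventureWorksDW2016CTP3",
            [Query = "SELECT GeographyKey, City, EnglishCountryRegionName FROM dbo.DimGeography"]),
        // Table.Buffer() pins the result locally, so the next step is not folded
        Buffered = Table.Buffer(Source),
        USRows = Table.SelectRows(Buffered,
            each [EnglishCountryRegionName] = "United States")
    in
        USRows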
We have discussed effective techniques for accessing and retrieving data using Microsoft Power BI. Do check out the book Microsoft Power BI Cookbook for more information on using Microsoft Power BI for data analysis and visualization.

Expert Interview: Unlocking the secrets of Microsoft Power BI
Tutorial: Building a Microsoft Power BI Data Model
Expert Insights: Ride the third wave of BI with Microsoft Power BI

Clustering and Other Unsupervised Learning Methods

Packt
09 Jul 2015
19 min read
In this article by Ferran Garcia Pagans, author of the book Predictive Analytics Using Rattle and Qlik Sense, we will learn about the following:

- Defining machine learning
- Introducing unsupervised and supervised methods
- Focusing on K-means, a classic machine learning algorithm, in detail

We'll create clusters of customers based on their annual money spent. This will give us a new insight: being able to group our customers based on their annual spend will allow us to see the profitability of each customer group and deliver more profitable marketing campaigns or create tailored discounts. Finally, we'll look at hierarchical clustering, different clustering methods, and association rules. Association rules are generally used for market basket analysis.

Machine learning – unsupervised and supervised learning

Machine Learning (ML) is a set of techniques and algorithms that gives computers the ability to learn. These techniques are generic and can be used in various fields. Data mining uses ML techniques to create insights and predictions from data.

In data mining, we usually divide ML methods into two main groups – supervised learning and unsupervised learning. A computer can learn with the help of a teacher (supervised learning) or can discover new knowledge without the assistance of a teacher (unsupervised learning).

In supervised learning, the learner is trained with a set of examples (a dataset) that contains the right answer; we call it the training dataset. We call the dataset that contains the answers a labeled dataset, because each observation is labeled with its answer. In supervised learning, you are supervising the computer, giving it the right answers. For example, a bank can try to predict a borrower's chance of defaulting on a credit loan based on the experience of past credit loans. The training dataset would contain data from past credit loans, including whether the borrower was a defaulter or not.

In unsupervised learning, our dataset doesn't have the right answers and the learner tries to discover hidden patterns in the data. We call it unsupervised learning because we're not supervising the computer by giving it the right answers. A classic example is trying to create a classification of customers; the model tries to discover similarities between customers. In some machine learning problems, we don't have a dataset that contains past observations. These datasets are not labeled with the correct answers and we call them unlabeled datasets.

In traditional data mining, the terms descriptive analytics and predictive analytics are used for unsupervised learning and supervised learning respectively. In unsupervised learning, there is no target variable; the objective of unsupervised learning, or descriptive analytics, is to discover the hidden structure of the data. There are two main unsupervised learning techniques offered by Rattle:

- Cluster analysis
- Association analysis

Cluster analysis

Sometimes, we have a group of observations and we need to split it into a number of subsets of similar observations. Cluster analysis is a group of techniques that will help you to discover these similarities between observations. Market segmentation is an example of cluster analysis. You can use cluster analysis when you have a lot of customers and you want to divide them into different market segments, but you don't know how to create these segments. Sometimes, especially with a large number of customers, we need some help to understand our data.
Clustering can help us to create different customer groups based on their buying behavior. In Rattle's Cluster tab, there are four cluster algorithms:

- KMeans
- EwKm
- Hierarchical
- BiCluster

The two most popular families of cluster algorithms are hierarchical clustering and centroid-based clustering.

Centroid-based clustering using the K-means algorithm

I'm going to use K-means as an example of this family because it is the most popular. With this algorithm, a cluster is represented by a point or center called the centroid. In the initialization step of K-means, we need to create k centroids; usually, the centroids are initialized randomly. In the following diagram, the observations or objects are represented with a point and three centroids are represented with three colored stars.

After this initialization step, the algorithm enters an iteration with two operations. First, the computer associates each object with the nearest centroid, creating k clusters. Then, the computer recalculates the centroids' positions; the new position is the mean of each attribute over every cluster member. This example is very simple, but in real life, when the algorithm associates the observations with the new centroids, some observations move from one cluster to the other. The algorithm iterates by recalculating centroids and assigning observations to each cluster until some finalization condition is reached, as shown in this diagram.

The inputs of a K-means algorithm are the observations and the number of clusters, k. The final result of a K-means algorithm is k centroids that represent each cluster, along with the observations associated with each cluster. The drawbacks of this technique are:

- You need to know or decide the number of clusters, k, and the result of the algorithm has a big dependence on k.
- The result of the algorithm depends on where the centroids are initialized.
- There is no guarantee that the result is the optimum result; the algorithm can iterate around a local optimum.

In order to avoid a local optimum, you can run the algorithm many times, starting with different centroid positions. To compare the different runs, you can use the cluster's distortion – the sum of the squared distances between each observation and its centroid.

Customer segmentation with K-means clustering

We're going to use the wholesale customer dataset we downloaded from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. You can download the dataset from here – https://archive.ics.uci.edu/ml/datasets/Wholesale+customers#. The dataset contains 440 customers (observations) of a wholesale distributor. It includes the annual spend in monetary units on six product categories – Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen. We've created a new field called Food that includes all categories except Detergents_Paper, as shown in the following screenshot.

Load the new dataset into Rattle and go to the Cluster tab. Remember that, in unsupervised learning, there is no target variable. I want to create a segmentation based only on buying behavior; for this reason, I set Region and Channel to Ignore, as shown here.

In the following screenshot, you can see the options Rattle offers for K-means. The most important one is Number of clusters; as we've seen, the analyst has to decide the number of clusters before running K-means.
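Rattle is a front end to R, so the options on its Cluster tab map directly onto R's built-in kmeans() function. The following is a minimal sketch of the equivalent R code; the CSV file name is an assumption about where you saved the UCI download, and note that the UCI file spells the last column Delicassen:

    # Read the wholesale customer data downloaded from the UCI repository
    customers <- read.csv("Wholesale customers data.csv")

    # Segment on buying behaviour only, so drop Channel and Region (set to Ignore in Rattle)
    spend <- customers[, c("Fresh", "Milk", "Grocery", "Frozen",
                           "Detergents_Paper", "Delicassen")]

    set.seed(42)                                      # Rattle's Seed option
    model <- kmeans(spend, centers = 6, nstart = 10)  # centers = Number of clusters (6 is the
                                                      # final choice below), nstart = Runs

    model$size          # how many customers fall into each cluster
    model$centers       # the centroids
    model$tot.withinss  # Within cluster sum of squares (the distortion)

Here set.seed(), nstart, and tot.withinss correspond to the Seed, Runs, and Within cluster sum of squares options discussed next.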
We have also seen that the initial position of the centroids can have some influence on the result of the algorithm. The position of the centroids is random, but we need to be able to reproduce the same experiment multiple times. When we're creating a model with K-means, we'll iteratively re-run the algorithm, tuning some options in order to improve the performance of the model, so we need to be able to reproduce exactly the same experiment. Under the hood, R has a pseudo-random number generator based on a starting point called a Seed. If you want to reproduce the exact same experiment, you need to re-run the algorithm using the same Seed. Sometimes, the performance of K-means depends on the initial position of the centroids, so you may also need to re-run the model using a different initial position for the centroids; to run the model with different initial positions, you run it with a different Seed.

After executing the model, Rattle will show some interesting information: the size of each cluster, the means of the variables in the dataset, the centroids' positions, and the Within cluster sum of squares value. This measure, also called distortion, is the sum of the squared differences between each point and its centroid; it's a measure of the quality of the model. Another interesting option is Runs; by using this option, Rattle will run the model the specified number of times and will choose the model with the best performance based on the Within cluster sum of squares value.

Deciding on the number of clusters can be difficult. To choose the number of clusters, we need a way to evaluate the performance of the algorithm. The sum of the squared distances between the observations and their associated centroids can serve as a performance measure. Each time we add a centroid to K-means, the sum of the squared differences between the observations and the centroids decreases; the difference in this measure between two different numbers of centroids is the gain associated with the added centroids. Rattle provides an option to automate this test, called Iterate Clusters. If you set the Number of clusters value to 10 and check the Iterate Clusters option, Rattle will run K-means iteratively, starting with 3 clusters and finishing with 10 clusters. To compare each iteration, Rattle provides an iteration plot. In the iteration plot, the blue line shows the sum of the squared differences between each observation and its centroid, and the red line shows the difference between the current sum of squared distances and the sum of squared distances of the previous iteration. For example, for four clusters, the red line has a very low value; this is because the difference between the sum of the squared differences with three clusters and with four clusters is very small. In the following screenshot, the peak in the red line suggests that six clusters could be a good choice, because there is an important drop in the Sum of WithinSS value at this point.

In this way, to finish my model, I only need to set the Number of clusters to 6, uncheck the Re-Scale checkbox, and click on the Execute button. Finally, Rattle returns the six centroids of my clusters.

Now we have the six centroids and we want Rattle to associate each observation with a centroid. Go to the Evaluate tab, select the KMeans option, select the Training dataset, mark All in the report type, and click on the Execute button as shown in the following screenshot. This process will generate a CSV file with the original dataset and a new column called kmeans.
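In plain R, this export step amounts to roughly the following sketch, assuming the model and customers objects from the earlier K-means sketch (the output file name is illustrative):

    customers$kmeans <- model$cluster    # one cluster label per customer
    write.csv(customers, "Wholesale customers kmeans.csv", row.names = FALSE)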
The content of this attribute is a label (a number) representing the cluster associated with the observation (customer), as shown in the following screenshot. After clicking on the Execute button, you will need to choose a folder to save the resulting file to and will have to type in a filename. The generated data inside the CSV file will look similar to the following screenshot. In the previous screenshot, you can see ten lines of the resulting file; note that the last column is kmeans.

Preparing the data in Qlik Sense

Our objective is to create the data model, but using the new CSV file with the kmeans column. We're going to update our application by replacing the customer data file with this new data file. Save the new file in the same folder as the original file, open the Qlik Sense application, and go to the Data load editor.

There are two differences between the original file and this one. In the original file, we added a line to create a customer identifier called Customer_ID, whereas in this second file we have this field in the dataset. The second difference is that in this new file we have the kmeans column. From the Data load editor, go to the Wholesale customer data sheet, modify line 2, and add line 3. In line 2, we just load the content of Customer_ID, and in line 3, we load the content of the kmeans field and rename it to Cluster, as shown in the following screenshot. Finally, update the name of the file to be the new one and click on the Load data button. When the data load process finishes, open the data model viewer to check your data model, as shown here. Note that you have the same data model with a new field called Cluster.

Creating a customer segmentation sheet in Qlik Sense

Now we can add a sheet to the application. We'll add three charts to see our clusters and how our customers are distributed across them. The first chart will describe the buying behavior of each cluster, as shown here. The second chart will show all customers distributed in a scatter plot, and in the last chart we'll see the number of customers that belong to each cluster, as shown here.

I'll start with the chart at the bottom-right; it's a bar chart with Cluster as the dimension and Count([Customer_ID]) as the measure. This simple bar chart has something special – colors. Each customer's cluster has a specific color code that we use in all charts; in this way, cluster 5 is blue in all three charts. To obtain this effect, we define the color with the expression color(fieldindex('Cluster', Cluster)), as shown in the following screenshot. You can find this color trick and more in this interesting blog by Rob Wunderlich – http://qlikviewcookbook.com/.

My second chart is the one at the top. I copied the previous chart and pasted it onto a free place. I kept the dimension but changed the measure, using six new measures:

- Avg([Detergents_Paper])
- Avg([Delicassen])
- Avg([Fresh])
- Avg([Frozen])
- Avg([Grocery])
- Avg([Milk])

I placed my last chart at the bottom-left. I used a scatter plot to represent all of my 440 customers. I wanted to show the money spent by each customer on food and detergents, and their cluster. I used the y axis to show the money spent on detergents and the x axis for the money spent on food, and I used colors to highlight the cluster. The dimension is Customer_ID and the measures are Delicassen+Fresh+Frozen+Grocery+Milk (or Food) and [Detergents_Paper]. As the final step, I reused the color expression from the earlier charts.
Now our first Qlik Sense application has two sheets – the original one is 100 percent Qlik Sense and helps us to understand our customers, channels, and regions. This new sheet uses clustering to give us a different point of view: it groups the customers by similar buying behavior. All this information is useful for delivering better campaigns to our customers. Cluster 5 is our least profitable cluster, but it is the biggest one, with 227 customers. The main difference between cluster 5 and cluster 2 is the amount of money spent on fresh products. Can we deliver any offer to customers in cluster 5 to try to sell more fresh products? Select retail customers and ask yourself: who are our best retail customers? To which cluster do they belong? Are they buying all our product categories?

Hierarchical clustering

Hierarchical clustering tries to group objects based on their similarity. To explain how this algorithm works, we're going to start with seven points (or observations) lying on a straight line. We start by calculating the distance between each pair of points. I'll come back later to the term distance; in this example, distance is simply the difference between two positions on the line. The points D and E are the ones with the smallest distance between them, so we group them into a cluster, as shown in this diagram.

Now, we substitute point D and point E with their mean (the red point) and we look for the two points with the next smallest distance between them. In this second iteration, the closest points are B and C, as shown in this diagram. We continue iterating until we've grouped all observations in the dataset.

Note that, in this algorithm, we can decide on the number of clusters after running the algorithm. If we divide the dataset into two clusters, the first cluster is point G and the second cluster is A, B, C, D, E, and F. This gives the analyst the opportunity to see the big picture before deciding on the number of clusters. The lowest level of clustering is a trivial one; in this example, seven clusters with one point in each one.

The chart I've created while explaining the algorithm is a basic form of a dendrogram. The dendrogram is a tree diagram used in Rattle and in other tools to illustrate the layout of the clusters produced by hierarchical clustering. In the following screenshot, we can see the dendrogram created by Rattle for the wholesale customer dataset. In Rattle's dendrogram, the y axis represents all observations or customers in the dataset, and the x axis represents the distance between the clusters.
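As with K-means, Rattle delegates this work to R. A minimal sketch of the equivalent code is shown below; it reuses the spend data frame from the earlier K-means sketch, and the linkage method is an illustrative choice rather than necessarily Rattle's default:

    d  <- dist(spend)                    # pairwise distances between customers
    hc <- hclust(d, method = "ward.D2")  # agglomerative hierarchical clustering

    plot(hc)                             # draws the dendrogram
    clusters <- cutree(hc, k = 2)        # decide on the number of clusters afterwards
    table(clusters)                      # cluster sizes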
Association analysis

Association rules, or association analysis, is also an important topic in data mining. This is an unsupervised method, so we start with an unlabeled dataset – a dataset without a variable that gives us the right answer. Association analysis attempts to find relationships between different entities. The classic example of association rules is market basket analysis: using a database of transactions in a supermarket to find items that are bought together. For example, a person who buys potatoes and burgers usually buys beer. This insight could be used to optimize the supermarket layout. Online stores are also a good example of association analysis. They usually suggest a new item to you based on the items you have bought; they analyze online transactions to find patterns in buyers' behavior.

These algorithms assume all variables are categorical; they perform poorly with numeric variables. Association methods also need a lot of time to complete, and they use a lot of CPU and memory. Remember that Rattle runs on R and the R engine loads all data into RAM.

Suppose we have a dataset such as the following. Our objective is to discover items that are purchased together. We'll create rules and we'll represent these rules like this:

Chicken, Potatoes → Clothes

This rule means that when a customer buys Chicken and Potatoes, they tend to buy Clothes. As we'll see, the output of the model will be a set of rules, so we need a way to evaluate the quality or interest of a rule. There are different measures, but we'll use only a few of them. Rattle provides three measures:

- Support
- Confidence
- Lift

Support indicates how often the rule appears in the whole dataset. In our dataset, the rule Chicken, Potatoes → Clothes has a support of 42.86 percent (3 occurrences / 7 transactions).

Confidence measures how strong rules or associations are between items. In this dataset, the rule Chicken, Potatoes → Clothes has a confidence of 1: the items Chicken and Potatoes appear three times in the dataset, the items Chicken, Potatoes, and Clothes also appear three times, and 3/3 = 1. A confidence close to 1 indicates a strong association.

In the following screenshot, I've highlighted the options on the Associate tab we have to choose from before executing an association method in Rattle. The first option is the Baskets checkbox. Depending on the kind of input data, we'll decide whether or not to check this option. If the option is checked, such as in the preceding screenshot, Rattle needs an identification variable and a target variable. After this example, we'll try another example without this option.

The second option is the minimum Support value; by default, it is set to 0.1. Rattle will not return rules with a lower Support value than the one you have set in this text box. If you choose a higher value, Rattle will only return rules that appear many times in your dataset; if you choose a lower value, Rattle will return rules that appear in your dataset only a few times. Usually, if you set a high value for Support, the system will return only the obvious relationships. I suggest you start with a high Support value and execute the method many times with a lower value in each execution. In this way, in each execution, new rules will appear that you can analyze.

The third parameter you have to set is Confidence. This parameter tells you how strong the rule has to be. Finally, the length is the number of items that a rule contains. A rule like Beer → Chips has a length of two. The default option for Min Length is 2; if you set this variable to 2, Rattle will return all rules with two or more items in them. After executing the model, you can see the rules created by Rattle by clicking on the Show Rules button, as illustrated here.

Rattle provides a very simple dataset for testing association rules in a file called dvdtrans.csv. Test the dataset to learn about association rules.
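Under the hood, Rattle's Associate tab uses the arules package. A minimal sketch of the equivalent R code for the dvdtrans.csv example follows; the ID and Item column names are assumptions about the sample file's layout:

    library(arules)

    dvd     <- read.csv("dvdtrans.csv")
    baskets <- split(as.character(dvd$Item), dvd$ID)   # group items by transaction (Baskets checked)
    trans   <- as(baskets, "transactions")

    rules <- apriori(trans,
                     parameter = list(supp   = 0.1,    # minimum Support
                                      conf   = 0.1,    # minimum Confidence
                                      minlen = 2))     # Min Length
    inspect(sort(rules, by = "lift"))                  # Show Rules, ordered by Lift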
Further learning

In this article, we introduced supervised and unsupervised learning, the two main subgroups of machine learning algorithms. If you want to learn more about machine learning, I suggest you complete the MOOC course Machine Learning at Coursera: https://www.coursera.org/learn/machine-learning. The acronym MOOC stands for Massive Open Online Course; these are courses open to participation via the Internet and are generally free. Coursera is one of the leading platforms for MOOC courses. Machine Learning is a great course designed and taught by Andrew Ng, Associate Professor at Stanford University; Chief Scientist at Baidu; and Chairman and Co-founder of Coursera. This course is really interesting. A very interesting book is Machine Learning with R by Brett Lantz, Packt Publishing.

Summary

In this article, we were introduced to machine learning, and to supervised and unsupervised methods. We focused on unsupervised methods and covered centroid-based clustering, hierarchical clustering, and association rules. We used a simple dataset, but we saw how a clustering algorithm can complement a 100 percent Qlik Sense approach by adding more information.

Resources for Article:

Further resources on this subject:

Qlik Sense's Vision [article]
Securing QlikView Documents [article]
Conozca QlikView [article]