





















































In this article by Christian Cote, Matija Lah, and Dejan Sarka, the authors of the book SQL Server 2016 Integration Services Cookbook, we will see how to install the Azure Feature Pack, which in turn installs the Azure control flow tasks and data flow components. We will also see how to use the Fuzzy Lookup transformation for identity mapping.
In the early years of SQL Server, Microsoft introduced a tool to help developers and database administrators (DBAs) interact with the data: Data Transformation Services (DTS). The tool was very primitive compared to SSIS, and it relied mostly on ActiveX and T-SQL to transform the data. SSIS 1.0 appeared in 2005. The tool was a game changer in the ETL world at the time: a professional and (pretty much) reliable tool for 2005. The 2008 and 2008 R2 versions were much the same as 2005 in the sense that they didn't add much functionality, but they made the tool more scalable.
In 2012, Microsoft enhanced SSIS in many ways. They rewrote the package XML to ease source control integration and make package code easier to read. They also greatly enhanced the way packages are deployed by using an SSIS catalog in SQL Server. Having the catalog in SQL Server gives us execution reports and many views that give us access to metadata or meta-process information in our projects.
Version 2014 didn't add anything to SSIS. Version 2016 brings another set of features, as you will see. We now also have the possibility to integrate with big data.
Business intelligence projects often reveal previously unseen issues with the quality of the source data. Dealing with data quality includes data quality assessment (or data profiling), data cleansing, and maintaining high quality over time.
In SSIS, the Data Profiling task helps you find unclean data. The Data Profiling task is not like the other tasks in SSIS, because it is not intended to be run over and over again through a scheduled operation. Think of SSIS as the wrapper for this tool. You use the SSIS framework to configure and run the Data Profiling task, and then you observe the results through the separate Data Profile Viewer. The output of the Data Profiling task will be used to help you in your development and design of the ETL and dimensional structures in your solution. Periodically, you may want to rerun the Data Profiling task to see how the data has changed, but the package you develop will not include the task in the overall recurring ETL process.
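Some of the profiles the task computes, such as the Column Null Ratio and Column Length Distribution profiles, can be approximated directly in T-SQL if you want a quick check without running the task. The following sketch assumes the AdventureWorksDW2014 sample database; the dbo.DimCustomer table and the AddressLine2 and LastName columns are just illustrative choices:

USE AdventureWorksDW2014;
-- Approximate the Column Null Ratio profile for one column
SELECT
  COUNT(*) AS TotalRows,
  SUM(CASE WHEN AddressLine2 IS NULL THEN 1 ELSE 0 END) AS NullRows,
  100.0 * SUM(CASE WHEN AddressLine2 IS NULL THEN 1 ELSE 0 END)
        / COUNT(*) AS NullRatioPercent
FROM dbo.DimCustomer;
-- Approximate the Column Length Distribution profile for one column
SELECT LEN(LastName) AS ValueLength, COUNT(*) AS Occurrences
FROM dbo.DimCustomer
GROUP BY LEN(LastName)
ORDER BY ValueLength;

Queries like these are useful for a spot check, but the Data Profiling task also computes profiles, such as candidate keys and functional dependencies, that are considerably more work to express by hand.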
The Azure ecosystem is becoming predominant in the Microsoft ecosystem, and SSIS has not been left behind in the past few years.
The Azure Feature Pack is not an SSIS 2016-specific feature. It's also available for SSIS versions 2012 and 2014. It's worth mentioning that it appeared in July 2015, a few months before the SSIS 2016 release.
This section assumes that you have installed SQL Server Data Tools 2015.
We'll start SQL Server Data Tools and open the CustomLogging project, if not already done:
SSIS includes two really sophisticated matching transformations in the data flow. The Fuzzy Lookup transformation is used for identity mapping, and the Fuzzy Grouping transformation is used for de-duplicating. Both of them use the same algorithm for comparing strings and other data.
Identity mapping and de-duplication are actually the same problem. For example, instead of mapping the identities of entities in two tables, you can union all of the data in a single table and then do the de-duplication. Or, vice versa, you can join a table to itself and then do identity mapping instead of de-duplication.
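As a minimal illustration of this equivalence, assume two hypothetical tables, dbo.CustomersA and dbo.CustomersB, with the same schema (these names and the FullName matching column are not part of the recipe, just a sketch). The same exact-match pairs can be found either by joining the two tables or by unioning them and then self-joining the combined set:

-- Identity mapping: join the two tables on the matching column
SELECT a.CustomerKey AS KeyA, b.CustomerKey AS KeyB
FROM dbo.CustomersA AS a
INNER JOIN dbo.CustomersB AS b
  ON a.FullName = b.FullName;

-- De-duplication view of the same problem: union the rows,
-- then self-join the combined set on the matching column
WITH AllCustomers AS
(
  SELECT 'A' AS Src, CustomerKey, FullName FROM dbo.CustomersA
  UNION ALL
  SELECT 'B' AS Src, CustomerKey, FullName FROM dbo.CustomersB
)
SELECT x.CustomerKey AS KeyA, y.CustomerKey AS KeyB
FROM AllCustomers AS x
INNER JOIN AllCustomers AS y
  ON x.FullName = y.FullName
 AND x.Src < y.Src;

The Fuzzy transformations replace the exact equality predicate with approximate string matching, but the shape of the problem stays the same.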
This recipe assumes that you have successfully finished the previous recipe.
CREATE TABLE dbo.FuzzyMatchingResults
(
  CustomerKey INT NOT NULL PRIMARY KEY,
  FullName NVARCHAR(200) NULL,
  StreetAddress NVARCHAR(200) NULL,
  Updated INT NULL,
  CleanCustomerKey INT NULL
);
When the package executes the transformation for the first time, it copies the reference table, adds a key with an integer datatype to the new table, and builds an index on the key column. Next, the transformation builds an index, called a match index, on the copy of the reference table. The match index stores the results of tokenizing the values in the transformation input columns. The transformation then uses these tokens in the lookup operation. The match index is a table in a SQL Server database. When the package runs again, the transformation can either use an existing match index or create a new index. If the reference table is static, the package can avoid the potentially expensive process of rebuilding the index for repeat sessions of data cleansing.
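If you configure the transformation to store a new match index, you give it a name at configuration time; FuzzyLookupMatchIndex below is only an assumed example name, not one prescribed by the recipe. Because the match index is stored as a regular table, you can verify that it was persisted with a query such as:

USE AdventureWorksDW2014;
-- The match index is persisted as an ordinary table in the
-- reference database; the name is whatever you chose in the
-- transformation editor (FuzzyLookupMatchIndex is hypothetical)
SELECT name, create_date
FROM sys.tables
WHERE name = N'FuzzyLookupMatchIndex';

Remember that if the reference table changes, a stored match index becomes stale unless you configure the transformation to maintain it.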
TRUNCATE TABLE dbo.CustomersDirtyMatch;
TRUNCATE TABLE dbo.CustomersDirtyNoMatch;
TRUNCATE TABLE dbo.FuzzyMatchingResults;
-- Not matched
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey IS NULL;
-- Incorrect matches
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey <> CustomerKey * (-1);
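The two queries above can be combined into a single summary. This sketch relies on the test-data convention implied by the queries, namely that a dirty row's CustomerKey is the negative of its clean counterpart, so a correct match satisfies CleanCustomerKey = CustomerKey * (-1):

-- Summarize match quality in one pass
SELECT
  COUNT(*) AS TotalRows,
  SUM(CASE WHEN CleanCustomerKey IS NULL
           THEN 1 ELSE 0 END) AS NotMatched,
  SUM(CASE WHEN CleanCustomerKey = CustomerKey * (-1)
           THEN 1 ELSE 0 END) AS CorrectMatches,
  SUM(CASE WHEN CleanCustomerKey IS NOT NULL
            AND CleanCustomerKey <> CustomerKey * (-1)
           THEN 1 ELSE 0 END) AS IncorrectMatches
FROM dbo.FuzzyMatchingResults;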
USE AdventureWorksDW2014;
DROP TABLE IF EXISTS dbo.Chapter05Profiling;
DROP TABLE IF EXISTS dbo.AWCitiesStatesCountries;
USE DQS_STAGING_DATA;
DROP TABLE IF EXISTS dbo.CustomersCh05;
DROP TABLE IF EXISTS dbo.CustomersCh05DQS;
DROP TABLE IF EXISTS dbo.CustomersClean;
DROP TABLE IF EXISTS dbo.CustomersDirty;
DROP TABLE IF EXISTS dbo.CustomersDirtyMatch;
DROP TABLE IF EXISTS dbo.CustomersDirtyNoMatch;
DROP TABLE IF EXISTS dbo.CustomersDQSMatch;
DROP TABLE IF EXISTS dbo.DQSMatchingResults;
DROP TABLE IF EXISTS dbo.DQSSurvivorshipResults;
DROP TABLE IF EXISTS dbo.FuzzyMatchingResults;
SQL Server Data Quality Services (DQS) is a knowledge-driven data quality solution. This means that it requires you to maintain one or more knowledge bases (KBs). In a KB, you maintain all knowledge related to a specific portion of data, for example, customer data. The idea of Data Quality Services is to reduce the cost of the cleansing process over time: while the amount of time you need to spend on cleansing decreases, you achieve higher and higher levels of data quality. While cleansing, you learn what types of errors to expect, discover error patterns, find domains of correct values, and so on. You don't throw away this knowledge. You store it and use it to find and correct the same issues automatically during your next cleansing process.
We have seen how to install the Azure Feature Pack, which provides the Azure control flow tasks and data flow components, and how to use the Fuzzy Lookup transformation for identity mapping.