





















































In this article by Christian Cote, Matija Lah, and Dejan Sarka, the authors of the book SQL Server 2016 Integration Services Cookbook, we will see how to install the Azure Feature Pack, which in turn installs the Azure control flow tasks and data flow components. We will also see how to use the Fuzzy Lookup transformation for identity mapping.
In the early years of SQL Server, Microsoft introduced a tool to help developers and database administrators (DBAs) interact with the data: Data Transformation Services (DTS). The tool was very primitive compared to SSIS, and it relied mostly on ActiveX and T-SQL to transform the data. SSIS 1.0 appeared in 2005. The tool was a game changer in the ETL world at the time: a professional and (pretty much) reliable tool for 2005. The 2008 and 2008 R2 versions were much the same as 2005 in the sense that they didn't add much functionality, but they made the tool more scalable.
In 2012, Microsoft enhanced SSIS in many ways. They rewrote the package XML to ease source control integration and make package code easier to read. They also greatly enhanced the way packages are deployed by using an SSIS catalog in SQL Server. Having the catalog in SQL Server gives us execution reports and many views that give us access to metadata or meta-process information in our projects.
Version 2014 didn't add anything to SSIS. Version 2016 brings another set of features, as you will see. We now also have the possibility to integrate with big data.
Business intelligence projects often reveal previously unseen issues with the quality of the source data. Dealing with data quality includes data quality assessment (or data profiling), data cleansing, and maintaining high quality over time.
In SSIS, the Data Profiling task helps you find unclean data. The Data Profiling task is not like the other tasks in SSIS, because it is not intended to be run over and over again through a scheduled operation. Think of SSIS as the wrapper for this tool. You use the SSIS framework to configure and run the Data Profiling task, and then you observe the results through the separate Data Profile Viewer. The output of the Data Profiling task will be used to help you in your development and design of the ETL and dimensional structures in your solution. Periodically, you may want to rerun the Data Profiling task to see how the data has changed, but the package you develop will not include the task in the overall recurring ETL process.
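Some of the profiles the task computes, such as the Column Null Ratio and Column Length Distribution profiles, can be approximated directly in T-SQL if you want a quick check without running the task. The following sketch assumes the AdventureWorksDW2014 sample database; the dbo.DimCustomer table and the AddressLine2 and LastName columns are just illustrative choices:

USE AdventureWorksDW2014;
-- Approximate the Column Null Ratio profile for one column
SELECT
  COUNT(*) AS TotalRows,
  SUM(CASE WHEN AddressLine2 IS NULL THEN 1 ELSE 0 END) AS NullRows,
  100.0 * SUM(CASE WHEN AddressLine2 IS NULL THEN 1 ELSE 0 END)
        / COUNT(*) AS NullRatioPercent
FROM dbo.DimCustomer;
-- Approximate the Column Length Distribution profile for one column
SELECT LEN(LastName) AS ValueLength, COUNT(*) AS Occurrences
FROM dbo.DimCustomer
GROUP BY LEN(LastName)
ORDER BY ValueLength;

Queries like these are useful for a spot check, but the Data Profiling task also computes profiles, such as candidate keys and functional dependencies, that are considerably more work to express by hand.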
The Azure ecosystem is becoming predominant in the Microsoft ecosystem, and SSIS has not been left behind in the past few years.
The Azure Feature Pack is not an SSIS 2016-specific feature. It's also available for SSIS versions 2012 and 2014. It's worth mentioning that it appeared in July 2015, a few months before the SSIS 2016 release.
This section assumes that you have installed SQL Server Data Tools 2015.
We'll start SQL Server Data Tools and open the CustomLogging project, if not already done:
SSIS includes two really sophisticated matching transformations in the data flow. The Fuzzy Lookup transformation is used for identity mapping, and the Fuzzy Grouping transformation is used for de-duplicating. Both of them use the same algorithm for comparing strings and other data.
Identity mapping and de-duplication are actually the same problem. For example, instead of mapping the identities of entities in two tables, you can union all of the data in a single table and then do the de-duplication. Or, vice versa, you can join a table to itself and then do identity mapping instead of de-duplication.
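As a minimal illustration of this equivalence, assume two hypothetical tables, dbo.CustomersA and dbo.CustomersB, with the same schema (these names and the FullName matching column are not part of the recipe, just a sketch). The same exact-match pairs can be found either by joining the two tables or by unioning them and then self-joining the combined set:

-- Identity mapping: join the two tables on the matching column
SELECT a.CustomerKey AS KeyA, b.CustomerKey AS KeyB
FROM dbo.CustomersA AS a
INNER JOIN dbo.CustomersB AS b
  ON a.FullName = b.FullName;

-- De-duplication view of the same problem: union the rows,
-- then self-join the combined set on the matching column
WITH AllCustomers AS
(
  SELECT 'A' AS Src, CustomerKey, FullName FROM dbo.CustomersA
  UNION ALL
  SELECT 'B' AS Src, CustomerKey, FullName FROM dbo.CustomersB
)
SELECT x.CustomerKey AS KeyA, y.CustomerKey AS KeyB
FROM AllCustomers AS x
INNER JOIN AllCustomers AS y
  ON x.FullName = y.FullName
 AND x.Src < y.Src;

The Fuzzy transformations replace the exact equality predicate with approximate string matching, but the shape of the problem stays the same.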
This recipe assumes that you have successfully finished the previous recipe.
CREATE TABLE dbo.FuzzyMatchingResults
(
  CustomerKey INT NOT NULL PRIMARY KEY,
  FullName NVARCHAR(200) NULL,
  StreetAddress NVARCHAR(200) NULL,
  Updated INT NULL,
  CleanCustomerKey INT NULL
);
When the package executes the transformation for the first time, it copies the reference table, adds a key with an integer datatype to the new table, and builds an index on the key column. Next, the transformation builds an index, called a match index, on the copy of the reference table. The match index stores the results of tokenizing the values in the transformation input columns. The transformation then uses these tokens in the lookup operation. The match index is a table in a SQL Server database. When the package runs again, the transformation can either use an existing match index or create a new index. If the reference table is static, the package can avoid the potentially expensive process of rebuilding the index for repeat sessions of data cleansing.
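If you configure the transformation to store a new match index, you give it a name at configuration time; FuzzyLookupMatchIndex below is only an assumed example name, not one prescribed by the recipe. Because the match index is stored as a regular table, you can verify that it was persisted with a query such as:

USE AdventureWorksDW2014;
-- The match index is persisted as an ordinary table in the
-- reference database; the name is whatever you chose in the
-- transformation editor (FuzzyLookupMatchIndex is hypothetical)
SELECT name, create_date
FROM sys.tables
WHERE name = N'FuzzyLookupMatchIndex';

Remember that if the reference table changes, a stored match index becomes stale unless you configure the transformation to maintain it.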
TRUNCATE TABLE dbo.CustomersDirtyMatch;
TRUNCATE TABLE dbo.CustomersDirtyNoMatch;
TRUNCATE TABLE dbo.FuzzyMatchingResults;
-- Not matched
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey IS NULL;
-- Incorrect matches
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey <> CustomerKey * (-1);
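The two queries above can be combined into a single summary. This sketch relies on the test-data convention implied by the queries, namely that a dirty row's CustomerKey is the negative of its clean counterpart, so a correct match satisfies CleanCustomerKey = CustomerKey * (-1):

-- Summarize match quality in one pass
SELECT
  COUNT(*) AS TotalRows,
  SUM(CASE WHEN CleanCustomerKey IS NULL
           THEN 1 ELSE 0 END) AS NotMatched,
  SUM(CASE WHEN CleanCustomerKey = CustomerKey * (-1)
           THEN 1 ELSE 0 END) AS CorrectMatches,
  SUM(CASE WHEN CleanCustomerKey IS NOT NULL
            AND CleanCustomerKey <> CustomerKey * (-1)
           THEN 1 ELSE 0 END) AS IncorrectMatches
FROM dbo.FuzzyMatchingResults;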
USE AdventureWorksDW2014;
DROP TABLE IF EXISTS dbo.Chapter05Profiling;
DROP TABLE IF EXISTS dbo.AWCitiesStatesCountries;
USE DQS_STAGING_DATA;
DROP TABLE IF EXISTS dbo.CustomersCh05;
DROP TABLE IF EXISTS dbo.CustomersCh05DQS;
DROP TABLE IF EXISTS dbo.CustomersClean;
DROP TABLE IF EXISTS dbo.CustomersDirty;
DROP TABLE IF EXISTS dbo.CustomersDirtyMatch;
DROP TABLE IF EXISTS dbo.CustomersDirtyNoMatch;
DROP TABLE IF EXISTS dbo.CustomersDQSMatch;
DROP TABLE IF EXISTS dbo.DQSMatchingResults;
DROP TABLE IF EXISTS dbo.DQSSurvivorshipResults;
DROP TABLE IF EXISTS dbo.FuzzyMatchingResults;
SQL Server Data Quality Services (DQS) is a knowledge-driven data quality solution. This means that it requires you to maintain one or more knowledge bases (KBs). In a KB, you maintain all knowledge related to a specific portion of data, for example, customer data. The idea of Data Quality Services is to reduce the cost of the cleansing process over time: while the amount of time you need to spend on cleansing decreases, you achieve higher and higher levels of data quality. While cleansing, you learn what types of errors to expect, discover error patterns, find domains of correct values, and so on. You don't throw away this knowledge. You store it and use it to find and correct the same issues automatically during your next cleansing process.
We have seen how to install the Azure Feature Pack, which provides the Azure control flow tasks and data flow components, and how to use the Fuzzy Lookup transformation for identity mapping.