Exercise 2: Cleansing Data With Integration Services: Creating The DQS Connection Manager
Exercise 2: Cleansing Data With Integration Services: Creating The DQS Connection Manager
Exercise 2: Cleansing Data With Integration Services: Creating The DQS Connection Manager
In this exercise, you will cleanse the Office dataset with Integration Services by using the
knowledge base enhanced in Lab 2-1.
5. In the Project Password window, in the Password box, enter Pass@word1. (Do not enter
the period.)
6. Click OK.
8. In the Add SSIS Connection Manager window, select the DQS connection manager type.
9. Click Add.
1. In Solution Explorer, right-click the SSIS Packages folder, and then select
New SSIS Package.
3. To rename the package, in Solution Explorer, right-click the Package1.dtsx file, and then
select Rename.
The output of the cleansing will be split into correct and invalid outputs. Correct data will be
loaded into the DimOffice table, and invalid data will be loaded into the DimOffice_Error table.
2. To add a data flow task, click the link located at the center of the designer.
3. To design the data flow, from the SSIS Toolbox (located at the left), expand
Other Sources, and then drag the ADO NET Source to the data flow designer.
If the SSIS Toolbox is not open, on the SSIS menu, select SSIS Toolbox.
5. Verify that the data flow component looks like the following.
Do not be concerned about the error icon, which will disappear when you complete the next
steps.
6. To edit the source component, right-click the component, and then select Edit.
7. In the ADO.NET Source Editor window, in the ADO.NET Connection Manager dropdown
list, notice that the localhost.Lab connection manager is selected.
This is the same dataset you used to data profile in Lab 1-1, perform knowledge discovery
with in Lab 1-2, and source data in the Data Quality Project in Lab 2-1.
9. Click OK.
10. From the SSIS Toolbox, expand Other Transforms, and then drag the DQS Cleansing to
the data flow designer, and drop it directly beneath the source component.
12. To connect the components, first select the Office Dataset source component, and then
drag the standard output (the left, blue arrow) on top of the cleansing component.
13. Verify that the data flow design looks like the following.
14. To edit the cleansing component, right-click the component, and then select Edit.
17. In the Available Domains list, review the knowledge base domains, noticing that the first
listed in the composite domain.
You will not use the composite domain to cleanse that data in this package design.
This grid lists of input columns received from the source component.
21. Notice the second grid that defines the mapping between input columns and the
knowledge base domains.
It also defines alias output columns for the source, output and status columns.
22. Set the Office input column to map to the Office domain.
For your knowledge base, this will mean that StateOrProvince values will be set to upper
case, and ManagerEmail values will be set to lower case.
The reason needs to be output to help explain why values are invalid.
29. From the SSIS Toolbox, from inside the Common group, drag the Conditional Split to
the data flow designer, and drop it directly beneath the cleansing component.
30. Configure the standard output of the cleansing component to connect to the new
component, as follows.
34. Scroll to the bottom of the columns list, and then drag the Record Status column into the
Condition box.
35. In the Condition box, complete the expression as follows (note that the operator is two
equals (=) signs, which tests for equality).
Any record with an invalid record status will be output to the Invalid output.
39. From the SSIS Toolbox, expand Other Destinations (the last group), and then drag the
ADO NET Destination to the data flow designer, and drop it beneath, and to the left of,
the conditional split component
41. Configure the standard output of the conditional split component to connect to the new
component.
44. Verify that the data flow design looks like the following.
46. In the ADO.NET Destination Editor window, in the Connection Manager dropdown list,
notice that the localhost.Lab connection manager is selected.
This page of the editor is used to configure the mappings between the input columns, and the
columns of the DimOffice table.
49. To widen the list, drag the right edge of the Available Input Columns list, and drag open
the Name column to reveal the full column names.
There is no need to map to the OfficeKey column, as this is an identity column that will
automatically populate a sequence of values when rows are inserted into the table.
The source columns will contain original values, while the output columns will contain
standardized column (i.e. lower case email addresses), so you will map only the output
columns.
There is no need to store other column types as the rows passed to this destination are only
correct records. Status columns will only ever be Correct or Corrected.
Tip: You can also configure the mappings by selecting the input columns in the lower grid.
55. Add a second ADO.NET destination component, and then rename it DimOffice_Error.
56. Connect the conditional split component to the new destination component.
59. In the ADO.NET Destination Editor window, in the Connection Manager dropdown list,
notice that the localhost.Lab connection manager is selected.
62. Notice that the mappings to this table are automatically created.
Mappings are created automatically when there are matching column names and data types
between the two tables.
As this table will be used to analyze data quality issues, all output columns will be stored.
4. To stop the package debugging, on the Debug menu, select Stop Debugging.
5. To close SQL Server Data Tools, on the File menu, select Exit.
4. To sort the activities by descending order, in the activity grid, click the ID column header
twice.
Every activity undertaken with the Data Quality Server—even when invoked by SSIS—is
logged and remains available for review and audit.
1. To open a Data Quality Project, in the Data Quality Projects panel, click
Open Data Quality Project.
2. In the project grid, right-click the SSIS cleansing project, and then select Open.
3. Notice that the project opens at the Manage and View Results step.
It is not possible to go back to earlier steps, however it is possible to interactively correct the
data, and then export it as you did in Lab 2-1. You will not do this in this lab.
It is very important that you execute the script in the manner intended. Many script files
include multiple batches of statements (completed with the GO keyword), and so you should
select the statements together with the GO keyword, and then execute only that selection.
To execute a subset of a script, select the text you intend to execute, and then click Execute
(or press F5).
9. Select and execute the only query in the batch (lines 4-5).
10. Read the commented text, and then execute the query for each of the remaining batches
in the script.
11. To exit SQL Server Management Studio, on the File menu, select Exit.
You have now completed the lab. If you are not commencing the next lab, you should
complete the Finishing Up exercise to shut down and stop the VM.