Talend Examples DataQuality EN 7.2.1
Analysis Examples
7.2.1
Contents
Copyright
Cleansing data
Removing duplicate values
Removing non-matching values
Copyright
Adapted for 7.2.1. Supersedes previous releases.
Publication date: June 20, 2019
Copyright © 2019 Talend. All rights reserved.
The content of this document is correct at the time of publication.
However, more recent updates may be available in the online version that can be found on Talend
Help Center.
Notices
Talend is a trademark of Talend, Inc.
All brands, product names, company names, trademarks and service marks are the properties of their
respective owners.
End User License Agreement
The software described in this documentation is provided under Talend's End User Software and
Subscription Agreement ("Agreement") for commercial products. By using the software, you are
considered to have fully understood and unconditionally accepted all the terms and conditions of the
Agreement.
To read the Agreement now, visit http://www.talend.com/legal-terms/us-eula?utm_medium=help&utm_source=help_content
Profiling customer data
Procedure
1. In the DQ Repository tree view, expand the Data Profiling folder.
2. Right-click the Analyses folder and select New Analysis.
3. In the filter field, start typing basic column analysis, select Basic Column Analysis and click
Next.
4. In the Name field, enter a name for the current column analysis.
Note:
Avoid using special characters in the item names including:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating
duplicate items.
5. Set column analysis metadata (purpose, description and author name) in the corresponding fields
and click Next to proceed to the next step.
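The note on special characters in step 4 can be illustrated with a short sketch. It is not Talend Studio's implementation: the helper name `file_system_name` is hypothetical, and the character set is a subset of the list in the note. It only shows why two different item names can end up as the same name once each special character becomes "_".

```python
# Subset of the characters listed in the note (typographic quotes omitted).
SPECIAL_CHARS = set('~!`#^&*\\/?:;".()\'¥«»<>')


def file_system_name(item_name: str) -> str:
    """Replace each special character with "_" (hypothetical helper)."""
    return "".join("_" if ch in SPECIAL_CHARS else ch for ch in item_name)


# Two distinct item names collapse to the same file system name.
print(file_system_name("email:phone"))  # email_phone
print(file_system_name("email/phone"))  # email_phone
```

Because both names map to `email_phone`, the second item would look like a duplicate of the first.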
Procedure
1. Expand DB connections and browse to the address columns you want to analyze.
This column analysis uses out-of-box indicators to provide simple statistics such as row, blank and
duplicate counts on the Email and Phone columns.
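As a rough illustration of what these three indicators measure, the sketch below computes them for a small list of values. It is not how Talend Studio computes them. It assumes that a value is blank when it is missing or contains only whitespace, and that the duplicate count is the number of distinct non-blank values that occur more than once. The function name and the sample values are made up for the example.

```python
from collections import Counter


def simple_statistics(values):
    """Return row, blank and duplicate counts for one column.

    Assumptions: a value is blank when it is None or only whitespace, and
    the duplicate count is the number of distinct non-blank values that
    occur more than once.
    """
    non_blank = [v.strip() for v in values if v is not None and v.strip()]
    frequencies = Counter(non_blank)
    return {
        "row_count": len(values),
        "blank_count": len(values) - len(non_blank),
        "duplicate_count": sum(1 for n in frequencies.values() if n > 1),
    }


# Made-up sample column, for illustration only.
emails = ["a@example.com", "b@example.com", "a@example.com", "", "  ", None]
stats = simple_statistics(emails)
print(stats)
```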
Procedure
1. In the Data Preview section in the analysis editor, click Select indicators to open the Indicator
Selection dialog box.
2. Expand Simple Statistics and select Row Count, Blank Count and Duplicate Count. Click OK to
close the wizard.
The row, blank and duplicate counts show how consistent the data in the Email and Phone
columns is.
Indicators are added accordingly to the columns in the Analyzed Columns section.
3. Click the icon next to the Duplicate Count and Blank Count indicators and set 0 in the Upper
threshold field.
Defining thresholds on the Email and Phone columns is helpful because any duplicate or blank
count above the threshold is written in red in the analysis results.
Setting patterns
This column analysis uses predefined patterns to match the content of the Email and Phone columns
against standard email and US phone patterns respectively. This describes the content, structure and
quality of the emails and phone numbers, and gives the percentage of the data that matches the
standard formats and the percentage that does not.
Procedure
1. In the Data Preview section in the analysis editor, click the icon next to the Email column to
open the Pattern Selector dialog box.
2. Expand Regex > internet, select the Email Address check box and click OK to close the dialog box.
The pattern is added to the column in the Analyzed Columns section.
3. Click the icon next to the Phone column to open the Pattern Selector dialog box.
4. Expand Regex > phone, select the US phone numbers check box and click OK to close the dialog
box.
If the percentage of records that match the patterns is lower than 98%, it is written in red in
the analysis results.
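The pattern matching described in this section can be sketched as follows. The two regular expressions are simplified stand-ins chosen for the example, not the Email Address and US phone numbers patterns shipped with Talend Studio, and the function names are hypothetical. The sketch computes the percentage of values that match a pattern and checks it against the 98% threshold mentioned above.

```python
import re

# Simplified stand-ins, NOT the patterns shipped with Talend Studio.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")


def match_percentage(values, pattern):
    """Percentage of values that match the pattern in full."""
    if not values:
        return 0.0
    matching = sum(1 for v in values if v and pattern.fullmatch(v))
    return 100.0 * matching / len(values)


def below_threshold(percentage, lower_threshold=98.0):
    """True when the matching percentage should be flagged (written in red)."""
    return percentage < lower_threshold


# Made-up sample values, for illustration only.
emails = ["ann@example.com", "bob@example", "", "carl@example.org"]
pct = match_percentage(emails, EMAIL_PATTERN)
print(f"{pct:.1f}% match, flagged: {below_threshold(pct)}")
```

Blank values count as non-matching here. That is a choice made for the sketch, not a statement about how the product treats them.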
Procedure
1. Save the column analysis in the analysis editor and then press F6 to execute it.
A group of graphics is displayed in the Graphics panel to the right of the analysis editor showing
the results of the column analysis including those for pattern matching.
2. Click the Analysis Results tab at the bottom of the analysis editor to access a more detailed result
view.
These results show the generated graphics for the analyzed columns, accompanied by tables
that detail the statistics and pattern matching results.
Results
The pattern matching results show that about 10% of the email records do not match the standard
email pattern. The simple statistics results show that about 8% of the email records are blank and that
about 5% are duplicates. The pattern frequency results give the number of most frequent records
for each distinct pattern. This shows that the data is not consistent and that you need to correct and
cleanse the email data before starting your campaign.
The results for the postal column look like the following:
The result sets for the postal column give the count of the records that match and those that do not
match a standard US zip code format. The result sets also give the blank and duplicate counts and
the number of most frequent records for each distinct pattern. These results show that the data is not
very consistent.
As a consequence, some percentage of the customers cannot be contacted by either email or US mail
service. These results show clearly that your data is not very consistent and that it needs to be corrected.
Procedure
1. At the bottom of the analysis editor, click the Analysis Results tab to open a detailed view of the
analysis results.
2. Right-click the data row in the statistics results of the email column and select, for example,
View rows.
Results
The Data Explorer perspective opens listing the invalid rows in the email column.
Procedure
1. In the DQ Repository tree view, right-click the analysis name and select New Report.
The report editor is displayed with the selected analysis listed in the Analysis List.
2. In the Analysis List view, from the Template type list, select Evolution as the type for the
report you want to generate.
In this example, you generate an evolution report, which shows how the indicators used on the
email and postal columns evolve over time. This report allows you to compare current and
historical statistics to determine whether the address data is improving or degrading. Such
information is vital for deciding when to intervene and resolve data issues, and thus for
monitoring the quality of data on an ongoing basis.
3. Select the Refresh All check box to refresh the listed analysis before generating the report.
4. In the Generated Report Settings view, from the File Type list, select PDF to generate a PDF
report file.
5. In the Database Connection Settings view, set the connection parameters to the data mart where
you want to store the report results.
Results
A report file is generated and listed under the Reports node in the DQ Repository tree view. The report
shows the evolution through time of the simple statistics indicators and the patterns used on the
email and postal columns.
Below are the results of the email column:
This chart shows that 89.80% of the email addresses are valid right now.
For the simple statistics indicators, there are two charts: the first indicates the change in the statistics
and the second indicates the percentage of that change.
Generating this report repeatedly will give a flat line if there is no change in data. The line will start
to go upwards if data is fixed and downwards if data gets less accurate and consistent.
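The two kinds of chart described above, the change in a statistic and the percentage of that change, can be sketched as a small calculation over successive report runs. This is an illustration only: the function name is hypothetical, the run labels and values are invented, and the formula used by the report is not given in this document. The percentage is assumed here to be relative to the previous run.

```python
def evolution(history):
    """history: list of (run_label, indicator_value) in chronological order.

    Returns (run_label, change, percent_change) for each run after the first.
    percent_change is None when the previous value is 0.
    """
    rows = []
    for (_, previous), (label, current) in zip(history, history[1:]):
        change = current - previous
        percent = None if previous == 0 else 100.0 * change / previous
        rows.append((label, change, percent))
    return rows


# Invented values: no change between runs 1 and 2, then an improvement.
valid_email_counts = [("run 1", 800), ("run 2", 800), ("run 3", 880)]
for row in evolution(valid_email_counts):
    print(row)
```

A run with no change gives a change of 0, which corresponds to the flat line mentioned above.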
For further information on reports, see the Reports chapter in Talend Studio User Guide at
https://help.talend.com.
After generating this report in Talend Studio, business users can access it from Talend DQ Portal.
Procedure
1. Access Talend DQ Portal using tdq_user as username and tdq as password.
2. Click the User menu and move the cursor to Reports > Column Report > Column Report.
3. Click the Report explore icon.
A dialog box opens to list all evolution reports generated on column analyses in the Profiling
perspective. This list shows first the name of the report and then the name of the column analysis.
4. Select the check box of the evolution report you want to generate and then click Confirm at the
bottom right corner of the dialog box.
5. Click Execute at the top of the Parameters panel.
Results
A loading indicator is displayed and then the report opens in the page.
You will have in Talend DQ Portal the same profiling results you generated from Talend Studio:
Cleansing data
After profiling customer data and identifying its problems, you need to take some actions on the data
to cleanse it. You may start by generating two Talend Jobs: one to remove duplicates from the email
column and the other to remove the values that do not match the email pattern.
This helps you see what needs to be resolved, so that you can then decide which tool to use to
resolve these address issues.
Procedure
1. In the Profiling perspective, click Analysis Results at the bottom of the editor.
2. In the Simple Statistics results of the Email or Phone column, right-click the duplicate count bar in
the chart and select Remove duplicates.
This example uses the outcome of the simple statistics used on the Email column.
The Integration perspective opens showing the generated Job.
The database input component and the tUniqueRow component are already configured according
to your connection and the columns you are analyzing.
3. Save the Job and press F6 to execute it.
Results
Duplicate values are written to the specified output database and file.
What to do next
You can follow the same procedure to remove duplicates from the postal column.
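The idea behind removing duplicates can be sketched outside of Talend Studio as follows. This is not the generated Job and does not reproduce how the tUniqueRow component is configured. It assumes that the first row seen for each value is kept and that later repeats are set aside. The function name, row layout and sample rows are made up for the example.

```python
def split_unique_and_duplicates(rows, key):
    """Keep the first row seen for each key value; collect later repeats."""
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates.append(row)
        else:
            seen.add(value)
            uniques.append(row)
    return uniques, duplicates


# Made-up sample rows, for illustration only.
customers = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "email": "ann@example.com"},
]
uniques, duplicates = split_unique_and_duplicates(customers, "email")
print(len(uniques), "unique,", len(duplicates), "duplicate")
```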
For further information on using the Profiling perspective to identify and remove corrupt,
incomplete or inaccurate data, see the Data Cleansing chapter in Talend Studio User Guide at
https://help.talend.com.
Procedure
1. In the Profiling perspective, click the Analysis Results tab at the bottom of the editor.
2. In the Pattern Matching results of the email column, right-click the chart bar or the numerical
results and select Generate Job.
The Integration perspective opens showing the generated Job.
This Job uses the Extract Transform Load process to write the email rows to two separate output
files: one for the valid rows that match the pattern and one for the invalid rows that do not.
3. Save the Job and press F6 to execute it.
Results
The valid and invalid rows of the email column are written in the defined output files.
You can replace the output files with different Talend components to retrieve the valid and invalid
email rows and write them to databases, for example.
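The routing of valid and invalid rows to two outputs can be sketched as below. This is an illustration, not the generated Job. The email pattern is a simplified stand-in rather than the one used by Talend Studio, the function name and row layout are hypothetical, and in-memory buffers stand in for the two output files.

```python
import csv
import io
import re

# Simplified stand-in, NOT the pattern shipped with Talend Studio.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def write_valid_and_invalid(rows, valid_out, invalid_out, column="email"):
    """Route each row to one of two CSV outputs depending on the pattern."""
    valid_writer = csv.writer(valid_out)
    invalid_writer = csv.writer(invalid_out)
    counts = {"valid": 0, "invalid": 0}
    for row in rows:
        value = row.get(column) or ""
        if EMAIL_PATTERN.fullmatch(value):
            valid_writer.writerow([row["id"], value])
            counts["valid"] += 1
        else:
            invalid_writer.writerow([row["id"], value])
            counts["invalid"] += 1
    return counts


# Made-up sample rows; StringIO buffers replace real files here.
rows = [{"id": 1, "email": "ann@example.com"}, {"id": 2, "email": "bob@example"}]
valid_file, invalid_file = io.StringIO(), io.StringIO()
result = write_valid_and_invalid(rows, valid_file, invalid_file)
print(result)
```

Passing file objects opened with `open(path, "w", newline="")` instead of the buffers would write the same rows to disk.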
For further information on using the Profiling perspective to identify and remove corrupt,
incomplete or inaccurate data, see the Data Cleansing chapter in Talend Studio User Guide at
https://help.talend.com.
Monitoring data evolution
To keep track of the quality of the address data you initially profiled, you can use Talend
Studio to generate a Job that launches the evolution report you created on the column analysis.
You can then deploy this Job on Talend Administration Center and schedule it to run monthly, for
example.
This way, the report generated from Talend Studio is launched remotely from Talend
Administration Center and business users can access the results from Talend DQ Portal. These results
help you see what needs to be corrected in the data.
Note: If you do not have the required rights to create or manage a project, contact the
administrator of your Web Application.
2. Establish a connection from your Talend Studio to the remote SVN repository storing the
above-mentioned project.
For further information about how to connect to a remote repository, see the Getting Started
Guide at https://help.talend.com.
3. Launch Talend Studio using this new connection and the new data quality project.
Procedure
1. In the Profiling perspective of Talend Studio, right-click the profile_customers report under the
Report node and select Generate Jobs > Launch a report.
This generates a launching-report Job and opens it in the Integration perspective.
The Report filenames field points to the technical path of the report, and the Output Folder field
points to the folder in which the report file is saved.
2. Press F6 to execute the Job from Talend Studio for testing purposes.
Results
The Talend Studio console shows information about the report including the report identification, its
name and its execution time.
Procedure
1. Connect to the Talend Administration Center Web Application.
2. In the menu tree view of the Web Application, expand Conductor and click Job Conductor to
display the execution task list.
3. From the toolbar on the Job Conductor page, click Add to clear the Execution task configuration
panel.
4. On this configuration panel, set the parameters required for executing the launch-report Job as
follows:
a) In the Label field, type in the task name.
b) In the Project field, select the data quality project in which the launch-report Job was created.
c) In the Branch field, select trunk as the branch of this project.
d) From the Name list, select the launch-report Job to be used.
e) In the Version list, select the Job version you want to launch; and in the Context field, select
the context in which to run the Job.
f) From the Execution server list, select the server which you want to use to execute this task.
5. Click Save to validate the configuration of this execution task.
The new task is displayed in the Job Conductor page under the data quality project.
6. From the task list, select the newly added task and click Triggers.
7. Click Add trigger > Add CRON trigger.
8. In the Cron Trigger configuration panel, fill in a name for the trigger and click Open UI configurer.
9. Select the minute, hour and date at which to execute the task and click Apply modifications.
The selected data is displayed in the trigger configuration panel.
This trigger means that the evolution report will be regenerated at 3:15 pm on the first day of
each month.
10. Click Save.
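The monthly schedule set in the trigger above can be sketched as follows. The cron string is an assumption: it is a Quartz-style expression for 3:15 pm on the first day of each month, and the expression actually shown in the trigger configuration panel is not reproduced in this document. The helper `next_monthly_run` is hypothetical and only illustrates when such a trigger would next fire.

```python
from datetime import datetime

# Assumed Quartz-style expression for "3:15 pm on the 1st of each month".
# Field order: seconds minutes hours day-of-month month day-of-week
ASSUMED_CRON = "0 15 15 1 * ?"


def next_monthly_run(now, day=1, hour=15, minute=15):
    """Next fire time strictly after `now` for a monthly schedule.

    Only days that exist in every month (1 to 28) are supported.
    """
    candidate = now.replace(day=day, hour=hour, minute=minute,
                            second=0, microsecond=0)
    if candidate > now:
        return candidate
    # This month's run has passed: move to the same day next month.
    if now.month == 12:
        year, month = now.year + 1, 1
    else:
        year, month = now.year, now.month + 1
    return candidate.replace(year=year, month=month)


print(next_monthly_run(datetime(2013, 1, 1, 9, 0)))    # 2013-01-01 15:15:00
print(next_monthly_run(datetime(2013, 12, 20, 9, 0)))  # 2014-01-01 15:15:00
```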
Procedure
1. Select the task in the Job Conductor page and click Generate on the toolbar.
2. Once the status of this task reads Ready to send on the task list, select the task again and
click Deploy on the toolbar.
3. Once the status of this task reads Ready to run on the task list, select the task again and click
Run on the toolbar.
This task will automatically run the launch-report Job from Talend Administration Center Web
Application on a monthly basis. The column analysis listed in the evolution report is executed,
its results are saved in the data quality data mart and the report file is saved in the output folder
defined in the tDqReportRun basic settings.
Generating this evolution report repeatedly will track data changes in the address columns
initially profiled. The line in the report will start to go upwards if data is fixed and downwards if
data gets less accurate and consistent.
All business users can access this report from Talend DQ Portal and see the evolution of address
data over time.
Below is an example of the evolution report that can be accessed from Talend DQ Portal:
This report shows the evolution of data in the email column from October 2012 to June 2013.
Some improvements were made to email records in February 2013, followed by some degradation
of data, and then there was no change in data quality until June 2013.