
Data Quality Job and Analysis Examples

7.2.1
Contents

Copyright
Profiling customer data
    Identifying data anomalies
    Sharing analysis results: reports
Cleansing data
    Removing duplicate values
    Removing non-matching values
Monitoring data evolution
    Prerequisites to monitor data evolution
    Generating a Job to run the report
    Creating an execution task and scheduling it
    Deploying the task on the server

Copyright
Adapted for 7.2.1. Supersedes previous releases.
Publication date: June 20, 2019
Copyright © 2019 Talend. All rights reserved.
The content of this document is correct at the time of publication.
However, more recent updates may be available in the online version that can be found on Talend
Help Center.
Notices
Talend is a trademark of Talend, Inc.
All brands, product names, company names, trademarks and service marks are the properties of their
respective owners.
End User License Agreement
The software described in this documentation is provided under Talend's End User Software and
Subscription Agreement ("Agreement") for commercial products. By using the software, you are
considered to have fully understood and unconditionally accepted all the terms and conditions of the
Agreement.
To read the Agreement now, visit http://www.talend.com/legal-terms/us-eula?utm_medium=help&utm_source=help_content

Profiling customer data


Incorporating appropriate data quality tools in your business processes is vital from the beginning of any
project and throughout the project plan, so that you can assess the quality of your data and decide
which data to correct and how.
Suppose, for example, that you want to start a campaign for your sales and marketing groups, or
you need to contact customers for billing and payment, and your main means of contacting the right
people are email and postal addresses. Consistent and correct address data is vital in such a
campaign if you are to reach everyone.
This section provides an example of profiling US customer email and postal addresses.
It shows how to identify anomalies in address columns, how to use Talend Jobs to retrieve
duplicate and non-matching addresses, and finally how to generate periodic evolution reports to keep
monitoring data evolution and share these statistics with business users.

Identifying data anomalies


The first step in this example is to profile the customer contact information in a MySQL database. The
profiling results provide you with statistics about the values within each column.

How to profile address columns


You will use the Profiling perspective of Talend Studio to analyze a few customer columns, including
email and postal.
Using out-of-box indicators and patterns on these columns, you can show in the analysis results the
matching and non-matching address data, the number of most frequent records for each distinct
pattern, and the row, duplicate and blank counts in each column.

Defining the column analysis

Procedure
1. In the DQ Repository tree view, expand the Data Profiling folder.
2. Right-click the Analyses folder and select New Analysis.

The Create New Analysis wizard opens.

3. In the filter field, start typing basic column analysis, select Basic Column Analysis and click
Next.

4. In the Name field, enter a name for the current column analysis.

Note:
Avoid using special characters in the item names including:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating
duplicate items.

5. Set column analysis metadata (purpose, description and author name) in the corresponding fields
and click Next to proceed to the next step.

Selecting the address columns and setting sample data

Procedure
1. Expand DB connections and browse to the address columns you want to analyze.

2. Select the columns and click Finish to close the wizard.


A file for the newly created column analysis is listed under the Analysis node in the DQ Repository
tree view, and the analysis editor opens with the analysis metadata.

3. In the Data preview view, click Refresh Data.


The data in the selected columns is displayed in the table.
You can change your data source and your selected columns by using the New Connection and
Select Data buttons respectively.
4. In the Limit field, enter 50 as the number of data records you want to display in the table and
use as sample data.
5. Select n random rows to list 50 random records from the selected columns.
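
The sampling step above can be pictured outside the Studio with a minimal Python sketch. It only illustrates the idea of drawing a fixed number of random rows and is not how Talend Studio implements the option; the function name `sample_rows` and the `seed` parameter are hypothetical.

```python
import random


def sample_rows(rows, limit=50, seed=None):
    """Return up to `limit` rows picked at random, without replacement.

    Hypothetical helper: `rows` is a list of records already read
    from the selected columns.
    """
    rng = random.Random(seed)  # a seed makes the sample reproducible
    if len(rows) <= limit:
        # Fewer rows than the limit: keep them all.
        return list(rows)
    return rng.sample(rows, limit)


if __name__ == "__main__":
    customers = [f"customer_{i}" for i in range(200)]
    preview = sample_rows(customers, limit=50, seed=1)
    print(len(preview))
```

Sampling without replacement guarantees that no record appears twice in the preview.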

Setting system indicators

This column analysis uses out-of-box indicators to provide simple statistics such as row, blank and
duplicate counts on the Email and Phone columns.

Before you begin


• You have opened the Profiling perspective in the Studio.
• You have created a column analysis and defined the connection to the database. For further
information, see Defining a column analysis and Creating the database connection respectively.

Procedure
1. In the Data Preview section in the analysis editor, click Select indicators to open the Indicator
Selection dialog box.

2. Expand Simple Statistics and select Row Count, Blank Count and Duplicate Count. Click OK to
close the dialog box.
The row, blank and duplicate counts in the Email and Phone columns show
how consistent the data is.
Indicators are added accordingly to the columns in the Analyzed Columns section.

3. Click the icon next to the Duplicate Count and Blank Count indicators and set 0 in the Upper
threshold field.
Defining thresholds on the Email and Phone columns is helpful because the counts of
duplicate and blank values that exceed the threshold are written in red in the analysis results.
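
To make the three indicators and the upper threshold concrete, here is a minimal Python sketch that computes them for one column. It is an illustration under stated assumptions, not the Studio's implementation: a blank is taken to be an empty or whitespace-only string, NULL values are not counted as blank, and the duplicate count is the number of distinct values that appear more than once. The function names and the sample values are hypothetical.

```python
from collections import Counter


def simple_statistics(values):
    """Compute row, blank and duplicate counts for one column."""
    row_count = len(values)
    # Assumption: blank = empty or whitespace-only string (None is not blank).
    blank_count = sum(1 for v in values if v is not None and v.strip() == "")
    occurrences = Counter(v for v in values if v is not None)
    # Assumption: duplicate count = distinct values seen more than once.
    duplicate_count = sum(1 for n in occurrences.values() if n > 1)
    return {
        "row_count": row_count,
        "blank_count": blank_count,
        "duplicate_count": duplicate_count,
    }


def exceeds_upper_threshold(stats, upper=0):
    """Flag the indicators that a report would highlight in red."""
    return {
        name: stats[name] > upper
        for name in ("blank_count", "duplicate_count")
    }


if __name__ == "__main__":
    emails = ["a@x.com", "a@x.com", "", " ", None, "b@y.org"]
    stats = simple_statistics(emails)
    print(stats)
    print(exceeds_upper_threshold(stats, upper=0))
```

With an upper threshold of 0, any blank or duplicate value is enough to flag the column.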

Setting patterns

This column analysis uses predefined patterns to match the content of the Email and Phone columns
against standard email and US phone patterns respectively. This characterizes the content, structure and
quality of the emails and phone numbers, and gives the percentage of the data that matches the standard
formats and the percentage that does not.

Before you begin


• You have opened the Profiling perspective in the Studio.
• You have created a column analysis and defined the connection to the database. For further
information, see Defining a column analysis and Creating the database connection respectively.

Procedure
1. In the Data Preview section in the analysis editor, click the icon next to the Email column to
open the Pattern Selector dialog box.
2. Expand Regex > internet, select the Email Address check box and click OK to close the dialog box.
The pattern is added to the column in the Analyzed Columns section.
3. Click the icon next to the Phone column to open the Pattern Selector dialog box.
4. Expand Regex > phone, select the US phone numbers check box and click OK to close the dialog
box.

The pattern is added to the column in the Analyzed Columns section.


5. Click the icon next to the Email Address and US phone numbers patterns and set 98.0 in the
Lower threshold (%) fields.

If the percentage of records that match a pattern is lower than 98%, it is written in red in
the analysis results.
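
The matching percentage and the 98% lower threshold can be sketched in a few lines of Python. The two regular expressions below are simplified stand-ins chosen for illustration; they are assumptions and not the Email Address or US phone numbers patterns shipped with Talend Studio. Counting blank and NULL values as non-matching is also an assumption.

```python
import re

# Simplified illustrative patterns (assumptions, not Talend's own regexes).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_PHONE_RE = re.compile(r"(\+?1[ .-]?)?(\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}")


def match_percentage(values, pattern):
    """Percentage of values that fully match the pattern."""
    if not values:
        return 0.0
    matched = sum(1 for v in values if v is not None and pattern.fullmatch(v))
    return 100.0 * matched / len(values)


def below_lower_threshold(percentage, lower=98.0):
    """True when the match rate would be highlighted in red."""
    return percentage < lower


if __name__ == "__main__":
    emails = ["a@x.com", "bad", "b@y.org", ""]
    pct = match_percentage(emails, EMAIL_RE)
    print(pct, below_lower_threshold(pct))
```

Using `fullmatch` rather than `search` ensures that a value such as `a@x.com extra` is not counted as valid.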

Executing the analysis and displaying the profiling results

Procedure
1. Save the column analysis in the analysis editor and then press F6 to execute it.
A group of graphics is displayed in the Graphics panel to the right of the analysis editor showing
the results of the column analysis including those for pattern matching.
2. Click the Analysis Results tab at the bottom of the analysis editor to access a more detailed result
view.
These results show the generated graphics for the analyzed columns, accompanied by tables
that detail the statistics and pattern matching results.

Results

The pattern matching results show that about 10% of the email records do not match the standard
email pattern. The simple statistics results show that about 8% of the email records are blank and that
about 5% are duplicates. The pattern frequency results give the number of most frequent records
for each distinct pattern. This shows that the data is not consistent and that you need to correct and
cleanse the email data before starting your campaign.
The results for the postal column are as follows:

The result sets for the postal column give the count of the records that match and those that do not
match a standard US zip code format. The result sets also give the blank and duplicate counts and
the number of most frequent records for each distinct pattern.
As a result, some percentage of the customers cannot be contacted by either email or US mail service.
These results show clearly that your data is not very consistent and that it needs to be corrected.
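
The idea behind a pattern frequency result can be shown with a minimal Python sketch: each value is reduced to a character-class pattern, and the patterns are then counted. The notation used here (`9` for a digit, `A` and `a` for upper- and lower-case letters, other characters kept as they are) is an assumption made for this illustration, and the zip code values are invented.

```python
from collections import Counter


def to_pattern(value):
    """Replace digits with 9 and letters with A/a; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def pattern_frequency(values, top=10):
    """Return the most frequent patterns with their counts (blanks skipped)."""
    return Counter(to_pattern(v) for v in values if v).most_common(top)


if __name__ == "__main__":
    zips = ["94105", "10001", "9410A", "94105-1234"]
    for pattern, count in pattern_frequency(zips):
        print(pattern, count)
```

A column with a single dominant pattern is consistent; many distinct patterns point to formatting problems such as the stray letter in the third sample value.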

How to view analyzed data


After running the column analysis using the SQL engine, you can right-click any of the rows or bars
in the result tables or charts in the Analysis Results view of the analysis editor and access a view
of the actual analyzed data.
This can be very helpful, for example to see invalid rows and start working out what needs to be done
to clean such data.

Procedure
1. At the bottom of the analysis editor, click the Analysis Results tab to open a detailed view of the
analysis results.
2. Right-click the data row in the statistics results of the email column and select, for example,
View rows.

Results
The Data Explorer perspective opens listing the invalid rows in the email column.

Sharing analysis results: reports


Talend DQ Portal is deprecated from Talend 7.1 onwards.
After profiling the email and zip code columns and getting detailed results about the structure and
consistency of the address data, you need to share these results with other business users.
You must first generate a report file on the analysis results from Talend Studio and save the report in
a data quality data mart.
Business users can then access the report from Talend DQ Portal, which is a web-based platform that
shares analysis results generated from the studio and saved in the data quality data mart.

Generating a report file from Talend Studio


Talend DQ Portal is deprecated from Talend 7.1 onwards.

Procedure
1. In the DQ Repository tree view, right-click the analysis name and select New Report.

The report editor is displayed with the selected analysis listed in the Analysis List.

2. In the Analysis list view and from the Template type list, select Evolution as the type for the
report you want to generate.
In this example, you want to generate an evolution report, which shows
how the indicators used on the email and postal columns evolve over time. This report
allows you to compare current and historical statistics to determine the improvement or
degradation of the address data. Such information is vital for deciding when to intervene and correct
the data, and thus for monitoring data quality on an ongoing basis.
3. Select the Refresh All check box to refresh the listed analysis before generating the report.
4. In the Generated Report Settings view and from the File Type list, select PDF as the format of the
report file.
5. In the Database Connection Settings view, set the connection parameters to the data mart where
you want to store the report results.

6. Click the Check button to verify if your connection is successful.


A message confirms if the database exists and if the connection is successful.
7. If the database structure does not exist, click OK in the message to let Talend Studio create it for
you.
8. Click OK to close the confirmation message.
9. Save the report and click on the editor toolbar to generate the report file.

Results
A report file is generated and listed under the Reports node in the DQ Repository tree view. The report
shows the evolution through time of the simple statistics indicators and the patterns used on the
email and postal columns.
Below are the results of the email column:

This chart shows that 89.80% of the email addresses are valid right now.

For the simple statistics indicators, there are two charts: the first indicates the change in the statistics
and the second indicates the percentage of that change.
Generating this report repeatedly will give a flat line if there is no change in data. The line will start
to go upwards if data is fixed and downwards if data gets less accurate and consistent.
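
The two kinds of chart described above, the change in a statistic and the percentage of that change, can be illustrated with a minimal Python sketch. It assumes that each report run stores one value per indicator and that change is measured against the previous run; the labels and numbers are invented and the function name `evolution` is hypothetical.

```python
def evolution(history):
    """Compute change and percentage change between consecutive runs.

    `history` is a chronological list of (label, value) pairs.
    Returns a list of (label, change, percent_change) tuples;
    percent_change is None when the previous value is 0.
    """
    result = []
    for (_, previous), (label, current) in zip(history, history[1:]):
        change = current - previous
        percent = None if previous == 0 else 100.0 * change / previous
        result.append((label, change, percent))
    return result


if __name__ == "__main__":
    # Invented blank counts for three monthly runs.
    blank_counts = [("2019-04", 200), ("2019-05", 200), ("2019-06", 150)]
    for label, change, percent in evolution(blank_counts):
        print(label, change, percent)
```

A change of 0 between two runs corresponds to the flat line mentioned above.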
For further information on reports, see the Reports chapter in Talend Studio User Guide at
https://help.talend.com.
After generating this report in Talend Studio, business users can access it from Talend DQ Portal.

Generating an evolution report from Talend DQ Portal


Talend DQ Portal is deprecated from Talend 7.1 onwards.

Procedure
1. Access Talend DQ Portal using tdq_user as username and tdq as password.

2. Click the User menu and point to Reports > Column Report > Column Report.
3. Click the Report explore icon.
A dialog box opens to list all evolution reports generated on column analyses in the Profiling
perspective. This list shows first the name of the report and then the name of the column analysis.

4. Select the check box of the evolution report you want to generate and then click Confirm at the
bottom right corner of the dialog box.
5. Click Execute at the top of the Parameters panel.

Results
A loading indicator is displayed and then the report opens in the page.
Talend DQ Portal shows the same profiling results you generated from Talend Studio:

Cleansing data
After profiling customer data and identifying its problems, you should take action to cleanse the
data. You may start by generating two Talend Jobs: one to remove duplicates from the email
column and the other to remove the values that do not match the email pattern.
This helps you see what needs to be resolved; you can then decide which tool to use to
resolve these address issues.

Removing duplicate values


After analyzing the email and postal columns using simple statistics indicators, the analysis results
show the number of duplicate records in the columns. You can generate a ready-to-use Job on the
analysis results. This Job removes duplicate values in the selected column.
You can follow the same procedure to remove duplicates from the Email or Phone columns.

Procedure
1. In the Profiling perspective, click Analysis Results at the bottom of the editor.
2. In the Simple Statistics results of the Email or Phone column, right-click the duplicate count bar in
the chart and select Remove duplicates.
This example uses the outcome of the simple statistics used on the Email column.
The Integration perspective opens showing the generated Job.

The database input component and the tUniqueRow component are already configured according
to your connection and the columns you are analyzing.
3. Save the Job and press F6 to execute it.

Results
Duplicate values are written to the specified output database and file.
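
The effect of the generated Job can be approximated in a few lines of Python. This sketch only mirrors the idea of separating first occurrences from repeats; it is not the code behind tUniqueRow, and the exact, case-sensitive comparison on a single key column is an assumption. The function name `split_uniques` and the sample rows are hypothetical.

```python
def split_uniques(rows, key):
    """Split rows into first occurrences and later duplicates of `key`."""
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates.append(row)  # repeat of a value already seen
        else:
            seen.add(value)
            uniques.append(row)  # first time this value appears
    return uniques, duplicates


if __name__ == "__main__":
    customers = [
        {"id": 1, "email": "a@x.com"},
        {"id": 2, "email": "b@y.org"},
        {"id": 3, "email": "a@x.com"},
    ]
    uniques, duplicates = split_uniques(customers, key="email")
    print(uniques)
    print(duplicates)
```

Keeping the duplicates in a separate list, rather than discarding them, lets you review them before deciding how to correct the source data.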

What to do next
You can follow the same procedure to remove duplicates from the postal column.

For further information on using the Profiling perspective to identify and remove corrupt,
incomplete or inaccurate data, see the Data Cleansing chapter in Talend Studio User Guide at
https://help.talend.com.

Removing non-matching values


The email pattern used on the email column showed that some records do not follow the standard
email format. You can generate a ready-to-use Job to retrieve the non-matching rows from the
column.

Procedure
1. In the Profiling perspective, click the Analysis Results tab at the bottom of the editor.
2. In the Pattern Matching results of the email column, right-click the chart bar or the numerical
results and select Generate Job.
The Integration perspective opens showing the generated Job.

This Job uses the Extract Transform Load process to write the valid email rows (those that match
the pattern) and the invalid ones (those that do not) to two separate output files.
3. Save the Job and press F6 to execute it.

Results
The valid and invalid rows of the email column are written in the defined output files.
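
As a rough picture of what such a Job does, the following Python sketch routes rows to a valid or an invalid list according to a pattern and writes each list to a CSV file. It is a sketch under stated assumptions: the regular expression is a simplified stand-in rather than Talend's Email Address pattern, and the function names, file names and sample rows are hypothetical.

```python
import csv
import re

# Simplified illustrative pattern (assumption, not Talend's own regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def split_by_pattern(rows, column, pattern):
    """Return (valid, invalid) rows according to a full pattern match."""
    valid, invalid = [], []
    for row in rows:
        value = row.get(column) or ""  # treat missing/None as empty
        if pattern.fullmatch(value):
            valid.append(row)
        else:
            invalid.append(row)
    return valid, invalid


def write_rows(path, rows, fieldnames):
    """Write rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    customers = [
        {"id": 1, "email": "a@x.com"},
        {"id": 2, "email": "not-an-email"},
        {"id": 3, "email": None},
    ]
    valid, invalid = split_by_pattern(customers, "email", EMAIL_RE)
    write_rows("valid_emails.csv", valid, ["id", "email"])
    write_rows("invalid_emails.csv", invalid, ["id", "email"])
```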
You can replace the output files with different Talend components, for example to write the valid
and invalid email rows to databases.
For further information on using the Profiling perspective to identify and remove corrupt,
incomplete or inaccurate data, see the Data Cleansing chapter in Talend Studio User Guide at
https://help.talend.com.

Monitoring data evolution


Talend DQ Portal is deprecated from Talend 7.1 onwards.

To keep track of the quality of the address data you initially profiled, you can use Talend
Studio to generate a Job that launches the evolution report you created on the column analysis.
You can then deploy this Job on Talend Administration Center and schedule it to run monthly, for
example.
This way, the report generated from Talend Studio is launched remotely from Talend
Administration Center and business users can access the results from Talend DQ Portal. These results
help you see what needs to be corrected in the data.

Prerequisites to monitor data evolution


Procedure
1. Create a data quality project from Talend Administration Center Web Application, dq_proj for
example.
For further information about how to create a project in the Talend Administration Center Web
Application, see Talend Administration Center User Guide at https://help.talend.com.

Note: If you do not have the required rights to create or manage a project, contact the
administrator of your Web Application.

2. Establish a connection from your Talend Studio to the remote SVN repository storing the above-
mentioned project.
For further information about how to connect to a remote repository, see the Getting Started
Guide at https://help.talend.com.
3. Launch Talend Studio using this new connection and the new data quality project.

Generating a Job to run the report


In this section you will see how to generate a ready-to-use Job to launch the report you created on
the column analysis.

Procedure
1. In the Profiling perspective of Talend Studio, right-click the profile_customers report under the
Report node and select Generate Jobs > Launch a report.
This generates a launching-report Job and opens it in the Integration perspective.

The Report filenames field points to the technical path of the report, and the Output Folder field
points to the folder in which the report file is saved.
2. Press F6 to execute the Job from Talend Studio for testing purposes.

Results
The Talend Studio console shows information about the report including the report identification, its
name and its execution time.

Creating an execution task and scheduling it


This section describes how to create an execution task in Talend Administration Center in order to
execute the launch-report Job created in Talend Studio. It also describes how to define a trigger to
launch the execution task once per month.

Procedure
1. Connect to the Talend Administration Center Web Application.
2. In the menu tree view of the Web Application, expand Conductor and click Job Conductor to
display the execution task list.
3. From the toolbar on the Job Conductor page, click Add to clear the Execution task configuration
panel.

4. On this configuration panel, set the parameters required for executing the launch-report Job as
follows:
a) In the Label field, type in the task name.
b) In the Project field, select the data quality project in which the launch-report Job was created.
c) In the Branch field, select trunk as the branch of this project.
d) From the Name list, select the launch-report Job to be used.
e) In the Version list, select the Job version you want to launch; and in the Context field, select
the context in which to run the Job.
f) From the Execution server list, select the server which you want to use to execute this task.
5. Click Save to validate the configuration of this execution task.
The new task is displayed in the Job Conductor page under the data quality project.

6. From the task list, select the newly added task and click Triggers.
7. Click Add trigger > Add CRON trigger.
8. In the Cron Trigger configuration panel, fill in a name for the trigger and click Open UI configurer.

9. Select the minute, hour and date at which to execute the task and click Apply modifications.
The selected data is displayed in the trigger configuration panel.

This trigger means that the evolution report will be regenerated at 3:15 PM on the first day of
each month.
10. Click Save.
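
To check when a monthly trigger like the one above would next fire, the schedule can be reasoned about with a small Python sketch. It does not call Talend Administration Center or any scheduler; it only computes the next occurrence of "day 1 at 15:15". The Quartz-style expression in the comment is an assumption about how such a schedule is usually written, and the function name `next_monthly_run` is hypothetical.

```python
from datetime import datetime

# Assumption: in Quartz-style cron syntax (seconds first), "3:15 PM on the
# first day of each month" is commonly written as: 0 15 15 1 * ?


def next_monthly_run(now, day=1, hour=15, minute=15):
    """Return the first run time strictly after `now`.

    Assumes `day` exists in every month (1 to 28).
    """
    candidate = now.replace(day=day, hour=hour, minute=minute,
                            second=0, microsecond=0)
    if candidate <= now:
        # This month's slot has passed: move to the next month.
        year = now.year + (1 if now.month == 12 else 0)
        month = 1 if now.month == 12 else now.month + 1
        candidate = candidate.replace(year=year, month=month)
    return candidate


if __name__ == "__main__":
    print(next_monthly_run(datetime(2019, 6, 20, 10, 0)))
```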

Deploying the task on the server


Talend DQ Portal is deprecated from Talend 7.1 onwards.

Procedure
1. Select the task in the Job Conductor page and click Generate on the toolbar.
2. Once the status of this task reads Ready to send on the task list, select the task again and
click Deploy on the toolbar.
3. Once the status of this task reads Ready to run on the task list, select the task again and click
Run on the toolbar.

This task will automatically run the launch-report Job from Talend Administration Center Web
Application on a monthly basis. The column analysis listed in the evolution report is executed,
its results are saved in the data quality data mart, and the report file is saved in the output folder
defined in the tDqReportRun basic settings.
Generating this evolution report repeatedly will track data changes in the address columns
initially profiled. The line in the report will start to go upwards if data is fixed and downwards if
data gets less accurate and consistent.
All business users can access this report from Talend DQ Portal and see the evolution of address
data over time.
Below is an example of the evolution report that can be accessed from Talend DQ Portal:

This report shows the evolution of data in the email column from October 2012 until June 2013.
Some improvements were made to email records in February 2013, followed by some degradation
of data; then there was no change in data quality until June 2013.
