Talend ETL Sample Documentation
JasperETL is powered by Talend and uses Talend's Data Integration and Open Studio features for ETL
purposes.
Talend MDM allows organizations to easily model and master any reference data, in any
domain without constraints. The unified data management platform unites Data Integration,
Data Quality, Master Data and Data Stewardship all through a single Eclipse-based
development environment.
Data Integration
Data Quality
Master Data Management
All Talend products are built on a unified Eclipse-based development environment, which
provides users with consistent ergonomics, a fast learning curve and a high level of reusability.
This offers unrivaled benefits in terms of resource optimization and utilization, and project
consistency.
Data Integration
Talend's data integration products include:
Talend Open Studio, the community version, provided under the GPL v2 license and
freely downloadable
Talend Integration Suite, the enterprise version, provided under a commercial
subscription license. Talend Integration Suite exists in 3 editions: Team Edition,
Professional Edition and Enterprise Edition
Talend On Demand, the Software as a Service version
Talend Integration Suite MPx, a massively parallel data integration platform
Talend Integration Suite RTx, a real-time data integration platform
Data Quality
Talend's data quality products include:
Talend Open Profiler, an open source data profiling tool provided under the GPL v2
license and freely downloadable
Talend Data Quality, the enterprise data quality platform that includes data profiling and
data cleansing features
Talend MDM Community Edition, an open source Master Data Management tool
provided under the GPL v2 license and freely downloadable
Talend MDM Enterprise Edition, the enterprise version, provided under a commercial
subscription license.
Here you will get two products: Talend Server and Talend MDM. To run the application, execute
TMDMCE-win32-x86.exe under Talend MDM.
Create a local repository and a project based on the language (Java / Perl) that suits you.
Over here you can see the various windows, such as:
1/. Repository
2/. Palette
3/. The middle area is your working zone, where you can create various jobs, Business Models, etc.
1/. Repository
The Repository is the place where all your data is stored, such as your Jobs, Business Models, Metadata
information and others.
Under Job Design you create various jobs according to your data transformation requirements.
Under Metadata you can define and create various connections with your source data, which can be a CSV,
a database or any other format of data.
2/. Palette
The Palette provides all the components that you can use while preparing your Business Model
or Job for data transfer from the source data location to the destination data location.
Here you have lots of components available for data Extraction, Transformation and Loading into the
target system.
Now we will see how to create a new Job in the system:
Creating a Job:
Right-click on Job Design under the Repository window and select Create Job; it will open a popup.
Here you can provide the basic details of the Job, such as its Name, Purpose and Description. It will then
create a new job for you and open it in the workspace:
Now you can create various metadata items related to your source and destination data.
Click on Next and browse to the partner CSV; you will see the data shown below:
Now click on Next:
Here you can set various parameters related to your CSV settings. Now click on Next:
Here you will get the description of the schema as the fields of your CSV file. I have selected
Website as the Key because I do not want partners to be duplicated, and I want to avoid duplicate
records in the system based on the website. In general you can set any number of columns as keys, as per
your requirement. Now click on Finish and you will get partner_csv under File Delimited as
your source data.
Now we need to set up our destination database; here I am using a PostgreSQL database.
So right-click on DBConnections under MetaData and select Create Connection; this will open a
popup where you can provide the name of the connection. Click on Next to provide the connection
details:
Here you can select the target database type and provide the connection settings. After filling in the details
click on Finish and you will get the connection under MetaData > DBConnections:
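For reference, the connection details the wizard asks for are the usual JDBC settings for PostgreSQL. Below
is a minimal sketch of the same connection in Java; the host, port, database name, user and password are
placeholder values, not taken from this document:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class PgConnectionCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder settings; use the values you entered in the wizard.
            String url = "jdbc:postgresql://localhost:5432/mydb";   // host, port, database
            String user = "postgres";
            String password = "secret";

            // The wizard's host, port and database fields amount to an equivalent JDBC URL.
            try (Connection conn = DriverManager.getConnection(url, user, password)) {
                System.out.println("Connected: " + !conn.isClosed());
            }
        }
    }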
Now, to retrieve the table schemas, right-click on your DB connection and select Retrieve Schema:
Here you can select the schema type among TABLE, VIEW or SYNONYM; I have selected only
tables, as I require only tables. You can use SQL queries as well to fetch your data. Now click Next.
Here it will show you all the tables present in the database, so you can select the tables you want
and click Next:
Here you can select the fields which you want for your data process and click on Finish.
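Behind the scenes, this kind of schema retrieval corresponds to reading the database metadata over JDBC.
A rough sketch of the idea, reusing the placeholder connection from above (this is not the exact code
Talend runs):

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    public class ListTables {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "postgres", "secret")) {
                DatabaseMetaData meta = conn.getMetaData();
                // "TABLE" matches the TABLE schema type selected in the wizard;
                // "VIEW" or "SYNONYM" could be requested instead.
                try (ResultSet rs = meta.getTables(null, null, "%", new String[] {"TABLE"})) {
                    while (rs.next()) {
                        System.out.println(rs.getString("TABLE_NAME"));
                    }
                }
            }
        }
    }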
Now we need to add our CSV file to the job, so drag the CSV file into the job workspace; it will then
open a popup like this:
Select tFileInputDelimited, as we want the CSV as the input source, and click OK; it will then create an item
in the job workspace:
Now, by double-clicking on the component, you can see the component properties in the Component
window:
Here you can view all the settings related to your CSV file, and you can also edit them from here.
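Conceptually, tFileInputDelimited reads the file line by line and splits each line on the configured field
separator, producing one value per schema column. A simplified sketch of that behaviour, assuming a file
named partner.csv, a semicolon separator and a single header row (match these to your own settings):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ReadDelimited {
        public static void main(String[] args) throws Exception {
            try (BufferedReader in = new BufferedReader(new FileReader("partner.csv"))) {
                String header = in.readLine();           // skip the header row (Header = 1)
                String line;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.split(";");   // one entry per schema column
                    System.out.println(String.join(" | ", fields));
                }
            }
        }
    }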
Now add the destination for the data, which is the table we have created under DBConnections; just
drag the table from there into the job workspace. It will open a popup window:
Here select tPostgresqlOutput, as this is going to be the output of the data flow:
In the same way you can view or edit the properties of the res_partner component under the Component window.
Now we need to add the tMap component for mapping the input and output fields in the data flow,
so find the tMap component in the Palette window:
Now drag it into the job workspace:
Now, to filter duplicate records, add the tUniqRow component from the Palette window:
Now we need to join the data flow from partner_csv to tMap; this will be the input for tMap. For
this, right-click on partner_csv, select Row > Main and connect it to tMap:
Now we need to take the output from tMap to tUniqRow. For this, right-click on tMap, select
Row > New Output (Main) and name the output connection.
It will then ask about matching the target schema; click Yes.
Now we need to take the output from tUniqRow to res_partner. For this, right-click on tUniqRow,
select Row > Uniques and connect it to res_partner.
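At this point the whole flow, from source to target, looks like this (out1 is just an assumed name for the
tMap output connection, since the document does not name it):

    partner_csv --(Main)--> tMap --(out1)--> tUniqRow --(Uniques)--> res_partner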
Mapping of Data:
Now double-click on the tMap icon to map the source and target data flows; it will open the map like
this:
Now we need to map the input fields to the output fields, so drag the related columns onto the target
columns, or click on the Auto Map button at the top right-hand side.
You can also see that I have used one extra column, active, as it is required for making the partner
active in the destination database, where it is mandatory. It is a Boolean field, so it takes a value of
TRUE/FALSE; I have therefore written true as the expression for the active column, as you can see:
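In tMap, each output column holds a small Java expression: columns mapped one-to-one simply reference
the input column, while the active column gets the constant true. A sketch with assumed names (row1 is a
typical default name for the incoming flow; only website and active are taken from this walkthrough, the
name column is illustrative):

    // tMap output expressions for the res_partner output table
    name    : row1.name        // straight mapping from the CSV column
    website : row1.website     // the key column used for deduplication
    active  : true             // constant Boolean so every imported partner is active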
Now double-click on res_partner to view the properties of this component under the Component
window:
Here you can view the basic connection settings. Two fields are important:
1. Action on table: this defines how your connection will treat your table.
2. Action on data: this defines what operation you are going to perform; it can be insert, update or
a combination of both.
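For example, with Action on data set to an insert-or-update style option, the component updates a row
when the key already exists and inserts it otherwise. In PostgreSQL terms the effect is roughly the
following sketch; the table and column names are assumed from this walkthrough, a unique constraint on
website is assumed, and the exact SQL Talend generates may differ:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class UpsertPartner {
        public static void main(String[] args) throws Exception {
            String sql = "INSERT INTO res_partner (name, website, active) VALUES (?, ?, ?) "
                       + "ON CONFLICT (website) DO UPDATE SET name = EXCLUDED.name, active = EXCLUDED.active";
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "postgres", "secret");
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "Acme Ltd");            // illustrative values only
                ps.setString(2, "www.acme.example");
                ps.setBoolean(3, true);
                ps.executeUpdate();
            }
        }
    }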
Now we are done with all the configuration and are ready to run our job. So click on the Run window
near the Component window:
Before running, check the CSV which we have created to import the partners; it contains some duplicate
records.
Here you can see that we have some duplicate records, which we will filter out by using the tUniqRow
component.
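Conceptually, tUniqRow keeps the first row it sees for each value of the key column (the website here) and
drops the rest from the unique-rows flow. A simplified sketch of that idea, with purely illustrative rows:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class DedupeByWebsite {
        public static void main(String[] args) {
            // Illustrative name/website pairs, including a duplicate website value.
            List<String[]> rows = new ArrayList<>();
            rows.add(new String[] {"Acme Ltd", "www.acme.example"});
            rows.add(new String[] {"Acme Limited", "www.acme.example"});   // duplicate key
            rows.add(new String[] {"Globex", "www.globex.example"});

            Set<String> seenWebsites = new HashSet<>();
            for (String[] row : rows) {
                // Only the first row per website goes to the "Uniques" flow.
                if (seenWebsites.add(row[1])) {
                    System.out.println(row[0] + " -> kept");
                } else {
                    System.out.println(row[0] + " -> duplicate, dropped");
                }
            }
        }
    }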
Run the Job
Now you can run the job:
So this is how we use Talend for ETL purposes. For further reference you can check out the help, which is
available in detail under the Help menu in Talend.