Intro To Talend Open Studio For Data Integration: Philip Yurchuk

This document provides an overview of Talend Open Studio for Data Integration. It describes Talend as an Eclipse-based visual programming tool that generates Java code for ETL (extract, transform, load) jobs that can run standalone or embedded. It notes the more than 600 components and 450 connectors available for extracting and loading data from various sources. The document demonstrates some features, provides tips, and recommends additional resources for learning more about Talend.

Intro to Talend Open Studio for Data Integration
Philip Yurchuk
http://philip.yurchuk.com
What is Talend?
Eclipse-based visual programming editor
Generates executable Java code
Jobs can run standalone or embedded (no special server)
Batch or interactive (user input)
What is ETL?
Extract: suck up data
Transform: mess with it
Load: blow it out
Batch, integration, migration, etc.
Extract from/load to where?
Over 600 components
Over 450 connectors
Allows multiple inputs/outputs in a single job
Connectors
Flat files: delimited (tab, CSV), XML, JSON, Excel, positional, Apache HTTP logs, HL7...
Applications/Platforms: Alfresco, Microsoft Dynamics (CRM, AX), SAP, Sage ERP X3, Salesforce, SugarCRM
Connectors (continued)
Relational Databases: MySQL, PostgreSQL, MS SQL, Oracle, many more
NoSQL/Columnar/OLAP/Other: Amazon Redshift, Greenplum, Hive, OLAP cubes, LDAP, VectorWise, Teradata, more in Big Data ed.
How do we transport data?
File system
FTP
SFTP/SCP
Web service (SOAP, REST)
HTTP
Mail, POP
XMLRPC, Sockets, JMS, RSS...
Other Components
Process data: join, filter, aggregate
Flow control: loops, job invocation
Logs, statistics
Code: Java, Groovy
On row data or standalone
Can load libraries
Demo
Nifty Components
FuzzyMatch - calculate Levenshtein distance or phonetic similarity
IntervalMatch - perform lookup/join based on values falling within an interval
Replace, ReplaceList - search and replace, substitution
UniqRow - output distinct rows based on defined key columns
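To make the FuzzyMatch bullet concrete, here is a minimal sketch of Levenshtein distance in plain Java. It is an illustration of the metric only, not Talend's generated code; the class and method names are made up for this example.

```java
// Illustrative sketch only; LevenshteinSketch is a hypothetical name, not a Talend class.
class LevenshteinSketch {

    /**
     * Returns the minimum number of single-character insertions,
     * deletions, or substitutions needed to turn a into b.
     */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row of a
        int[] curr = new int[b.length() + 1]; // distances for the current row of a

        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two arrays instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A small distance (say, 1 or 2) between a lookup value and a reference value is what a fuzzy match would treat as "close enough"; the threshold is a per-job choice.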
More Nifty Components
XMLMap - allows joins, column or row filtering, transformations, and multiple outputs
Normalize/Denormalize - split a delimited string into rows or join rows into a delimited string
AggregateRow - GROUP BY; min, max, sum, and other functions used to aggregate rows on a column
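As a rough picture of what an AggregateRow-style GROUP BY computes, the sketch below groups rows by a key column and collects min, max, and sum with standard Java streams. The Order row type and its region/amount columns are assumptions invented for the example; Talend's generated code will look different.

```java
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative sketch only; AggregateSketch and Order are hypothetical names.
class AggregateSketch {

    // Hypothetical row: a group-by key (region) and a numeric column (amount).
    static final class Order {
        private final String region;
        private final int amount;

        Order(String region, int amount) {
            this.region = region;
            this.amount = amount;
        }

        String region() { return region; }
        int amount() { return amount; }
    }

    /** Groups rows by region and computes count, min, max, and sum of amount per group. */
    static Map<String, IntSummaryStatistics> byRegion(List<Order> rows) {
        return rows.stream().collect(
            Collectors.groupingBy(
                Order::region,                               // GROUP BY region
                TreeMap::new,                                // keep groups sorted by key
                Collectors.summarizingInt(Order::amount)));  // min, max, sum, count, average
    }
}
```

Each map entry corresponds to one output row of the aggregation: the key is the group, and `getMin()`, `getMax()`, and `getSum()` are the aggregated columns.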
Tips and Tricks
CamelCase job names for embedded jobs.
Or prefix with ETL phase and order of execution.
Whenever appropriate (esp. for inserting data), use the schema from the repository.
Tips and Tricks
Propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes.
On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.
Tips and Tricks
Failure handling subjob:
It's an unconnected job (no triggers point to it).
Use LogCatcher to catch and record component failures.
Record the failure in a DB, file, email, etc.
Add a rollback component to undo DB changes if necessary. You may need to do this in the job itself if strategic placement is needed.
Tips and Tricks
In Java expressions, use methods, not operators. E.g., concat(String) instead of the + operator, equals(Object) instead of ==.
Technical components (like hash maps) are hidden by default. See:
http://www.talendforge.org/forum/viewtopic.php?pid=110860
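A short plain-Java illustration of the methods-over-operators tip: == compares object references, so two strings with the same characters can still compare as different. The class name and sample values are made up for this sketch.

```java
import java.util.Objects;

// Illustrative sketch only; StringCompareSketch is a hypothetical name.
class StringCompareSketch {

    /** Null-safe comparison of string contents, e.g. for a filter expression. */
    static boolean sameText(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        String fromRow = new String("NY"); // a distinct object, as a value read from a row may be
        String literal = "NY";

        System.out.println(fromRow == literal);          // false: different objects
        System.out.println(fromRow.equals(literal));     // true: same characters
        System.out.println(sameText(fromRow, literal));  // true, and safe if either side is null

        // concat(String) joins two strings; it throws NullPointerException on a null argument.
        System.out.println("New ".concat("York"));
    }
}
```

Calling equals on a value that may be null throws, so in expressions over nullable columns a null-safe form such as `Objects.equals(a, b)` or `"NY".equals(value)` is the safer variant.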
Tips and Tricks
Use a context for job variables.
Note you can specify a type for variables.
You can read from a file or database, or pass in a context if it is an embedded Java job.
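The idea of typed context variables read from a file can be sketched in plain Java with java.util.Properties. This is not Talend's own context class; the variable names (dbHost, dbPort, dryRun) and their defaults are assumptions for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative sketch only; ContextSketch and its variables are hypothetical.
class ContextSketch {
    final String dbHost;   // String variable
    final int dbPort;      // int variable: a non-numeric value fails at load time, not mid-job
    final boolean dryRun;  // boolean variable

    ContextSketch(Properties p) {
        dbHost = p.getProperty("dbHost", "localhost");
        dbPort = Integer.parseInt(p.getProperty("dbPort", "5432"));
        dryRun = Boolean.parseBoolean(p.getProperty("dryRun", "false"));
    }

    /** Reads key=value pairs from any source (file, string, ...) and builds the typed context. */
    static ContextSketch load(Reader source) {
        Properties p = new Properties();
        try {
            p.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ContextSketch(p);
    }
}
```

An embedding application could build the same object directly from a Properties instance instead of a file, which mirrors passing a context into an embedded job.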
Tips and Tricks
For multi-host deployment:
Export the job with a bootstrap context that declares all variables but populates only one: a context config location that is the same on all machines.
The context config file has all values required for that host, e.g. the test DB connection for the test machine.
On Windows, a leading / resolves to the root of the current drive (typically the main system drive), so /Data/ will translate to C:\Data\
Be mindful of file permissions for sensitive context data (e.g., DB password).
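The bootstrap-plus-host-file pattern above can be sketched as two layers of java.util.Properties: the bootstrap values act as defaults, and the per-host file overrides them. The key name contextConfigPath and the class name are assumptions for this example, not Talend settings.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Illustrative sketch only; HostContextSketch and "contextConfigPath" are hypothetical names.
class HostContextSketch {

    /**
     * Starts from the bootstrap context shipped with the job, then overlays
     * the values found in the host-specific config file it points to.
     */
    static Properties resolve(Properties bootstrap) {
        // Same location on every machine, e.g. /Data/etl/context.properties
        Path configFile = Paths.get(bootstrap.getProperty("contextConfigPath"));

        // Bootstrap values are the fallback for any key the host file omits.
        Properties effective = new Properties(bootstrap);
        try (Reader in = Files.newBufferedReader(configFile)) {
            effective.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return effective;
    }
}
```

Because only the file contents differ per host, the same exported job can be copied unchanged to test and production machines.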
Tips and Tricks
Use Bulk output components when possible.
For transactional behavior:
Start the job with a DB connection component
Check "Use an existing connection" in all relevant components
Check "Die on error" in all relevant components
End the job with a commit component
Room for Improvement
UI stability
Documentation
Books
Getting Started with Talend Open Studio for Data Integration by Jonathan Bowen
Talend Open Studio Cookbook by Rick Daniel Barton
Big Data book coming
Talend Forge
http://www.talendforge.org/
Forum - super helpful
Exchange - free community components!
Tutorials
Bug tracker
Source code
Talend Resources
http://www.talend.com/resources
Help Center
Knowledge Base
Webinars, screencasts
Tutorials
Docs are on download page
And by pressing F1 on a component
Questions?
Compliments?
Consulting gigs?
Contact me:
[email protected]
http://philip.yurchuk.com
http://www.linkedin.com/in/philipyurchuk/
Thank You!
