Intro To Talend Open Studio For Data Integration: Philip Yurchuk
This document provides an overview of Talend Open Studio for Data Integration. It describes Talend as an Eclipse-based visual programming tool that generates Java code for ETL (extract, transform, load) jobs that can run standalone or embedded. It notes that more than 600 components and 450 connectors are available for extracting and loading data from a variety of sources. The document also demonstrates some features, offers tips, and recommends resources for learning more about Talend.
Intro to Talend Open Studio for Data Integration
Philip Yurchuk
http://philip.yurchuk.com

What is Talend?
  Eclipse-based visual programming editor
  Generates executable Java code
  Jobs can run standalone or embedded (no special server)
  Batch or interactive (user input)

What is ETL?
  Extract: suck up data
  Transform: mess with it
  Load: blow it out
  Batch, integration, migration, etc.

Extract from/load to where?
  Over 600 components
  Over 450 connectors
  Allows multiple inputs/outputs in a single job

Connectors
  Flat files: Delimited (tab, CSV), XML, JSON, Excel, Positional, Apache HTTP logs, HL7...
  Applications/Platforms: Alfresco, Microsoft Dynamics (CRM, AX), SAP, Sage ERP X3, Salesforce, SugarCRM

Connectors (continued)
  Relational Databases: MySQL, PostgreSQL, MS SQL, Oracle, Teradata
  NoSQL/Columnar/OLAP/Other: Amazon Redshift, Greenplum, Hive, OLAP cubes, LDAP, VectorWise
  Many more; more in Big Data ed.

How do we transport data?
  File system
  FTP
  SFTP/SCP
  Web service (SOAP, REST)
  HTTP
  Mail, POP
  XMLRPC, Sockets, JMS, RSS...

Other Components
  Process data: join, filter, aggregate
  Flow control: loops, job invocation
  Logs, statistics
  Code: Java, Groovy
    On row data or standalone
    Can load libraries

Demo

Nifty Components
  FuzzyMatch - calculate Levenshtein distance or phonetic similarity
  IntervalMatch - perform lookup/join based on values falling within an interval
  Replace, ReplaceList - search and replace, substitution
  UniqRow - output distinct rows based on defined key columns

More Nifty Components
  XMLMap - allows joins, column or row filtering, transformations, and multiple outputs
  Normalize/Denormalize - split a delimited string into rows, or join rows into a delimited string
  AggregateRow - GROUP BY; min, max, sum, other functions used to aggregate rows on a column

Tips and Tricks
  CamelCase job names for embedded jobs. Or prefix with ETL phase and order of execution.
  Whenever appropriate (esp. for inserting data), use the schema from the repository.
  When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes.
  On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to select only the fields you need. Propagating the schema is useful then.

Tips and Tricks
  Failure handling subjob:
    It's an unconnected subjob (no triggers point to it)
    Use LogCatcher to catch and record component failures.
    Record the failure in a DB, file, email, etc.
    Add a rollback component to undo DB changes if necessary. May need to do this in the job itself if strategic placement is needed.
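
Aside on FuzzyMatch from the Nifty Components slide: Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The Java sketch below is a textbook implementation for illustration only. It is not Talend's FuzzyMatch code, and the names LevenshteinSketch and distance are made up for this example.

```java
// Textbook Levenshtein distance. Illustration only; NOT Talend's FuzzyMatch source.
// The class and method names are hypothetical.
final class LevenshteinSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty string into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
        System.out.println(distance("Talend", "Talent"));  // prints 1
    }
}
```

A small distance relative to the string length suggests two values are likely the same entity spelled differently, which is the kind of lookup a fuzzy match is meant for.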

Tips and Tricks
  In Java expressions, use methods, not operators. E.g., concat(String) instead of the + operator, equals(Object) instead of ==.
  Technical components (like hash maps) are hidden by default. See: http://www.talendforge.org/forum/viewtopic.php?pid=110860

Tips and Tricks
  Use a context for job variables.
  Note you can specify a type for each variable.
  You can read from a file or database, or pass in a context if it's an embedded Java job.

Tips and Tricks
  For multi-host deployment:
    Export the job with a bootstrap context that has all variables, but populates only a context config location that is the same for all machines.
    The context config file has all values required for that host, e.g. test DB connection for test machine.
    Windows resolves a leading / against the root of the current drive (usually the main system drive), so /Data/ will translate to C:\Data\ when the job runs from C:
    Be mindful of file permissions for sensitive context data (e.g., DB password)

Tips and Tricks
  Use Bulk output components when possible.
  For transactional behavior:
    Start the job with a DB connection component
    Check "Use an existing connection" in all relevant components
    Check "Die on error" in all relevant components
    End the job with a commit component

Room for Improvement
  UI stability
  Documentation

Books
  Getting Started with Talend Open Studio for Data Integration by Jonathan Bowen
  Talend Open Studio Cookbook by Rick Daniel Barton
  Big Data book coming

Talend Forge
  http://www.talendforge.org/
  Forum - super helpful
  Exchange - free community components!
  Tutorials
  Bug tracker
  Source code

Talend Resources
  http://www.talend.com/resources
  Help Center
  Knowledge Base
  Webinars, screencasts
  Tutorials
  Docs are on the download page
  And by pressing F1 on a component

Questions? Compliments? Consulting gigs?
  Contact me:
  [email protected]
  http://philip.yurchuk.com
  http://www.linkedin.com/in/philipyurchuk/

Thank You!
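
Footnote to the tip about preferring equals(Object) over == in Java expressions: in Java, == on two String values compares object references, not characters, so two strings with identical content can still compare as unequal. The sketch below shows the difference in plain Java. It does not use any Talend API, and the names StringCompareSketch and sameContent are invented for this example.

```java
// Plain-Java illustration of equals(Object) vs ==. No Talend API is used.
// The class and method names are hypothetical.
final class StringCompareSketch {

    // Null-safe content comparison: true only when both strings hold the same characters.
    static boolean sameContent(String a, String b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        // new String(...) forces a distinct object, standing in for a value read from a data row.
        String fromRow = new String("NY");
        String literal = "NY";

        System.out.println(fromRow == literal);            // false: different objects
        System.out.println(fromRow.equals(literal));       // true: same characters
        System.out.println(sameContent(fromRow, literal)); // true
        System.out.println(sameContent(null, literal));    // false, and no NullPointerException

        // concat(String) is the method form of string concatenation.
        System.out.println("New ".concat("York"));         // prints New York
    }
}
```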