Automating Data Pipelines: Moving away from Scripts and Excel
Kevin Scott
Director of Sales Engineering
Homegrown ETL solutions are common
Manual Process: Excel
Scripts: Excel, Python, SQL
Custom Applications: *-SQL, Java, C#
Motivation for choosing homegrown solutions
Naive assessment of the task
o “This is simple, we just need to…”
Urgency
o tight project deadline, no time for research/selection of third-party tools
Exceptional Requirements
o too challenging for a commercial off-the-shelf solution
Exceptional Team
o you have a highly skilled and available dev team eager to DIY
Historical Precedent
o you’ve always done it this way
Risks of choosing homegrown solutions
Feature Gaps
o new end points, new DQ issues
Lack of transparency
o Logging, alerting, auditing, error reporting
Age
o Needs age-related overhaul, or has accumulated cruft
Maintenance Costs
o dev team has moved on (or you need the dev to move on…)
o maintenance costs ripple beyond the actual maintenance task: what else could the team be working on?
Scaling Issues
o can’t keep up with increased demand
Homegrown ETL Solutions
Designed in-house to solve specific in-house data problems
Use some combination of
o Manual processes
o Desktop tools
o Scripts
o Libraries
o Programs
o Data storage
o Operating System Services
Using a Modern Data Integration Platform to properly automate your data pipelines, in a robust, scalable way, can eliminate these risks and save a significant amount of time.
CloverDX Data Integration Platform
In cloud, on premise, or hybrid
Automation of data workloads from A to Z
One place for solving the mundane and the complex
Productivity and trust for the enterprise
Data self-service for everyone
CloverDX Data Integration Platform helps with:
Replacing legacy/home-grown tooling
Data ingestion/onboarding
Operational data and application integration
Data migration
Data quality
Data for BI and reporting
CloverDX High-level Architecture
Case Study
Ingesting data from many sources for analysis
Case Study Scenario
Fintech Vertical
Business provides analysis services to credit unions
Accept input files from many client institutions
o Variable format
o Variable quality
Transform into standard format
Assess quality
Load into a warehouse for subsequent analysis
As a manual process?
As a scripted process?
Using the CloverDX Data Integration Platform…
End-to-end oversight of the ingest process
Steps include:
o Detecting arrival of client files to be ingested
o Detecting format and layout of client files
o Reading client files
o Transforming/Mapping
o Assessing quality
o Loading to target
o Detecting/Logging at every step
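To make the ingest steps concrete, here is a minimal Python sketch of what each one involves for a single delimited file. It is an illustration only, not how CloverDX implements these steps: the `member_id` and `balance` columns, the delimiter heuristic, and the in-memory `warehouse` list are all assumptions made for the example.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")


def detect_delimiter(header_line: str, candidates: str = ",;|\t") -> str:
    """Pick the candidate delimiter that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


def ingest(text: str, warehouse: list) -> dict:
    """Detect format, read, transform, validate, and load one file, logging each step."""
    if not text.strip():
        raise ValueError("empty file")

    # Detect format and layout (here: only the delimiter).
    delimiter = detect_delimiter(text.splitlines()[0])
    log.info("detected delimiter %r", delimiter)

    # Read the client file.
    rows = list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    log.info("read %d rows", len(rows))

    loaded = rejected = 0
    for row in rows:
        try:
            # Transform/map into the (hypothetical) standard layout and assess quality.
            member_id = (row.get("member_id") or "").strip()
            if not member_id:
                raise ValueError("missing member_id")
            record = {"member_id": member_id, "balance": float(row.get("balance") or "")}
        except ValueError as exc:
            rejected += 1
            log.warning("rejected row %r: %s", row, exc)
            continue
        # Load to target (a list stands in for the warehouse).
        warehouse.append(record)
        loaded += 1

    log.info("loaded %d rows, rejected %d", loaded, rejected)
    return {"read": len(rows), "loaded": loaded, "rejected": rejected}
```

Even this small sketch needs per-row error handling and logging to be trustworthy, which is the kind of hidden effort the "This is simple" assessment tends to miss.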
Detect data available for ingest
Match with client-specific processing rules
Read
Transform
Map
Validate
Load to warehouse
Update ingestion log
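The "match with client-specific processing rules" and "update ingestion log" boxes can be sketched as a small rules registry plus a log record. Everything here is hypothetical and for illustration only: the client names, the `<client>__<anything>` file naming convention, the column names, and the shape of the log entry are assumptions, not part of the case study or of CloverDX.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ClientRules:
    """Hypothetical per-client processing rules."""
    delimiter: str
    column_map: dict  # client column name -> standard column name


# Hypothetical registry; in practice this might live in configuration or a database.
RULES = {
    "acme_cu": ClientRules(",", {"MemberNo": "member_id", "Bal": "balance"}),
    "river_cu": ClientRules("|", {"id": "member_id", "amount": "balance"}),
}


def match_rules(filename: str) -> ClientRules:
    """Select rules from a '<client>__<anything>' file name (an assumed convention)."""
    client = filename.split("__", 1)[0]
    if client not in RULES:
        raise LookupError(f"no processing rules for client {client!r}")
    return RULES[client]


def map_row(row: dict, rules: ClientRules) -> dict:
    """Rename client columns to the standard layout, dropping unmapped columns."""
    return {std: row[src] for src, std in rules.column_map.items()}


def log_entry(filename: str, status: str, rows: int) -> dict:
    """Build one ingestion-log record with a UTC timestamp."""
    return {
        "file": filename,
        "status": status,
        "rows": rows,
        "at": datetime.now(timezone.utc).isoformat(),
    }
```

Keeping the rules as data rather than code means onboarding a new client institution is a registry change, not a new script.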
Orchestrating the ingest process
Ingest process details
Read, validate, transform, write, log error
CloverDX automates the ingest process
Run ingest jobs automatically, unattended
o Schedule jobs that look for files to onboard
o Listen for arrival of files to onboard
o Launch the onboarding process on-demand
Record all ingest activity
o Alerts when jobs fail
o Logs of every execution
o Graphical inspection of any run
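For comparison, the "schedule jobs that look for files" and "alerts when jobs fail" ideas can be approximated by hand as a polling pass over an inbox directory. This is a hedged sketch, not the CloverDX scheduler or listener API: `find_new_files`, `run_unattended`, and the `process` and `alert` callbacks are hypothetical names chosen for the example.

```python
from pathlib import Path


def find_new_files(inbox: Path, seen: set) -> list:
    """Return files in `inbox` not yet processed, sorted by name, and mark them seen."""
    new = sorted(p for p in inbox.iterdir() if p.is_file() and p.name not in seen)
    seen.update(p.name for p in new)
    return new


def run_unattended(inbox: Path, seen: set, process, alert) -> int:
    """One scheduled pass: process each new file and alert on every failure."""
    failures = 0
    for path in find_new_files(inbox, seen):
        try:
            process(path)
        except Exception as exc:  # broad on purpose: one bad file must not block the rest
            failures += 1
            alert(f"ingest failed for {path.name}: {exc}")
    return failures
```

A scheduler such as cron, or a loop with a sleep between passes, would call `run_unattended` repeatedly. Note what the sketch still lacks: the `seen` set is lost on restart, there is no rerun on demand, and there is no run history to inspect.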
Run ingest jobs automatically and unattended
(Re)run ingest jobs on demand
Continually monitor ingest jobs
Visually inspect ingest job failures
Use a Modern Data Integration Platform
Eliminate risks of using homegrown Scripts and Excel
Visually design your data jobs
Automate Execution
Instill confidence in operations
Save a significant amount of time
More on automated data ingestion with CloverDX:
www.cloverdx.com/solutions/data-ingest
Request a CloverDX demo:
www.cloverdx.com/demo
Q&A
www.cloverdx.com/webinars



Editor's Notes

  • #14: You can certainly envision how to do this manually. Open your favorite FTP program to grab the files, copy them to your local workspace, open them, visually inspect them. Run the data import wizard in your SQLWorkbench. You can also envision all the reasons this is impractical. Huge data files. Too many files. How often the process needs to run.
  • #16: You can probably also think about how to simplify the process and begin to automate. A shell script to pull the files from the FTP site. Choose your favorite animal from the O’Reilly menagerie: a scripting language for validation. SQL scripts to load data to the repository. Maybe add further efficiencies with more shell scripts to start hooking these steps together. Less time-consuming, but still rather ad hoc, still error-prone, and still taking staff resources away from more valuable work. CloverDX will allow you to automate this data management process: to orchestrate, monitor and alert across the entire workflow. Take people completely out of the loop, de-risking, removing sources of error, keeping logs of all activity and alerting the right people when errors occur and intervention is needed.