© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sourabh Bajaj, Software Engineer, Coursera
October 2015
BDT404
Large-Scale ETL Data Flows
With Data Pipeline and Dataduct
What to Expect from the Session
● Learn about:
• How we use AWS Data Pipeline to manage ETL at Coursera
• Why we built Dataduct, an open source framework from
Coursera for running pipelines
• How Dataduct enables developers to write their own ETL
pipelines for their services
• Best practices for managing pipelines
• How to start using Dataduct for your own pipelines
Coursera
Education at Scale
● 15 million learners worldwide
● 1300 courses
● 120 partners
● 2.5 million course completions
Data Warehousing at Coursera
Amazon Redshift
167 Amazon Redshift users
1200 EDW tables
22 source systems
6 dc1.8xlarge instances
30,000,000 queries run
Data Flow
(Diagram: Amazon RDS, Cassandra, event feeds, Amazon EC2, Amazon EMR, Amazon S3, Amazon Redshift, BI applications, and third-party tools, with AWS Data Pipeline)
ETL at Coursera
● 150 active pipelines
● 44 Dataduct developers
Requirements for an ETL system
● Fault tolerance
● Scheduling
● Dependency management
● Resource management
● Monitoring
● Easy development
Dataduct
● Open source wrapper around AWS Data Pipeline
● It provides:
• Code reuse
• Extensibility
• Command line interface
• Staging environment support
• Dynamic updates
Dataduct
● Repository
• https://github.com/coursera/dataduct
● Documentation
• http://dataduct.readthedocs.org/en/latest/
● Installation
• pip install dataduct
Let’s build some pipelines
Pipeline 1: Amazon RDS → Amazon Redshift
● Let’s start with a simple pipeline of pulling data from a
relational store to Amazon Redshift
(Diagram: Amazon RDS, Amazon EC2, Amazon S3, Amazon Redshift, AWS Data Pipeline)
Pipeline 1: Amazon RDS → Amazon Redshift
● Definition in YAML
● Steps
● Shared Config
● Visualization
● Overrides
● Reusable code
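A minimal sketch of what such a definition might look like once parsed, written as the Python structure a YAML file would load into. The field names, step type strings, and file paths are assumptions for illustration, not taken from the slides; the Dataduct documentation is the reference for the real schema.

```python
# Hypothetical pipeline definition, shown as the parsed form of a YAML file.
# All keys, step types, and paths are illustrative assumptions.
pipeline_definition = {
    "name": "rds_to_redshift_example",
    "frequency": "daily",
    "steps": [
        {"step_type": "extract-rds", "sql": "SELECT * FROM courses;"},
        {"step_type": "create-load-redshift",
         "table_definition": "tables/staging.courses.sql"},
        {"step_type": "upsert",
         "source": "tables/staging.courses.sql",
         "destination": "tables/prod.courses.sql"},
    ],
}

REQUIRED_KEYS = {"name", "frequency", "steps"}


def validate(definition):
    """Return a list of problems found in a pipeline definition."""
    missing = sorted(REQUIRED_KEYS - set(definition))
    problems = [f"missing key: {key}" for key in missing]
    for index, step in enumerate(definition.get("steps", [])):
        if "step_type" not in step:
            problems.append(f"step {index} has no step_type")
    return problems


print(validate(pipeline_definition))
```

Keeping the definition declarative means it can be checked before anything is deployed, which is one reason a YAML-based format is convenient.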
Pipeline 1: Amazon RDS → Amazon Redshift
(Steps)
● Extract RDS
• Fetch data from Amazon RDS and output to Amazon S3
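The extract step can be sketched as "run a query, write delimited rows". This is an illustration only: an in-memory SQLite database stands in for Amazon RDS, a string buffer stands in for the Amazon S3 object, and the table and function names are hypothetical.

```python
import csv
import io
import sqlite3


def extract_to_tsv(connection, query, out_file):
    """Run a query and write each result row as a tab-separated line.

    Returns the number of rows written.
    """
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    count = 0
    for row in connection.execute(query):
        writer.writerow(row)
        count += 1
    return count


# Stand-in source database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO courses VALUES (?, ?)",
    [(1, "Machine Learning"), (2, "Algorithms")],
)

buffer = io.StringIO()  # stand-in for the S3 object receiving the dump
rows_written = extract_to_tsv(
    conn, "SELECT id, name FROM courses ORDER BY id", buffer
)
print(rows_written)
print(buffer.getvalue())
```

Returning the row count from the extract makes it easy to compare against the number of rows later loaded into the warehouse.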
Pipeline 1: Amazon RDS → Amazon Redshift
(Steps)
● Create-Load-Redshift
• Create table if it doesn’t exist and load data using COPY.
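A rough sketch of the two statements such a step would issue against Amazon Redshift: a guarded CREATE TABLE followed by a COPY from Amazon S3. The helper name, table name, bucket path, and role ARN are placeholders, and the code only builds the SQL text rather than executing it.

```python
def create_load_statements(table, columns_sql, s3_path, iam_role):
    """Return SQL that creates the table if needed, then loads it with COPY.

    Values are interpolated directly, so this is only suitable for trusted,
    developer-supplied identifiers and paths.
    """
    create = f"CREATE TABLE IF NOT EXISTS {table} ({columns_sql});"
    copy = (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "DELIMITER '\\t';"
    )
    return [create, copy]


# All values below are placeholders for illustration.
statements = create_load_statements(
    table="staging.courses",
    columns_sql="id INTEGER, name VARCHAR(256)",
    s3_path="s3://example-bucket/courses/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
for statement in statements:
    print(statement)
```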
Pipeline 1: Amazon RDS → Amazon Redshift
● Upsert
• Update and insert into the production table from staging.
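One common way to express an upsert from a staging table is delete-then-insert inside a single transaction. The sketch below uses SQLite as a stand-in for Amazon Redshift, with hypothetical table names; it illustrates the pattern and is not necessarily how Dataduct implements the step.

```python
import sqlite3


def upsert(connection, staging, production, key):
    """Replace production rows whose key appears in staging, then insert
    every staging row. Both statements run in one transaction."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            f"DELETE FROM {production} "
            f"WHERE {key} IN (SELECT {key} FROM {staging})"
        )
        connection.execute(f"INSERT INTO {production} SELECT * FROM {staging}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod_courses (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE staging_courses (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO prod_courses VALUES (?, ?)",
    [(1, "Old name"), (2, "Algorithms")],
)
conn.executemany(
    "INSERT INTO staging_courses VALUES (?, ?)",
    [(1, "New name"), (3, "Databases")],
)

upsert(conn, "staging_courses", "prod_courses", "id")
result = conn.execute("SELECT id, name FROM prod_courses ORDER BY id").fetchall()
print(result)
```

Row 1 is updated, row 2 is untouched, and row 3 is inserted, which is the behavior the bullet describes.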
Pipeline 1: Amazon RDS → Amazon Redshift
(Tasks)
● Bootstrap
• Fully automated
• Fetch latest binaries from Amazon S3 for Amazon EC2 /
Amazon EMR
• Install any updated dependencies on the resource
• Ensure that the pipeline runs the latest version of the code
Pipeline 1: Amazon RDS → Amazon Redshift
● Quality assurance
• Primary key violations in the warehouse.
• Dropped rows: By comparing the number of rows.
• Corrupted rows: By comparing a sample set of rows.
• Automatically done within UPSERT
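The three checks above can be sketched in plain Python over lists of row dictionaries. The function names and sample data are hypothetical; in practice these comparisons would more likely run as SQL against the source and the warehouse.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Primary key violations: key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def dropped_row_count(source_rows, warehouse_rows):
    """Dropped rows: difference between source and warehouse row counts."""
    return len(source_rows) - len(warehouse_rows)


def corrupted_keys(source_rows, warehouse_rows, key, sample_keys):
    """Corrupted rows: sampled keys whose rows differ between the two sides."""
    source = {row[key]: row for row in source_rows}
    warehouse = {row[key]: row for row in warehouse_rows}
    return [k for k in sample_keys if source.get(k) != warehouse.get(k)]


source = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
warehouse = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "B"},  # value changed during the load
]

print(duplicate_keys(warehouse, "id"))
print(dropped_row_count(source, warehouse))
print(corrupted_keys(source, warehouse, "id", sample_keys=[1, 2]))
```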
Pipeline 1: Amazon RDS → Amazon Redshift
● Teardown
• Amazon SNS alerting for failed tasks
• Logging of task failures
• Monitoring
• Run times
• Retries
• Machine health
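A small sketch of retrying a task while logging failures and run times, and alerting only once retries are exhausted. The `alert` callback is a stand-in for publishing to an Amazon SNS topic; the function and parameter names are assumptions for illustration.

```python
import logging
import time

logger = logging.getLogger("etl")


def run_with_retries(task, max_attempts=3, delay_seconds=0.0, alert=print):
    """Run task(), retrying on any exception.

    Logs each failure and the run time of the successful attempt. Calls
    alert(message) and re-raises if the final attempt also fails.
    """
    for attempt in range(1, max_attempts + 1):
        started = time.monotonic()
        try:
            result = task()
        except Exception as error:
            logger.warning("attempt %d failed: %s", attempt, error)
            if attempt == max_attempts:
                alert(f"task failed after {max_attempts} attempts: {error}")
                raise
            time.sleep(delay_seconds)
        else:
            elapsed = time.monotonic() - started
            logger.info("attempt %d succeeded in %.3fs", attempt, elapsed)
            return result


# A task that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


alerts = []  # collects messages that would otherwise go to a notification topic
outcome = run_with_retries(flaky, alert=alerts.append)
print(outcome, calls["count"], alerts)
```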
Pipeline 1: Amazon RDS → Amazon Redshift
(Config)
● Visualization
• Automatically generated
by Dataduct
• Allows easy debugging
Pipeline 1: Amazon RDS → Amazon Redshift
● Shared Config
• IAM roles
• AMI
• Security group
• Retries
• Custom steps
• Resource paths
Pipeline 1: Amazon RDS → Amazon Redshift
● Custom steps
• Open-sourced steps can easily be shared across multiple
pipelines
• You can also create new steps and add them using the config
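One way to picture config-driven custom steps is a registry that maps a step type name to a class, so a pipeline definition can refer to a step by name. This is a generic sketch with hypothetical names, not Dataduct's actual extension mechanism.

```python
# Hypothetical registry mapping step type names to step classes.
STEP_REGISTRY = {}


def register_step(step_type):
    """Class decorator that records a step class under a step type name."""
    def decorator(cls):
        STEP_REGISTRY[step_type] = cls
        return cls
    return decorator


@register_step("row-count")
class RowCountStep:
    """Illustrative custom step; the name and behavior are made up."""

    def __init__(self, table):
        self.table = table

    def describe(self):
        return f"count rows in {self.table}"


def build_step(config):
    """Instantiate the step named by config['step_type'] with the other keys."""
    params = dict(config)
    step_class = STEP_REGISTRY[params.pop("step_type")]
    return step_class(**params)


step = build_step({"step_type": "row-count", "table": "prod.courses"})
print(step.describe())
```

With this shape, adding a step is a matter of defining a class and registering its name; existing pipeline definitions are unaffected.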
Deploying a pipeline
● Command line interface for all operations
usage: dataduct pipeline activate [-h] [-m MODE] [-f] [-t TIME_DELTA] [-b]
pipeline_definitions
[pipeline_definitions ...]
Pipeline 2: Cassandra → Amazon Redshift
(Diagram: Cassandra, Amazon S3, Amazon EMR (Aegisthus), Amazon S3, Amazon EMR (Scalding), Amazon Redshift, AWS Data Pipeline)
Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to the rescue
● Priam backups of Cassandra to Amazon S3
● Aegisthus to parse SSTables into Avro dumps
● Scalding to process Aegisthus output
• Extend the base steps to create more patterns
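Extending a base step could look roughly like the sketch below, where a generic shell-command step is specialized for running a jar. The class names, the `--input` / `--output` arguments, and the paths are hypothetical; the code only builds the command line and does not run it.

```python
import shlex


class ShellCommandStep:
    """Base step that describes one shell command to run on a resource."""

    def __init__(self, executable, arguments=()):
        self.executable = executable
        self.arguments = list(arguments)

    def command(self):
        """Return the command as a single shell-safe string."""
        return shlex.join([self.executable, *self.arguments])


class JarStep(ShellCommandStep):
    """Hypothetical specialization: run a jar with input and output paths."""

    def __init__(self, jar_path, input_path, output_path):
        super().__init__(
            "hadoop",
            ["jar", jar_path, "--input", input_path, "--output", output_path],
        )


# Placeholder paths for illustration only.
parse_step = JarStep(
    jar_path="s3://example-bucket/jars/job.jar",
    input_path="s3://example-bucket/sstables/",
    output_path="s3://example-bucket/avro/",
)
print(parse_step.command())
```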
Pipeline 2: Cassandra → Amazon Redshift
● Custom steps
• Aegisthus
• Scalding
Pipeline 2: Cassandra → Amazon Redshift
● EMR-Config overrides the defaults
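Per-pipeline overrides on top of shared defaults can be sketched as a recursive dictionary merge. The keys and values here are made up for illustration and are not Dataduct's actual configuration names.

```python
def merge_config(defaults, overrides):
    """Return a new dict where overrides win; nested dicts merge key by key."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical shared defaults and one pipeline's overrides.
default_config = {
    "emr": {"num_instances": 2, "instance_size": "m1.large"},
    "retries": 3,
}
pipeline_overrides = {"emr": {"num_instances": 10}}

effective = merge_config(default_config, pipeline_overrides)
print(effective)
```

Only the overridden key changes; everything else still comes from the shared defaults, and the defaults themselves are left unmodified.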
Pipeline 2: Cassandra → Amazon Redshift
● Multiple output nodes from transform step
Pipeline 2: Cassandra → Amazon Redshift
● Bootstrap
• Save every pipeline definition
• Fetch the new jar for the Amazon EMR jobs
• Specify the same Hadoop / Hive metastore installation
Data products
● So far we’ve covered getting data into the warehouse
● Common pattern:
• Wait for dependencies
• Computation inside Amazon Redshift to create derived tables
• Amazon EMR activities for more complex processing
• Load back into MySQL / Cassandra
• Product feature queries MySQL / Cassandra
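The "wait for dependencies" part of this pattern can be sketched as polling with a timeout, so a late upstream load surfaces as an explicit failure rather than a silent hang. The function and parameter names are assumptions; the readiness check would normally query whatever signals that an upstream table has loaded.

```python
import time


def wait_for(is_ready, timeout_seconds, poll_seconds=1.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or the timeout elapses.

    Returns True if the dependency became ready, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Simulated dependency that becomes ready on the third check.
answers = iter([False, False, True])
ready = wait_for(lambda: next(answers), timeout_seconds=5, poll_seconds=0)
print(ready)

# A dependency that never becomes ready times out instead of blocking forever.
timed_out = not wait_for(lambda: False, timeout_seconds=0, poll_seconds=0)
print(timed_out)
```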
● Used in recommendations, dashboards, and search
Recommendations
● Objective
• Connecting the learner to the right content
● Use cases:
• Recommendations email
• Course discovery
• Reactivation of users
Recommendations
● Computation inside Amazon Redshift to create derived
tables for co-enrollments
● Amazon EMR job for model training
● Model file pushed to Amazon S3
● Prediction API uses the updated model file
● The contract between the prediction layer and the training layer is
the model definition.
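To make "co-enrollments" concrete, here is a sketch that counts how many learners are enrolled in each pair of courses. It is a Python stand-in for a derived table that the slides describe as computed inside Amazon Redshift; the data and function name are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_enrollment_counts(enrollments):
    """Count, for each pair of courses, how many learners took both.

    enrollments is an iterable of (learner, course) tuples.
    """
    courses_by_learner = defaultdict(set)
    for learner, course in enrollments:
        courses_by_learner[learner].add(course)

    counts = Counter()
    for courses in courses_by_learner.values():
        # Sorting gives each pair one canonical order.
        for pair in combinations(sorted(courses), 2):
            counts[pair] += 1
    return counts


enrollments = [
    ("ann", "ml"), ("ann", "algo"),
    ("bob", "ml"), ("bob", "algo"), ("bob", "db"),
    ("cy", "ml"),
]
counts = co_enrollment_counts(enrollments)
print(counts.most_common(1))
```

Pair counts like these are a typical input to a recommendation model: courses frequently taken together are candidates to recommend to learners who have taken only one of them.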
Internal Dashboard
● Objective
• Serve internal dashboards to create a data driven culture
● Use Cases
• Key performance indicators for the company
• Track results for different A/B experiments
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
• Test environment in staging should mimic prod
• Shared library to democratize writing of ETL
• Using read-replicas and backups
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
• Pipelines that are not under version control
• Really huge pipelines instead of modular small pipelines with
dependencies
• Not catching resource timeouts or load delays
Dataduct
● Code reuse
● Extensibility
● Command line interface
● Staging environment support
● Dynamic updates
Questions?
Also, we are hiring!
https://www.coursera.org/jobs
Remember to complete
your evaluations!
Thank you!
Sourabh Bajaj
sb2nov
@sb2nov
Dele Amefo
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Eric D. Schabell
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
Andre Hora
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025
kashifyounis067
 
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Dele Amefo
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 

Large-Scale ETL Data Flows With Data Pipeline and Dataduct

  • 1. BDT404: Large-Scale ETL Data Flows With Data Pipeline and Dataduct ● Sourabh Bajaj, Software Engineer, Coursera ● October 2015 ● © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. What to Expect from the Session ● Learn about: • How we use AWS Data Pipeline to manage ETL at Coursera • Why we built Dataduct, an open source framework from Coursera for running pipelines • How Dataduct enables developers to write their own ETL pipelines for their services • Best practices for managing pipelines • How to start using Dataduct for your own pipelines
  • 6. Education at Scale ● 120 partners ● 1300 courses ● 15 million learners worldwide ● 2.5 million course completions
  • 7. Data Warehousing at Coursera (Amazon Redshift) ● 167 Amazon Redshift users ● 1200 EDW tables ● 22 source systems ● 6 dc1.8xlarge instances ● 30,000,000 queries run
  • 8. Data Flow ● Diagram components: Amazon RDS, Cassandra, event feeds, Amazon EC2, Amazon EMR, Amazon S3, Amazon Redshift, BI applications, third-party tools, AWS Data Pipeline
  • 13. ETL at Coursera ● 150 active pipelines ● 44 Dataduct developers
  • 14. Requirements for an ETL system ● Fault tolerance ● Scheduling ● Dependency management ● Resource management ● Monitoring ● Easy development
  • 23. Dataduct ● Open source wrapper around AWS Data Pipeline ● It provides: • Code reuse • Extensibility • Command line interface • Staging environment support • Dynamic updates
  • 24. Dataduct ● Repository • https://github.com/coursera/dataduct ● Documentation • http://dataduct.readthedocs.org/en/latest/ ● Installation • pip install dataduct
  • 25. Let’s build some pipelines
  • 26. Pipeline 1: Amazon RDS → Amazon Redshift ● Let’s start with a simple pipeline that pulls data from a relational store into Amazon Redshift ● Diagram components: Amazon RDS, Amazon EC2, Amazon S3, Amazon Redshift, AWS Data Pipeline
  • 27. Pipeline 1: Amazon RDS → Amazon Redshift
  • 28. Pipeline 1: Amazon RDS → Amazon Redshift ● Definition in YAML ● Steps ● Shared Config ● Visualization ● Overrides ● Reusable code
  • 29. Pipeline 1: Amazon RDS → Amazon Redshift (Steps) ● Extract RDS • Fetch data from Amazon RDS and output to Amazon S3
  • 30. Pipeline 1: Amazon RDS → Amazon Redshift (Steps) ● Create-Load-Redshift • Create table if it doesn’t exist and load data using COPY.
  • 31. Pipeline 1: Amazon RDS → Amazon Redshift ● Upsert • Update and insert into the production table from staging.
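
The upsert step on slide 31 can be pictured as a delete-then-insert inside one transaction, a common pattern on Amazon Redshift. The sketch below is an assumption about what such a step might run, not Dataduct's actual implementation; the function `build_upsert_sql` and the table and column names are hypothetical.

```python
def build_upsert_sql(staging, production, primary_keys):
    """Return SQL statements that merge staging rows into production.

    Hypothetical helper: rows in production whose primary key also appears
    in staging are deleted, then every staging row is inserted.
    """
    if not primary_keys:
        raise ValueError("at least one primary key column is required")
    # Match rows on every primary key column.
    match = " AND ".join(
        f"{production}.{key} = {staging}.{key}" for key in primary_keys
    )
    return [
        "BEGIN;",
        f"DELETE FROM {production} USING {staging} WHERE {match};",
        f"INSERT INTO {production} SELECT * FROM {staging};",
        "COMMIT;",
    ]


# Example with made-up table names.
statements = build_upsert_sql("users_staging", "users", ["user_id"])
for statement in statements:
    print(statement)
```

Wrapping both statements in a single transaction means readers of the production table never see the intermediate state where the old rows are gone but the new ones have not arrived.
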
  • 32. Pipeline 1: Amazon RDS → Amazon Redshift (Tasks) ● Bootstrap • Fully automated • Fetch latest binaries from Amazon S3 for Amazon EC2 / Amazon EMR • Install any updated dependencies on the resource • Ensure that the pipeline runs the latest version of the code
  • 33. Pipeline 1: Amazon RDS → Amazon Redshift ● Quality assurance • Primary key violations in the warehouse. • Dropped rows: By comparing the number of rows. • Corrupted rows: By comparing a sample set of rows. • Automatically done within UPSERT
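
The three quality checks on slide 33 can be sketched in plain Python over in-memory rows. This is only an illustration under stated assumptions: in the talk the checks run automatically within the upsert, and the function names, the `id` key, and the sample data here are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Primary key violations: key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def dropped_row_count(source_rows, warehouse_rows):
    """Dropped rows: difference between source and warehouse row counts."""
    return len(source_rows) - len(warehouse_rows)


def corrupted_keys(source_rows, warehouse_rows, key, sample_keys):
    """Corrupted rows: sampled keys whose rows differ between the two sides."""
    source = {row[key]: row for row in source_rows}
    warehouse = {row[key]: row for row in warehouse_rows}
    return sorted(k for k in sample_keys if source.get(k) != warehouse.get(k))


# Made-up data: row 3 was dropped and row 2 was altered during the load.
source = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
warehouse = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "x"},
]

print(duplicate_keys(warehouse, "id"))
print(dropped_row_count(source, warehouse))
print(corrupted_keys(source, warehouse, "id", sample_keys=[1, 2]))
```
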
  • 34. Pipeline 1: Amazon RDS → Amazon Redshift ● Teardown • Amazon SNS alerting for failed tasks • Logging of task failures • Monitoring • Run times • Retries • Machine health
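
The retry, logging, and alerting behaviour listed on slide 34 can be sketched as a small wrapper. This is a minimal sketch, not how AWS Data Pipeline or Dataduct implement it: `run_with_retries` is a hypothetical name, and the `notify` callback stands in for an Amazon SNS publish call so the example runs without AWS credentials.

```python
import logging

logger = logging.getLogger("etl")


def run_with_retries(task, max_retries, notify):
    """Run task, retrying on failure; alert and re-raise once retries run out."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    attempts = max_retries + 1
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            # Log every failure so run history can be inspected later.
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            last_error = exc
    # All attempts failed: send one alert, then surface the last error.
    notify(f"task failed after {attempts} attempts: {last_error}")
    raise last_error


# Example: a task that fails twice and then succeeds.
alerts = []
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"


result = run_with_retries(flaky, max_retries=2, notify=alerts.append)
print(result, alerts)
```
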
  • 35. Pipeline 1: Amazon RDS → Amazon Redshift (Config) ● Visualization • Automatically generated by Dataduct • Allows easy debugging
  • 36. Pipeline 1: Amazon RDS → Amazon Redshift ● Shared Config • IAM roles • AMI • Security group • Retries • Custom steps • Resource paths
  • 37. Pipeline 1: Amazon RDS → Amazon Redshift ● Custom steps • Open-sourced steps can easily be shared across multiple pipelines • You can also create new steps and add them using the config
  • 38. Deploying a pipeline ● Command line interface for all operations usage: dataduct pipeline activate [-h] [-m MODE] [-f] [-t TIME_DELTA] [-b] pipeline_definitions [pipeline_definitions ...]
  • 39. Pipeline 2: Cassandra → Amazon Redshift ● Diagram components: Cassandra, Amazon S3, Amazon EMR (Aegisthus), Amazon S3, Amazon EMR (Scalding), Amazon Redshift, AWS Data Pipeline
  • 43. Pipeline 2: Cassandra → Amazon Redshift ● Shell command activity to the rescue ● Priam backups of Cassandra to Amazon S3 ● Aegisthus to parse SSTables into Avro dumps ● Scalding to process Aegisthus output • Extend the base steps to create more patterns
  • 44. Pipeline 2: Cassandra → Amazon Redshift
  • 45. Pipeline 2: Cassandra → Amazon Redshift ● Custom steps • Aegisthus • Scalding
  • 46. Pipeline 2: Cassandra → Amazon Redshift ● EMR-Config overrides the defaults
  • 47. Pipeline 2: Cassandra → Amazon Redshift ● Multiple output nodes from transform step
  • 48. Pipeline 2: Cassandra → Amazon Redshift ● Bootstrap • Save every pipeline definition • Fetch the new jar for the Amazon EMR jobs • Specify the same Hadoop / Hive metastore installation
  • 49. Data products ● So far we’ve talked about getting data into the warehouse ● Common pattern: • Wait for dependencies • Computation inside Amazon Redshift to create derived tables • Amazon EMR activities for more complex processing • Load back into MySQL / Cassandra • Product feature queries MySQL / Cassandra ● Used in recommendations, dashboards, and search
  • 50. Recommendations ● Objective • Connecting the learner to the right content ● Use cases: • Recommendations email • Course discovery • Reactivation of users
  • 51. Recommendations ● Computation inside Amazon Redshift to create derived tables for co-enrollments ● Amazon EMR job for model training ● Model file pushed to Amazon S3 ● Prediction API uses the updated model file ● Contract between the prediction and the training layer is via model definition.
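
To make the co-enrollment derived table on slide 51 concrete, here is a small Python sketch of the same idea: counting how many learners are enrolled in each pair of courses. The talk does this computation inside Amazon Redshift; the Python version, the function name `co_enrollment_counts`, and the sample learners and courses are illustrative assumptions only.

```python
from collections import Counter
from itertools import combinations


def co_enrollment_counts(enrollments):
    """Count learners shared by each unordered pair of courses."""
    courses_by_learner = {}
    for learner, course in enrollments:
        courses_by_learner.setdefault(learner, set()).add(course)

    counts = Counter()
    for courses in courses_by_learner.values():
        # Sorting gives each pair one canonical (a, b) ordering.
        for pair in combinations(sorted(courses), 2):
            counts[pair] += 1
    return counts


# Made-up (learner, course) enrollment rows.
enrollments = [
    ("ann", "ml"), ("ann", "stats"),
    ("bob", "ml"), ("bob", "stats"), ("bob", "algo"),
    ("cat", "ml"),
]
counts = co_enrollment_counts(enrollments)
print(counts.most_common())
```

Pairs with high counts are natural candidates for "learners who took this also took that" features, which a training job could then consume.
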
  • 52. Internal Dashboard ● Objective • Serve internal dashboards to create a data-driven culture ● Use cases • Key performance indicators for the company • Track results for different A/B experiments
  • 58. Learnings ● Do: • Monitoring (run times, retries, deploys, query times) • Keep code in a library instead of passing scripts to every pipeline • Make the staging test environment mimic prod • Use a shared library to democratize writing of ETL • Use read replicas and backups
  • 62. Learnings ● Don’t: • Pass the same code to multiple pipelines as a script • Leave pipelines out of version control • Build really huge pipelines instead of small modular pipelines with dependencies • Leave resource timeouts or load delays uncaught
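
The advice to prefer small modular pipelines with dependencies implies that something must work out a valid run order. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+). It does not reflect how AWS Data Pipeline or Dataduct resolve dependencies, and the pipeline names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipelines mapped to the pipelines they must wait for.
dependencies = {
    "load_users": set(),
    "load_enrollments": set(),
    "co_enrollments": {"load_users", "load_enrollments"},
    "recommendations": {"co_enrollments"},
}


def run_order(dependencies):
    """Return one valid execution order; raises CycleError on circular deps."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(dependencies)
print(order)
```

Each small pipeline can then be tested and rerun on its own, while the ordering guarantees that derived tables are built only after their inputs have loaded.
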
  • 63. Dataduct ● Code reuse ● Extensibility ● Command line interface ● Staging environment support ● Dynamic updates
  • 64. Dataduct ● Repository • https://github.com/coursera/dataduct ● Documentation • http://dataduct.readthedocs.org/en/latest/ ● Installation • pip install dataduct
  • 65. Questions? Also, we are hiring! https://www.coursera.org/jobs
  • 67. Thank you! Also, we are hiring! https://www.coursera.org/jobs Sourabh Bajaj sb2nov @sb2nov