Building serverless analytics pipelines with AWS Glue - Tom McMeekin
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
There are more people accessing data
And more requirements for making data available
Data engineering
• Data stewardship
• Data modeling
• Data structures
• Data pipelines: Extract, Transform, Load
• Data lakes
• Data warehouse
• Data marts
AWS Glue
Serverless data catalog and ETL service
AWS Glue crawlers
• Automatically build your Data Catalog and keep it in sync
• Built-in classifiers; custom classifiers using Grok expressions
Sources: OLTP, ERP, CRM, LOB, devices, sensors -> AWS Glue Data Catalog
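The crawler setup above can be sketched with boto3. This is a minimal sketch, not the presenter's code: the classifier name, Grok pattern, role ARN, database, bucket path, and schedule are all hypothetical values, and the helper functions are illustrative. The request shapes follow the boto3 Glue `create_classifier` and `create_crawler` calls.

```python
def build_grok_classifier_request(name, classification, grok_pattern):
    """Keyword arguments for the Glue `create_classifier` call (custom Grok classifier)."""
    return {
        "GrokClassifier": {
            "Name": name,
            "Classification": classification,
            "GrokPattern": grok_pattern,
        }
    }


def build_crawler_request(name, role_arn, database, s3_path,
                          classifiers=(), schedule=None):
    """Keyword arguments for the Glue `create_crawler` call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Classifiers": list(classifiers),
        # Keep the Data Catalog in sync when the crawler sees schema changes.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
    if schedule is not None:
        request["Schedule"] = schedule
    return request


def register(glue, classifier_request, crawler_request):
    """`glue` is a boto3 Glue client, e.g. boto3.client("glue")."""
    glue.create_classifier(**classifier_request)
    glue.create_crawler(**crawler_request)


# Hypothetical example values, for illustration only.
classifier_request = build_grok_classifier_request(
    name="app-log-classifier",
    classification="app_log",
    grok_pattern="%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}",
)
crawler_request = build_crawler_request(
    name="raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="raw_db",
    s3_path="s3://example-bucket/raw/",
    classifiers=["app-log-classifier"],
    schedule="cron(0 2 * * ? *)",
)
```

With credentials configured, `register(boto3.client("glue"), classifier_request, crawler_request)` would create both resources; building the requests separately keeps them easy to inspect first.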
AWS Glue Data Catalog
Amazon QuickSight
Amazon SageMaker
Use AWS Glue to cleanse, prep, and move
Apache Spark and AWS Glue ETL
DataFrames DynamicFrame
DataFrames and DynamicFrames
DataFrames
• Core data structure for SparkSQL
• Like structured tables
• Need schema up front
• Each row has same structure
• Suited for SQL-like analytics
DynamicFrames
• Like DataFrames for ETL
• Designed for processing semi-structured data, e.g., JSON, Avro, Apache logs ...
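The contrast above can be illustrated without Spark. The following is a plain-Python sketch of the idea only, not the AWS Glue API: each record carries its own structure, so no schema is needed up front, and a field seen with more than one type is kept as a set of candidate types until it is resolved. The function names `infer_schema` and `resolve_choice` and the sample records are hypothetical.

```python
def infer_schema(records):
    """Union the fields seen across all records.

    Returns a dict mapping each field name to the set of Python type
    names observed for it.
    """
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def resolve_choice(records, field, cast):
    """Cast one ambiguous field to a single type; records lacking it are copied as-is."""
    return [
        {**record, field: cast(record[field])} if field in record else dict(record)
        for record in records
    ]


# Semi-structured input: rows do not share one structure.
records = [
    {"id": 1, "status": "ok"},
    {"id": "2", "status": "ok", "tags": ["a", "b"]},
    {"id": 3},
]

schema = infer_schema(records)           # "id" was seen as both int and str
cleaned = resolve_choice(records, "id", int)
cleaned_schema = infer_schema(cleaned)   # "id" is now int everywhere
```

A fixed-schema table would reject or null out the second record up front; here the ambiguity is recorded and resolved as an explicit ETL step.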
Developer endpoints / Notebooks
AWS Glue: Job execution (serverless)
• There is no need to provision, configure, or manage servers
• Connect to on-premises JDBC data stores as source
Diagram: AWS Glue, Amazon RDS, Database
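As a rough sketch of what "no servers" looks like in practice, a job definition only names a script, a role, a capacity, and optionally a catalog connection for a JDBC source. The job name, role ARN, script location, and connection name below are hypothetical, and the helpers are illustrative; the request shape follows the boto3 Glue `create_job` and `start_job_run` calls.

```python
def build_job_request(name, role_arn, script_location,
                      connections=(), max_capacity=10.0):
    """Keyword arguments for the Glue `create_job` call."""
    request = {
        "Name": name,
        "Role": role_arn,
        # "glueetl" selects a Spark ETL job; the script lives in Amazon S3.
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        # Capacity is requested in DPUs; there are no instances to manage.
        "MaxCapacity": max_capacity,
    }
    if connections:
        # Names of Data Catalog connections, e.g. to a JDBC data store.
        request["Connections"] = {"Connections": list(connections)}
    return request


def create_and_start(glue, request):
    """`glue` is a boto3 Glue client. Returns the id of the started run."""
    glue.create_job(**request)
    return glue.start_job_run(JobName=request["Name"])["JobRunId"]


# Hypothetical example values, for illustration only.
job_request = build_job_request(
    name="optimize-job",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/etl_job.py",
    connections=["onprem-jdbc-connection"],
)
```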
Three ways to orchestrate an AWS Glue ETL pipeline
• Schedule driven
• Event driven
• State machine driven
Schedule driven: Work backwards from a daily SLA deadline
Crawl raw dataset -> Run “optimize” job -> Crawl optimized dataset -> Ready for reporting (SLA deadline)
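Working backwards from the deadline is simple arithmetic: subtract each step's expected duration, last step first, to get the latest start time for every step. The sketch below assumes a 06:00 UTC deadline and made-up durations; only the step names come from the slide. The `cron(...)` strings use the six-field format that AWS Glue schedules accept (minutes, hours, day of month, month, day of week, year), in UTC.

```python
from datetime import datetime, timedelta


def work_backwards(deadline, steps):
    """Return [(step_name, latest_start)] for steps given in execution order.

    `steps` is a list of (name, expected_duration) tuples.
    """
    plan = []
    cursor = deadline
    for name, duration in reversed(steps):
        cursor -= duration            # this step must start by `cursor`
        plan.append((name, cursor))
    plan.reverse()
    return plan


def glue_cron(moment):
    """Daily schedule expression for the time of day in `moment` (UTC)."""
    return f"cron({moment.minute} {moment.hour} * * ? *)"


# Assumed values: the deadline and durations are illustrative only.
deadline = datetime(2019, 6, 3, 6, 0)
steps = [
    ("crawl raw dataset", timedelta(minutes=20)),
    ("run optimize job", timedelta(minutes=60)),
    ("crawl optimized dataset", timedelta(minutes=10)),
]

plan = work_backwards(deadline, steps)
schedules = {name: glue_cron(start) for name, start in plan}
```

In practice each duration would carry some slack, since a purely schedule-driven pipeline does not check that the previous step has actually finished.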
Event driven: Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
Crawl raw dataset -> Run “optimize” job -> Crawl optimized dataset -> Ready for reporting (SLA deadline)
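One way to wire this up is a single Lambda function subscribed to AWS Glue state-change events, which starts the next step when the previous one succeeds. This is a sketch under assumptions: the crawler and job names are hypothetical, and the event fields used (`detail-type`, `detail.crawlerName`, `detail.jobName`, `detail.state`) are assumed to match the events AWS Glue publishes to CloudWatch Events; check them against a real event before relying on this.

```python
# Hypothetical routing table: what to start when a crawler or job succeeds.
NEXT_STEP = {
    ("Glue Crawler State Change", "raw-crawler"): ("job", "optimize-job"),
    ("Glue Job State Change", "optimize-job"): ("crawler", "optimized-crawler"),
}

# Assumed success values: crawlers report "Succeeded", jobs "SUCCEEDED".
SUCCESS_STATES = {"Succeeded", "SUCCEEDED"}


def next_action(event):
    """Return ("job" | "crawler", name) for the step to start, or None."""
    detail = event.get("detail", {})
    if detail.get("state") not in SUCCESS_STATES:
        return None
    name = detail.get("crawlerName") or detail.get("jobName")
    return NEXT_STEP.get((event.get("detail-type"), name))


def handle(event, glue):
    """Start the next step, if any. `glue` is a boto3 Glue client."""
    action = next_action(event)
    if action is None:
        return None
    kind, name = action
    if kind == "job":
        glue.start_job_run(JobName=name)
    else:
        glue.start_crawler(Name=name)
    return action


def lambda_handler(event, context):
    import boto3  # provided by the AWS Lambda Python runtime
    return handle(event, boto3.client("glue"))
```

Passing the client into `handle` keeps the routing logic testable without AWS access.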
State machine driven: Let AWS Step Functions drive the pipeline
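A state machine for the same crawl, optimize, crawl sequence could look roughly like the definition below, written as a Python dict and serialized to Amazon States Language JSON. This is a sketch, not the definition used in the talk: the Lambda ARNs and job name are hypothetical, and the two crawl steps are assumed to be Lambda functions that start a crawler and wait for it. The job step uses the Step Functions service integration for AWS Glue (`glue:startJobRun.sync`), which waits for the run to finish.

```python
import json


def build_state_machine(crawl_raw_lambda_arn, job_name, crawl_optimized_lambda_arn):
    """Return an Amazon States Language definition as a Python dict."""
    return {
        "Comment": "Crawl raw data, run the optimize job, crawl the result",
        "StartAt": "CrawlRaw",
        "States": {
            "CrawlRaw": {
                "Type": "Task",
                "Resource": crawl_raw_lambda_arn,
                "Next": "RunOptimizeJob",
            },
            "RunOptimizeJob": {
                "Type": "Task",
                # .sync makes the state wait until the Glue job run completes.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                "Next": "CrawlOptimized",
            },
            "CrawlOptimized": {
                "Type": "Task",
                "Resource": crawl_optimized_lambda_arn,
                "End": True,
            },
        },
    }


# Hypothetical ARNs and job name, for illustration only.
definition = build_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:crawl-raw",
    "optimize-job",
    "arn:aws:lambda:us-east-1:123456789012:function:crawl-optimized",
)
definition_json = json.dumps(definition, indent=2)
```

Compared with the event-driven variant, ordering, retries, and failure handling live in one definition instead of being spread across event rules.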
Demo
Data engineering and DevOps
• CI/CD
• Configuration management
• Feature flags
CI/CD for AWS Glue ETL
• Help data engineers write quality code
• Automate the ETL job release management process
• Mitigate risk
AWS CodePipeline
CI/CD for AWS Glue ETL (AWS CodePipeline diagram, built up over four slides)
• AWS CodeCommit: pipe_line_template.yaml, etl_job.py, live_test.py
• AWS CloudFormation: pipe_line_template.yaml, Role, etl_job.py, Amazon S3 (Raw data), Amazon S3 (Test data)
• live_test.py: AWS Glue Data Catalog, Amazon S3 (Raw data), Amazon S3 (Test data)
• Production: Role, etl_job.py, Amazon S3 (Raw data), Amazon S3 (Prod data)
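The deck does not show the contents of live_test.py. As a purely hypothetical sketch of what such a pipeline test step might do, the helper below starts the deployed job against test data, polls until the run reaches a terminal state, and reports that state so the pipeline can pass or fail the stage. The function name, polling interval, and job name are assumptions; `start_job_run` and `get_job_run` are boto3 Glue client calls.

```python
import time

# Job run states after which polling can stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def run_job_and_wait(glue, job_name, poll_seconds=30, max_polls=120,
                     sleep=time.sleep):
    """Start a Glue job run and block until it finishes.

    `glue` is a boto3 Glue client. Returns the final JobRunState.
    Raises TimeoutError if the run is still active after `max_polls` checks.
    """
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    for _ in range(max_polls):
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} run {run_id} did not finish in time")


def live_test(glue, job_name="etl-job-test"):
    """Fail loudly if the test-stage job run does not succeed."""
    state = run_job_and_wait(glue, job_name)
    assert state == "SUCCEEDED", f"job ended in state {state}"
```

A fuller test would also inspect the output in the test bucket or the Data Catalog; that part depends on the job and is left out here.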
Learning Resources
Learn from AWS experts. Advance your skills and
knowledge. Build your future in the AWS Cloud.
Why work with an APN Partner?
APN Partners are uniquely positioned to help your organization at any stage of your cloud adoption journey, and they:
• Share your goals, focused on your success
• Help you take full advantage of all the business benefits that AWS has to offer
• Provide services and solutions to support any AWS use case across your full customer life cycle
APN Partners with deep expertise in AWS services:
• AWS Managed Service Provider (MSP) Partners: APN Partners with cloud infrastructure and application migration expertise
• AWS Competency Partners: APN Partners with verified, vetted, and validated specialized offerings
• AWS Service Delivery Partners: APN Partners with a track record of delivering specific AWS services to customers
twitter.com/AWSCloud
facebook.com/AmazonWebServices
youtube.com/user/AmazonWebServices
slideshare.net/AmazonWebServices
twitch.tv/aws