
Building serverless analytics pipelines with AWS Glue


Tom McMeekin
Solutions Architect, AWS

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
There are more people accessing data

And more requirements for making data available

Data stewardship
• Data modeling
• Data structures

Data engineering
• Data lakes
• Data warehouse
• Data marts
• Data pipelines: Extract, Transform, Load
AWS Glue
Serverless data catalog and ETL service

AWS Glue crawlers

• Automatically build your Data Catalog and keep it in sync
• Built-in classifiers; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless

[Diagram: data sources (OLTP, ERP, CRM, LOB, devices, sensors, web, social); Amazon S3 data lake storage; AWS Glue Data Catalog]
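The crawler options above can be sketched as request parameters for the AWS Glue API. This is a minimal illustration, not the presenter's setup: the bucket path, role, database, classifier name, and Grok pattern are all hypothetical. The functions only build dictionaries, so the block runs without AWS credentials.

```python
"""Sketch: parameters for a scheduled AWS Glue crawler with a custom Grok classifier.

All names (bucket path, role, database, classifier, pattern) are hypothetical.
"""


def grok_classifier_request(name, classification, grok_pattern):
    """Build keyword arguments for the Glue `create_classifier` call."""
    return {
        "GrokClassifier": {
            "Name": name,
            "Classification": classification,
            "GrokPattern": grok_pattern,
        }
    }


def crawler_request(name, role, database, s3_path, classifiers=(), schedule=None):
    """Build keyword arguments for the Glue `create_crawler` call.

    Leave `schedule` as None to run the crawler ad hoc (on demand).
    """
    request = {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Classifiers": list(classifiers),
    }
    if schedule is not None:
        request["Schedule"] = schedule  # e.g. "cron(0 2 * * ? *)" for 02:00 UTC daily
    return request


classifier = grok_classifier_request(
    name="app-log-classifier",   # hypothetical
    classification="app_logs",   # hypothetical
    grok_pattern="%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
)
crawler = crawler_request(
    name="raw-sales-crawler",                  # hypothetical
    role="GlueCrawlerRole",                    # hypothetical IAM role
    database="sales",
    s3_path="s3://example-data-lake/raw/sales/",
    classifiers=[classifier["GrokClassifier"]["Name"]],
    schedule="cron(0 2 * * ? *)",
)

# With credentials configured, the requests could be sent like this:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_classifier(**classifier)
#   glue.create_crawler(**crawler)
#   glue.start_crawler(Name=crawler["Name"])  # ad hoc run
```

Keeping the request builders separate from the API calls makes the configuration easy to review and unit test before anything is created in an account.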

AWS Glue Data Catalog

• Search metadata for data discovery
• Single view across all users, accounts, and workloads

[Diagram: Amazon S3 data lake storage and the AWS Glue Data Catalog, with Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon EMR, and Amazon SageMaker]
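Metadata search can be sketched with the Glue `search_tables` API call. The example below is an assumption-laden illustration: `find_tables` is a hypothetical helper, and `FakeGlueClient` is an offline stand-in that only mimics the response shape, so the block runs without AWS access. In a real account you would pass `boto3.client("glue")` instead.

```python
def find_tables(glue_client, search_text, max_results=25):
    """Return (database, table, location) tuples for tables matching `search_text`."""
    response = glue_client.search_tables(SearchText=search_text, MaxResults=max_results)
    matches = []
    for table in response.get("TableList", []):
        location = table.get("StorageDescriptor", {}).get("Location")
        matches.append((table["DatabaseName"], table["Name"], location))
    return matches


class FakeGlueClient:
    """Offline stand-in that mimics the shape of a `search_tables` response."""

    def search_tables(self, SearchText, MaxResults):
        # Hypothetical catalog contents, used only for this illustration.
        tables = [
            {"DatabaseName": "sales", "Name": "data_lake",
             "StorageDescriptor": {"Location": "s3://example-data-lake/raw/sales/"}},
            {"DatabaseName": "web", "Name": "clickstream",
             "StorageDescriptor": {"Location": "s3://example-data-lake/raw/web/"}},
        ]
        hits = [t for t in tables
                if SearchText in t["DatabaseName"] or SearchText in t["Name"]]
        return {"TableList": hits[:MaxResults]}


results = find_tables(FakeGlueClient(), "sales")
for database, table, location in results:
    print(f"{database}.{table} -> {location}")
```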

Use AWS Glue to cleanse, prep, and move

• Serverless Apache Spark or Python environment
• Auto-generate, write, or bring your own Python or Scala code

[Diagram: AWS Glue Data Catalog; Amazon S3 (Raw data), Amazon S3 (Staging data), Amazon S3 (Processed data)]

Apache Spark and AWS Glue ETL

• Apache Spark is a distributed data processing engine for complex analytics
• AWS Glue builds on Apache Spark to offer ETL-specific functionality

[Diagram: Apache SparkSQL on Apache Spark DataFrames, alongside AWS Glue ETL on AWS Glue DynamicFrames, both on Apache Spark Core: RDDs]

DataFrames and DynamicFrames

DataFrames
• Core data structure for SparkSQL
• Like structured tables
• Need schema up front
• Each row has same structure
• Suited for SQL-like analytics

DynamicFrames
• Like DataFrames for ETL
• Designed for processing semi-structured data, e.g., JSON, Avro, Apache logs ...
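The contrast can be illustrated in plain Python, without Spark or AWS Glue. This is only a conceptual sketch: it collects the types seen per field instead of demanding a schema up front, then resolves one ambiguous field. The `resolve_choice` helper is hypothetical. It is named after Glue's ResolveChoice transform but is not that API.

```python
import json

# Semi-structured input: fields vary per record, and "price" changes type.
raw_lines = [
    '{"id": 1, "price": 9.99, "tags": ["new"]}',
    '{"id": 2, "price": "12.50"}',
    '{"id": 3, "price": 5, "region": "apac"}',
]
records = [json.loads(line) for line in raw_lines]


def infer_schema(rows):
    """Collect, per field, the set of Python type names seen across all rows."""
    schema = {}
    for row in rows:
        for field, value in row.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def resolve_choice(rows, field, cast):
    """Cast one ambiguous field to a single type, leaving other fields untouched."""
    return [
        {**row, field: cast(row[field])} if field in row else dict(row)
        for row in rows
    ]


schema = infer_schema(records)            # "price" has been seen as float, str, and int
resolved = resolve_choice(records, "price", float)
resolved_schema = infer_schema(resolved)  # "price" is now float only
```

A table-like structure with a fixed schema would have to reject or coerce these rows at load time; deferring the decision lets the ETL job choose how to resolve each ambiguous field.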

Developer endpoints / Notebooks

• Connect your IDE to an AWS Glue development endpoint
• Environment to interactively develop, debug, and test ETL code

[Diagram: Raw dataset, Data Catalog, Amazon SageMaker Notebook, Optimized dataset]

AWS Glue: Job execution (serverless)
There is no need to provision, configure, or manage servers

• Specify the capacity that gets allocated to each job
• Pay only for the resources you consume
• Auto-configure VPC and role-based access
• Connect to on-premises JDBC data stores as source

[Diagram: AWS Glue, VPC, Amazon RDS database, AWS Direct Connect, corporate data center]

Three ways to orchestrate an AWS Glue ETL pipeline

• Schedule driven

• Event driven

• State machine driven

Schedule driven: Work backwards from a daily SLA deadline

[Timeline: Crawl raw dataset, Run “optimize” job, Crawl optimized dataset, Ready for reporting, SLA deadline]
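Working backwards from the deadline is simple date arithmetic. The sketch below assumes worst-case stage durations, a safety buffer, and an 08:00 UTC deadline, all of which are hypothetical values chosen for illustration. It emits start times in the cron syntax that AWS Glue schedules use.

```python
from datetime import datetime, timedelta

# Hypothetical worst-case durations for each stage, in pipeline order.
STAGES = [
    ("crawl raw dataset", timedelta(minutes=15)),
    ("run optimize job", timedelta(minutes=60)),
    ("crawl optimized dataset", timedelta(minutes=15)),
]


def plan_backwards(deadline, stages, buffer=timedelta(minutes=30)):
    """Return [(stage, start_time)] so the last stage ends `buffer` before `deadline`."""
    plan = []
    end = deadline - buffer
    for name, duration in reversed(stages):
        start = end - duration
        plan.append((name, start))
        end = start  # the previous stage must finish when this one starts
    return list(reversed(plan))


def to_glue_cron(moment):
    """Format a daily schedule expression in the cron syntax AWS Glue uses (UTC)."""
    return f"cron({moment.minute} {moment.hour} * * ? *)"


sla_deadline = datetime(2019, 6, 3, 8, 0)  # hypothetical: reports due at 08:00 UTC
plan = plan_backwards(sla_deadline, STAGES)
schedules = {name: to_glue_cron(start) for name, start in plan}

for name, expression in schedules.items():
    print(f"{name}: {expression}")
```

The weakness of this approach is that each start time depends on an estimated duration: if a stage overruns its estimate, the next stage starts on stale data.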

Event driven: Let Amazon CloudWatch Events and AWS Lambda drive the pipeline

[Timeline: Crawl raw dataset, Run “optimize” job, Crawl optimized dataset, Ready for reporting, SLA deadline]
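An event-driven pipeline can be sketched as a Lambda handler that reacts to AWS Glue state-change events and starts the next step. The crawler and job names are hypothetical, and the event fields used here (`detail-type`, `crawlerName`, `jobName`, `state`) reflect my understanding of the documented Glue event format; verify them against a sample event before relying on this. `RecordingClient` is an offline stand-in so the block runs without AWS access.

```python
# Hypothetical resource names; replace with the crawlers and job in your account.
RAW_CRAWLER = "raw-sales-crawler"
OPTIMIZE_JOB = "optimize-sales"
OPTIMIZED_CRAWLER = "optimized-sales-crawler"


def next_action(event):
    """Map a Glue state-change event to the next pipeline step, or None."""
    detail = event.get("detail", {})
    kind = event.get("detail-type")
    if kind == "Glue Crawler State Change" and detail.get("state") == "Succeeded":
        if detail.get("crawlerName") == RAW_CRAWLER:
            return ("start_job_run", {"JobName": OPTIMIZE_JOB})
    if kind == "Glue Job State Change" and detail.get("state") == "SUCCEEDED":
        if detail.get("jobName") == OPTIMIZE_JOB:
            return ("start_crawler", {"Name": OPTIMIZED_CRAWLER})
    return None


def handle(event, glue_client):
    """Run the next step with the given Glue client; returns the action taken."""
    action = next_action(event)
    if action is not None:
        method, kwargs = action
        getattr(glue_client, method)(**kwargs)
    return action


def lambda_handler(event, context):
    """AWS Lambda entry point; boto3 is available in the Lambda Python runtime."""
    import boto3
    return handle(event, boto3.client("glue"))


class RecordingClient:
    """Offline stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def start_job_run(self, **kwargs):
        self.calls.append(("start_job_run", kwargs))

    def start_crawler(self, **kwargs):
        self.calls.append(("start_crawler", kwargs))


client = RecordingClient()
handle({"source": "aws.glue", "detail-type": "Glue Crawler State Change",
        "detail": {"crawlerName": RAW_CRAWLER, "state": "Succeeded"}}, client)
handle({"source": "aws.glue", "detail-type": "Glue Job State Change",
        "detail": {"jobName": OPTIMIZE_JOB, "state": "SUCCEEDED"}}, client)
```

Each step starts as soon as the previous one reports success, so no time is lost to padding between scheduled start times.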

State machine driven: Let AWS Step Functions drive the pipeline
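A state machine definition for this can be sketched in the Amazon States Language, built here as a Python dictionary and serialized to JSON. The job names, state names, and retry settings are hypothetical. The sketch uses the Step Functions service integration for Glue jobs (`glue:startJobRun.sync`), which waits for the job to finish. Crawler steps are left out; they would need another mechanism, such as a Lambda task.

```python
import json


def glue_job_state(job_name, next_state=None):
    """Task state that starts a Glue job and waits for it to finish (.sync)."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 60,
            "MaxAttempts": 2,
            "BackoffRate": 2.0,
        }],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


definition = {
    "Comment": "Sketch of a two-job Glue ETL pipeline",
    "StartAt": "StageRawData",
    "States": {
        # Job names below are hypothetical.
        "StageRawData": glue_job_state("stage-sales", next_state="OptimizeData"),
        "OptimizeData": glue_job_state("optimize-sales"),
    },
}
definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

The resulting JSON could be supplied as the definition when creating the state machine. Sequencing, retries, and failure handling then live in one declarative document rather than in scattered schedules or handlers.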

Demo

Explore correlations among online user engagement metrics, forecasted sales revenue, and opportunities.

[Diagram: Data engineering and DevOps: chaos engineering, canary deployments, CI/CD, configuration management, feature flags]

CI/CD for AWS Glue ETL
• Help data engineers write quality code
• Automate the ETL job release management process
• Mitigate risk

[Diagram: AWS CodePipeline]

CI/CD for AWS Glue ETL

[Diagram: AWS CodePipeline with AWS CodeCommit; files: pipe_line_template.yaml, etl_job.py, live_test.py]

CI/CD for AWS Glue ETL

[Diagram: AWS CodePipeline with AWS CodeCommit and AWS CloudFormation; pipe_line_template.yaml, Role, etl_job.py; Amazon S3 (Raw data), Amazon S3 (Test data)]

CI/CD for AWS Glue ETL

[Diagram: AWS CodeCommit, AWS CloudFormation, AWS CodeBuild; live_test.py; AWS Glue Data Catalog; Amazon S3 (Raw data), Amazon S3 (Test data)]

CI/CD for AWS Glue ETL

Amazon Athena:
SELECT count(*) FROM "sales"."data_lake";
SELECT count(*) FROM "sales_parquet"."test_data";

[Diagram: AWS CodePipeline with AWS CodeCommit, AWS CloudFormation, AWS CodeBuild; Amazon Athena; Amazon S3 (Data lake), Amazon S3 (Test data)]
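One plausible reading of this step is that the live test compares row counts between the source table and the ETL output. The sketch below shows how such a check might look with the Athena API; it is not the presenter's live_test.py. The helper names, results location, and `FakeAthena` stand-in are hypothetical, and the stand-in lets the block run without AWS access.

```python
import time


def run_count(athena_client, sql, output_location, poll_seconds=1.0):
    """Run a `SELECT count(*)` query in Athena and return the count as an int."""
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
    rows = athena_client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # Row 0 is the header row; row 1 holds the single count value.
    return int(rows[1]["Data"][0]["VarCharValue"])


def counts_match(athena_client, output_location):
    """Compare row counts of the source table and the ETL output table."""
    source = run_count(
        athena_client, 'SELECT count(*) FROM "sales"."data_lake"', output_location)
    target = run_count(
        athena_client, 'SELECT count(*) FROM "sales_parquet"."test_data"', output_location)
    return source == target


class FakeAthena:
    """Offline stand-in that mimics the Athena response shapes used above."""

    def __init__(self, counts):
        self._counts = counts  # table name -> row count
        self._results = {}

    def start_query_execution(self, QueryString, ResultConfiguration):
        query_id = f"q{len(self._results)}"
        table = next(name for name in self._counts if name in QueryString)
        self._results[query_id] = self._counts[table]
        return {"QueryExecutionId": query_id}

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": "SUCCEEDED"}}}

    def get_query_results(self, QueryExecutionId):
        value = str(self._results[QueryExecutionId])
        return {"ResultSet": {"Rows": [
            {"Data": [{"VarCharValue": "_col0"}]},
            {"Data": [{"VarCharValue": value}]},
        ]}}


RESULTS = "s3://example-athena-results/"  # hypothetical query results location
ok = counts_match(FakeAthena({"data_lake": 1000, "test_data": 1000}), RESULTS)
mismatch = counts_match(FakeAthena({"data_lake": 1000, "test_data": 990}), RESULTS)
```

In a build stage, a script like this would exit with a non-zero status when the counts differ, which stops the pipeline before the job is promoted.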
CI/CD for AWS Glue ETL

[Diagram: AWS CodePipeline with AWS CodeCommit, AWS CloudFormation, AWS CodeBuild, AWS CloudFormation; pipe_line_template.yaml, Role, etl_job.py; Amazon S3 (Raw data), Amazon S3 (Prod data)]

Go learn

• Remember the three steps to build a serverless data pipeline

• Use AWS Glue features

• Leverage the breadth of the AWS offering

Learning Resources

• AWS Glue Learning Resources: https://amzn.to/2XC5y6Y
• Implementing CI/CD for AWS Glue ETL: https://amzn.to/2NlVvwZ
• Orchestrate ETL jobs with AWS Step Functions: https://amzn.to/2TmL3bL

Learn from AWS experts. Advance your skills and knowledge. Build your future in the AWS Cloud.

• Digital Training: Free, self-paced online courses built by AWS experts
• Classroom Training: Classes taught by accredited AWS instructors
• AWS Certification: Exams to validate expertise with an industry-recognized credential

Ready to begin building your cloud skills?
Get started at: https://www.aws.training/

Why work with an APN Partner?

APN Partners are uniquely positioned to help your organization at any stage of your cloud adoption journey, and they:
• Share your goals, focused on your success
• Help you take full advantage of all the business benefits that AWS has to offer
• Provide services and solutions to support any AWS use case across your full customer life cycle

APN Partners with deep expertise in AWS services:
• AWS Managed Service Provider (MSP) Partners: APN Partners with cloud infrastructure and application migration expertise
• AWS Competency Partners: APN Partners with verified, vetted, and validated specialized offerings
• AWS Service Delivery Partners: APN Partners with a track record of delivering specific AWS services to customers

Find the right APN Partner for your needs: https://aws.amazon.com/partners/find/


Thank you for attending AWS Innovate
We hope you found it interesting! A kind reminder to complete the survey.
Let us know what you thought of today’s event and how we can improve the event experience for you in the future.

twitter.com/AWSCloud
facebook.com/AmazonWebServices
youtube.com/user/AmazonWebServices
slideshare.net/AmazonWebServices
twitch.tv/aws