AWS Glue Technical
Enablement Training
Kyle Escosia
Jr. Data Science Specialist
Info Alchemy
Agenda
AWS Glue Overview
AWS Glue Concepts
AWS Glue Deep Dive Components
AWS Glue Configurations (VPC, Security Groups, VPN, etc.)
Reference Architectures
Recent innovations
Complementary AWS Services (DataBrew, Lake Formation, AWS API)
Data at scale
Growing exponentially
From new sources
Increasingly diverse
Used by many people
Analyzed by many applications
Why data preparation?
Data preparation is the first mile of
Analytics
Business Intelligence
Machine Learning
Data preparation is hard
Lots of data!
Data grows fast: 10x every 5 years
Data is more diverse
Needs customization
Most jobs hand-coded
Brittle and error prone
Infrastructure management
Machine / instance sizing
Cluster lifecycle management
Scheduling and monitoring
Managing metastores
AWS Glue has evolved
Then: Fully managed extract-transform-load (ETL) service. For developers, built by developers.
Now: Serverless data preparation service. For ETL developers, data engineers, data scientists, business analysts, and more.
Select AWS Glue customers
Building data lakes
Break silos, store data in Amazon S3 (data lake storage)
Sources: Amazon RDS, other databases, on-premises data, streaming data
AWS Glue crawlers load and maintain the Data Catalog
AWS Glue jobs and workflows ingest, process, and refine data in stages
AWS Lake Formation permissions secure the data lake
Access data lakes via a variety of cloud analytic engines
AWS Glue Concepts
AWS Glue
Fully managed, serverless ETL service for developers and data scientists
Serverless review
No infrastructure provisioning, no management
Automatic scaling
Pay for value
Highly available and secure
Easily de-duplicate your data with ML transforms
AWS Glue Usage and Pricing
ETL Jobs
No resources to manage
Charged hourly based on Data Processing Units (DPUs): $0.44 per DPU-hour
1 DPU provides 4 vCPU and 16 GB of memory
Three types: Apache Spark, Python Shell, Spark Streaming
Data Catalog
Free for the first million objects stored (table, table version, partition, or database)
$1.00 per 100,000 objects stored above 1M, per month
Crawlers
Charged hourly based on Data Processing Units (DPUs)
$0.44 per DPU-hour, billed per second, with a 10-minute minimum per crawler run
With AWS Glue, you only pay for the time your ETL job takes to run.
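The pricing rules above can be turned into a quick estimate. The sketch below uses only the rates quoted on this slide, which may differ from current AWS pricing; the function names are illustrative, and the Data Catalog charge is assumed to scale proportionally rather than in whole blocks of 100,000 objects.

```python
# Rates as quoted on the slide; verify against the current AWS Glue pricing page.
DPU_HOUR_RATE = 0.44            # USD per DPU-hour
CRAWLER_MIN_SECONDS = 10 * 60   # 10-minute minimum per crawler run
FREE_CATALOG_OBJECTS = 1_000_000
CATALOG_RATE_PER_100K = 1.00    # USD per 100,000 objects above the free tier, per month


def dpu_cost(dpus, seconds, min_seconds=0):
    """Cost of one run billed per second, with an optional minimum duration."""
    billed_seconds = max(seconds, min_seconds)
    return dpus * (billed_seconds / 3600) * DPU_HOUR_RATE


def catalog_storage_cost(objects):
    """Monthly Data Catalog storage cost (assumes proportional billing above 1M objects)."""
    billable = max(0, objects - FREE_CATALOG_OBJECTS)
    return (billable / 100_000) * CATALOG_RATE_PER_100K


# A 10-DPU job that runs for 30 minutes.
job_cost = dpu_cost(dpus=10, seconds=30 * 60)
# A 2-DPU crawler that finishes in 2 minutes is still billed for 10 minutes.
crawler_cost = dpu_cost(dpus=2, seconds=120, min_seconds=CRAWLER_MIN_SECONDS)
print(round(job_cost, 2), round(crawler_cost, 4), catalog_storage_cost(1_500_000))
```

The crawler example shows why very short, frequent crawler runs can cost more than their runtime suggests.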
AWS Glue Deep Dive Components
Security: IAM Permissions – A refresher
IAM Users
consist of a username and a password
IAM Groups
collection of users
IAM Role
an identity used to delegate access to AWS resources
IAM Service Role
a role that a service assumes to perform actions in your
account on your behalf
IAM Policy
an entity that, when attached to an identity, defines that identity's permissions
AWS Glue Permissions
Follow the least privilege access principle
Requires an IAM Role
AWS Managed Policy: AWSGlueServiceRole
Custom Policy – fine-grained access
Some related services
Amazon S3, Amazon Redshift, Amazon CloudWatch
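As a rough illustration of the points above, the sketch below builds the two JSON documents a Glue role typically needs: a trust policy that lets the Glue service assume the role, and a custom policy scoped to a single S3 prefix. The bucket and prefix names are hypothetical, and a real job would usually also need the AWSGlueServiceRole managed policy or an equivalent.

```python
import json

# Trust policy: allows the AWS Glue service to assume the role on your behalf.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def s3_prefix_policy(bucket, prefix):
    """Least-privilege S3 policy limited to one prefix (bucket and prefix are hypothetical)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read and write objects only under the given prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Listing is granted on the bucket but restricted to the prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


print(json.dumps(TRUST_POLICY, indent=2))
print(json.dumps(s3_prefix_policy("my-data-lake", "raw/pos"), indent=2))
```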
AWS Glue Components
Crawlers
Load and maintain Data Catalog
Infer metadata: schema, table structure
Supports schema evolution
AWS Glue Data Catalog
Apache Hive Metastore compatible
Many integrated analytic services
Extract, transform, and load
Serverless execution
Apache Spark / Python shell jobs
Interactive development
Auto-generate ETL code
Workflow management
Orchestrate triggers, crawlers, and jobs
Build and monitor complex flows
Reliable execution
AWS Glue is used to cleanse, prep, and catalog
AWS Glue Data Catalog
Workflows orchestrate data flows
Process data in stages
Crawlers populate/maintain catalog
Jobs execute ETL transforms
What are crawlers?
Automatically discover new data and extract schema definitions
Detect schema changes and maintain tables
Detect Apache Hive-style partitions on Amazon S3
Built-in classifiers for popular data types
Create your own custom classifier using Grok expressions
Run on demand, on a schedule, or as part of workflows
Crawlers discover structure
Handles complex, nested fields
Detects Hive-style partitions
What can crawlers classify?
Use exclude patterns to remove unnecessary files
To ignore all METADATA.txt files under the folder year=2017 for the location s3://mydatasets:
Include path: s3://mydatasets
Exclude pattern: year=2017/**/METADATA.txt
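The exclude pattern above maps onto the `Exclusions` list of an S3 target in the Glue CreateCrawler API. The sketch below only builds the request arguments so it can run without AWS access; the crawler name, role ARN, and database name are hypothetical.

```python
def crawler_request(name, role_arn, database, s3_path, exclusions):
    """Build keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": s3_path, "Exclusions": list(exclusions)}]
        },
    }


request = crawler_request(
    name="mydatasets-crawler",                                   # hypothetical
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # hypothetical
    database="mydatasets_db",                                    # hypothetical
    s3_path="s3://mydatasets",
    exclusions=["year=2017/**/METADATA.txt"],
)
print(request["Targets"])

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```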
Improve performance with multiple crawlers
Periodically audit long running crawlers to balance workloads
Often crawlers are processing multiple datasets / tables
Improve performance by using multiple crawlers
Crawler granularity is table or dataset
What is an AWS Glue job?
An AWS Glue job encapsulates the business logic that performs extract, transform, and load (ETL) work
• A core building block in your production ETL pipeline
• Provide your PySpark ETL script or have one automatically generated
• Supports a rich set of built-in AWS Glue transformations
• Jobs can be started, stopped, monitored
Under the hood: Apache Spark and AWS Glue ETL
• Apache Spark is a distributed data processing engine with rich support for complex analytics
• AWS Glue builds on the Apache Spark runtime to offer ETL-specific functionality
Layers, top to bottom:
SparkSQL | AWS Glue ETL
Spark DataFrames | AWS Glue DynamicFrames
Spark Core: RDDs
Apache Spark – What is it?
Processing Framework Layer: MapReduce, Spark, Tez
Cluster Resource Management: YARN, Mesos
Distributed Storage Layer: HDFS, Cassandra, NoSQL
Let’s try that again..
Think of a beehive as your distributed storage
A beehive needs to have a queen
The queen serves as your Spark driver
The worker bees serve as your worker nodes
Putting it together..
Spark Driver (the Queen)
Runs the main method
Generates the Spark Context
Has access to the Resource Manager
Resource Manager
Executors (the Worker Bees)
Each executor has its own cache
DataFrames and DynamicFrames
DataFrames
Core data structure for SparkSQL
Like structured tables
Need schema upfront
Each row has same structure
Suited for SQL-like analytics
DynamicFrames
Like DataFrames for ETL
Designed for processing semi-structured data, e.g., JSON, Avro, Apache logs
Schema per record, no upfront schema needed
Easy to restructure, tag, modify
Can be more compact than DataFrame rows
Many flows can be done in single pass
Dynamic Frame internals
Dynamic records (each record carries its own fields, here id and type)
{"id": "2489", "type": "CreateEvent", "payload": {"creator": …}, …}
{"id": 4391, "type": "PullEvent", "payload": {"assets": …}, …}
{"id": "6510", "type": "PushEvent", "payload": {"pusher": …}, …}
Dynamic Frame schema
id, type
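The records above mix a string id with an integer id, which is the kind of per-record variation a DynamicFrame is meant to tolerate. The plain-Python sketch below is not the Glue implementation; it only illustrates the idea of deriving a frame-level schema from per-record schemas, flagging fields that were seen with more than one type. The payload field is left out because its contents are elided on the slide.

```python
from collections import defaultdict

# The three records from the slide, without the elided payload field.
records = [
    {"id": "2489", "type": "CreateEvent"},
    {"id": 4391, "type": "PullEvent"},
    {"id": "6510", "type": "PushEvent"},
]


def infer_schema(rows):
    """Collect the Python type names observed for each field across all records."""
    seen = defaultdict(set)
    for row in rows:
        for field, value in row.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def ambiguous_fields(schema):
    """Fields observed with more than one type need to be resolved before writing."""
    return [field for field, types in schema.items() if len(types) > 1]


schema = infer_schema(records)
print(schema)                    # {'id': ['int', 'str'], 'type': ['str']}
print(ambiguous_fields(schema))  # ['id']
```

In a Glue job, a field like id would typically be resolved to a single type before writing to a columnar format.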
AWS Glue execution model: jobs and stages
Job 1
Stage 1: Read, Filter, Apply Mapping
Stage 2: Repartition, Write
Job 2
Stage 1: Read, Filter, Apply Mapping, Show
Actions: Write and Show
Jobs: each action starts a job
AWS Glue execution model: data partitions
• Apache Spark and AWS Glue are data parallel.
• Data is divided into partitions that are processed concurrently.
• 1 stage x 1 partition = 1 task
Driver
Executors
Overall throughput is limited by the number of partitions
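The "1 stage x 1 partition = 1 task" rule can be worked through with a few lines of arithmetic. This is a simplified model that assumes every core runs one task at a time and all tasks take equally long; the function names are illustrative.

```python
import math


def task_count(stages, partitions):
    """1 stage x 1 partition = 1 task."""
    return stages * partitions


def busy_slots(partitions, executors, cores_per_executor):
    """Task slots that can be kept busy: limited by partitions or by available cores."""
    return min(partitions, executors * cores_per_executor)


def waves(partitions, executors, cores_per_executor):
    """Rounds of tasks needed to finish one stage."""
    slots = executors * cores_per_executor
    return math.ceil(partitions / slots)


# 8 partitions on 10 executors with 4 cores each: only 8 of 40 slots are used.
print(busy_slots(8, 10, 4))    # 8
# 100 partitions on the same cluster: all slots busy, finished in 3 waves.
print(waves(100, 10, 4))       # 3
print(task_count(2, 100))      # 200
```

The first case is what "throughput is limited by the number of partitions" means in practice: adding executors does not help until the data is split into more partitions.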
Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
File formats
• Text – xSV, JSON
• May or may not be compressed
• Human readable when uncompressed
• Not optimized for analytics
• Columnar – Parquet & ORC
• Compressed in a binary format
• Integrated indexes and stats
• Optimized read performance when selecting only a subset of columns
• Row – Avro
• Compressed in a binary format
• Optimized read performance when selecting all columns of a subset of rows
Partitioning guidance
• Choose columns that have low cardinality (few unique values)
• Partitioning on day/month/year has 365 unique values per year
• Partitioning on seconds has millions of values per year
• You can partition on any column, not just date
• For example, s3://abc-corp-sales-data/country=xx/state=xx/bu=xx
• Look at your query patterns – what data do you want to query, and what do you want to filter out?
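A small sketch of the guidance above: one helper builds a Hive-style `key=value` prefix like the example path, and another counts distinct values to check a column's cardinality before partitioning on it. The sample rows and partition values are made up for illustration.

```python
def partition_prefix(base, partitions):
    """Build a Hive-style key=value prefix; dict order sets the folder order."""
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return "/".join([base.rstrip("/")] + parts)


def cardinality(rows, column):
    """Number of distinct values in a column; lower is better for a partition key."""
    return len({row[column] for row in rows})


rows = [  # hypothetical sales records
    {"country": "us", "state": "wa", "bu": "retail", "order_id": 1001},
    {"country": "us", "state": "ca", "bu": "retail", "order_id": 1002},
    {"country": "ph", "state": "ncr", "bu": "online", "order_id": 1003},
]

print(partition_prefix("s3://abc-corp-sales-data",
                       {"country": "us", "state": "wa", "bu": "retail"}))
# s3://abc-corp-sales-data/country=us/state=wa/bu=retail
print(cardinality(rows, "country"), cardinality(rows, "order_id"))  # 2 3
```

A column such as order_id, which is unique per row, would create one partition per record and is a poor partition key.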
Worker Types
Standard
Provide the maximum capacity of DPUs (max. 100)
Each worker provides 4 vCPUs of compute capacity, 16 GB of memory, 50 GB disk, and 2 executors
G.1X
Provide the number of workers (max. 299)
A worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk) and 1 executor per worker
Recommended for memory-intensive jobs
G.2X
Provide the number of workers (max. 149)
A worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB disk) and 1 executor per worker
Recommended for memory-intensive jobs that run ML Transforms
Performance best practices
• Avoid unnecessary jobs and stages where possible
• Ensure your data can be partitioned to utilize the entire cluster
• Identify resource bottlenecks and pick the best worker type
• Use G.1X and G.2X instances when your jobs need lots of memory
• Executor memory issues happen most often during sort and shuffle
operations
• The driver most often runs out of memory when processing a very
large number of input partitions
What is an AWS Glue trigger?
Triggers are the “glue” in your AWS Glue ETL pipeline
Triggers
• Can be used to chain multiple AWS Glue jobs in a series
• Can start multiple jobs at once
• Can be scheduled, on-demand, or based on job events
• Can pass unique parameters to customize AWS Glue job runs
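The trigger capabilities listed above correspond to fields of the Glue CreateTrigger API. The sketch below builds the arguments for a conditional trigger that starts several jobs once an upstream job succeeds and passes them a parameter; the trigger name, job names, and the `--run_date` argument are hypothetical.

```python
def conditional_trigger(name, upstream_job, downstream_jobs, arguments=None):
    """Keyword arguments for Glue CreateTrigger: start jobs after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        # One action per downstream job, so several jobs can start at once.
        "Actions": [
            {"JobName": job, "Arguments": dict(arguments or {})}
            for job in downstream_jobs
        ],
    }


trigger = conditional_trigger(
    name="after-ingest",                          # hypothetical
    upstream_job="ingest-pos",                    # hypothetical
    downstream_jobs=["clean-pos", "audit-pos"],   # hypothetical
    arguments={"--run_date": "2020-11-28"},       # hypothetical job parameter
)
print(trigger["Actions"])

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("glue").create_trigger(**trigger)
```

A scheduled trigger would use `"Type": "SCHEDULED"` with a `Schedule` cron expression in place of the predicate.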
Three ways to set up an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State machine–driven
Schedule-driven AWS Glue ETL pipeline
We work our way backward from a daily SLA deadline
Event-driven AWS Glue ETL pipeline
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
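One way an event-driven pipeline could be wired up is a Lambda function that receives Glue job state-change events from CloudWatch Events and starts the next job. The handler below is a sketch under assumptions: the job names are hypothetical, the event fields (`detail-type`, `detail.jobName`, `detail.state`) should be checked against a real event from your account, and the `glue` parameter exists only so the function can be exercised without AWS access.

```python
# Hypothetical mapping from a finished job to the job that should run next.
NEXT_JOB = {"ingest-pos": "clean-pos"}


def handler(event, context, glue=None):
    """Start the downstream job when an upstream Glue job run succeeds."""
    if event.get("detail-type") != "Glue Job State Change":
        return None
    detail = event.get("detail", {})
    if detail.get("state") != "SUCCEEDED":
        return None
    next_job = NEXT_JOB.get(detail.get("jobName"))
    if next_job is None:
        return None
    if glue is None:
        import boto3  # provided by the AWS Lambda Python runtime
        glue = boto3.client("glue")
    response = glue.start_job_run(JobName=next_job)
    return response["JobRunId"]


class FakeGlue:
    """Stand-in client for local runs; records which jobs were started."""

    def __init__(self):
        self.started = []

    def start_job_run(self, JobName):
        self.started.append(JobName)
        return {"JobRunId": "jr_local_test"}


fake = FakeGlue()
event = {
    "detail-type": "Glue Job State Change",
    "detail": {"jobName": "ingest-pos", "state": "SUCCEEDED"},
}
print(handler(event, None, glue=fake), fake.started)
```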
Example ETL flow
Goal: prepare and analyze POS data
Create and run a job that will
• Consume data in S3
• Join the data
• Select only the required columns
• Fill null values
• Write the results to a data lake on Amazon Simple Storage Service (Amazon S3)
Monitor the running job
Analyze the resulting dataset
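The transform steps in this flow (join, select columns, fill null values) can be illustrated without a Spark cluster. The sketch below uses plain Python lists of dictionaries in place of DynamicFrames; the table contents and column names are invented, and reading from and writing to S3 are left out.

```python
# Hypothetical POS inputs standing in for two tables read from S3.
sales = [
    {"store_id": 1, "sku": "A", "qty": 2, "note": "promo"},
    {"store_id": 2, "sku": "B", "qty": None, "note": None},
]
stores = [
    {"store_id": 1, "city": "Manila"},
    {"store_id": 2, "city": "Cebu"},
]


def join(left, right, key):
    """Inner join two lists of dicts on a shared key."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def select(rows, columns):
    """Keep only the required columns."""
    return [{col: row.get(col) for col in columns} for row in rows]


def fill_nulls(rows, defaults):
    """Replace None with the per-column default, when one is given."""
    return [
        {col: (defaults.get(col, val) if val is None else val) for col, val in row.items()}
        for row in rows
    ]


joined = join(sales, stores, "store_id")
result = fill_nulls(select(joined, ["store_id", "city", "qty"]), {"qty": 0})
print(result)
```

In an actual Glue job the same steps would typically be expressed with built-in transforms or Spark operations over data read through the Data Catalog.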
What are workflows and how do they work?
DAGs with triggers, jobs, and crawlers
Graphical canvas for authoring workflows
Run / rerun and monitor workflow executions
Share parameters across entities in the workflow
Workflow building blocks
Building workflows
Build workflows with:
Graphical canvas
APIs
AWS CloudFormation templates
Monitoring workflows
Easily monitor / see:
workflows running now
completed workflows
status / errors
Incremental data processing with job bookmarks
Bookmarks are per-job checkpoints that track the work done in previous runs. They persist the state of sources, transforms, and sinks on each run.
Track previously processed data
Enable | disable | pause bookmarks on sources
Roll back to a previous state if necessary
Example uses:
Process POS data files daily
Process log files hourly
Track timestamps or primary keys in DBs
Track generated foreign keys for normalization
[Diagram: run 1, run 2, run 3]
Job bookmark options
Option: Behavior
Enable: Pick up from where you left off
Disable: Ignore and process the entire dataset every time
Pause: Temporarily disable advancing the bookmark
Examples:
Enable: Process the newest githubarchive partition
Disable: Process the entire githubarchive table
Pause: Process the previous githubarchive partition
Job bookmark example
Input table: partitioned by year / month / day / hour (for example year 2017, months 11 and 12, days 27 and 28)
Output table: written by run 1, run 2, …
Periodically run a job
Avoid reprocessing previous input
Avoid generating duplicate output
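The bookmark behavior described above can be simulated in a few lines. This is a simplified model, not how Glue stores bookmark state: it treats the bookmark as the set of partitions already processed, and shows how the enable, disable, and pause options differ. Partition names are illustrative.

```python
def run_job(partitions, bookmark, option="enable"):
    """Return (partitions to process, bookmark after the run).

    bookmark is the set of partitions handled by earlier runs.
    """
    if option == "disable":
        # Ignore the bookmark and process the entire dataset every time.
        return list(partitions), bookmark
    todo = [p for p in partitions if p not in bookmark]
    if option == "pause":
        # Process new data but do not advance the bookmark.
        return todo, bookmark
    if option == "enable":
        # Pick up where the last run left off and advance the bookmark.
        return todo, bookmark | set(todo)
    raise ValueError(f"unknown bookmark option: {option}")


bookmark = set()
# Run 1 sees one partition; run 2 sees one more.
done1, bookmark = run_job(["2017/11/27"], bookmark)
done2, bookmark = run_job(["2017/11/27", "2017/11/28"], bookmark)
print(done1, done2)  # ['2017/11/27'] ['2017/11/28']
```

When starting a real job run, the choice is passed as the `--job-bookmark-option` job argument with the value `job-bookmark-enable`, `job-bookmark-disable`, or `job-bookmark-pause`.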
Questions?
AWS Glue Configurations
Key Concepts
Virtual Private Cloud (VPC)
lets you specify an IP address range for the VPC, add subnets, associate security groups, and configure route tables.
Subnet
a range of IP addresses in your VPC.
Public Subnet: Internet access
Private Subnet: no Internet access
VPN connection
Virtual Private Gateway (VGW): the Amazon side of the connection
Customer Gateway (CGW): a physical device on your corporate network
Security Groups
control inbound and outbound traffic for your instances
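Accessing on-premises network
Detailed Architecture
AWS VPC (10.10.0.0/16)
Subnets: 10.10.10.0/24 and 10.10.11.0/24
AWS Glue ENIs: 10.10.10.x
Components: IGW, NAT-GW, VGW, Amazon S3 VPCe, Amazon RDS, JDBC Connection, VPN Tunnel to CGW, Internet
Route table (default route via NAT gateway)
Destination Target
10.10.0.0/16 local
0.0.0.0/0 NAT-GW-id
Route table (default route via Internet gateway)
Destination Target
10.10.0.0/16 local
0.0.0.0/0 IGW-id
Route table (default route via NAT gateway, on-premises route via VGW)
Destination Target
10.10.0.0/16 local
0.0.0.0/0 NAT-GW-id
172.31.0.0/16 VGW-id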
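To see how the last route table decides where Glue traffic goes, the sketch below applies longest-prefix matching to the three routes. It assumes the bare 0.0.0.0 destination on the slide means the default route 0.0.0.0/0; the sample destination addresses are made up.

```python
import ipaddress

# Route table with the VPN route, as shown on the slide
# (0.0.0.0 is read as the default route 0.0.0.0/0, which is an assumption).
ROUTES = [
    ("10.10.0.0/16", "local"),
    ("0.0.0.0/0", "NAT-GW-id"),
    ("172.31.0.0/16", "VGW-id"),
]


def next_hop(destination, routes=ROUTES):
    """Pick the target of the most specific route that contains the address."""
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if address in ipaddress.ip_network(cidr)
    ]
    network, target = max(matches, key=lambda item: item[0].prefixlen)
    return target


print(next_hop("10.10.11.20"))   # local: stays inside the VPC (for example Amazon RDS)
print(next_hop("172.31.4.9"))    # VGW-id: goes through the VPN tunnel to on-premises
print(next_hop("8.8.8.8"))       # NAT-GW-id: everything else leaves via the NAT gateway
```

This is why a JDBC connection to an on-premises database in 172.31.0.0/16 needs the VGW route: without it, the traffic would follow the default route to the NAT gateway.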
Questions?
Reference Architecture
AWS Glue
CPFI Data lake Architecture
Recent innovations
Recent AWS Glue innovations
50+ new features and regions
Merge/transition/purge
SageMaker notebooks
AWS Glue streaming
Vertical scaling
Partition Index
Pause and resume workflows
Spark UI
Crawler performance
Custom JDBC certificates
AWS Glue VPC sharing
AWS Glue 2.0
C-based libraries
MongoDB
Amazon DocumentDB
Self-managed Kafka support
AWS Glue Studio
Spark 2.4.3
AVRO support
Continuous logging
Resource tags
Python shell jobs
AWS Glue workflows
Python 3.7 on Spark
Wheel dependency
Job bookmarks
FindMatches ML transforms
AWS Glue ETL binaries
New regions: Bahrain, Sao Paulo, Milan, Hong Kong, GovCloud, Stockholm, China Regions
AWS Glue 2.0: New engine for real-time workloads
New job execution engine with a new scheduler
Fast and predictable
10x faster job start times
Predictable job latencies
Diverse workloads
Enables micro-batching
Latency-sensitive workloads
Cost effective
1-minute minimum billing
45% cost savings on average
AWS Glue Studio: New visual ETL interface
Makes it easy to author, run, and monitor AWS Glue ETL jobs
Author AWS Glue jobs visually without coding
Monitor 1000s of jobs through a single pane of glass
Distributed processing without the learning curve
Advanced transforms through code snippets
Complementary AWS Services
AWS Glue DataBrew
Visual data preparation for analytics and machine learning
Generally Available!
Amazon Managed Workflows for Apache Airflow
Highly available, secure, and managed workflow orchestration for Apache Airflow
Preview
AWS Lake Formation
Build a secure data lake in days
Build data lakes quickly
Move, store, catalog, and clean your data faster
Transform to open formats like Parquet and ORC
ML-based deduplication and record matching
Simplify security management
Centrally define security, governance and auditing policies
Enforce policies consistently across multiple services
Integrates with IAM and KMS
Provide self-service access to data
Build a data catalog that describes your data
Enable analysts and data scientists to easily find relevant data
Analyze with multiple analytics services without moving data
AWS API
Boto3 for Python
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/index.html
Examples:
Upload files to S3
Download files from S3
Run a Glue Job
Run a Workflow
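The four examples listed above could look roughly like this with Boto3. The wrappers take the client as a parameter so they can be tried with a stand-in object; the bucket, key, job, and workflow names in the usage comments are hypothetical.

```python
def upload_file(s3, local_path, bucket, key):
    """Upload a local file to s3://bucket/key."""
    s3.upload_file(local_path, bucket, key)


def download_file(s3, bucket, key, local_path):
    """Download s3://bucket/key to a local file."""
    s3.download_file(bucket, key, local_path)


def run_glue_job(glue, job_name, arguments=None):
    """Start a Glue job run and return its run ID."""
    response = glue.start_job_run(JobName=job_name, Arguments=arguments or {})
    return response["JobRunId"]


def run_workflow(glue, workflow_name):
    """Start a Glue workflow run and return its run ID."""
    response = glue.start_workflow_run(Name=workflow_name)
    return response["RunId"]


# Usage with real clients (requires boto3, credentials, and a default region):
#   import boto3
#   s3 = boto3.client("s3")
#   glue = boto3.client("glue")
#   upload_file(s3, "pos.csv", "my-data-lake", "raw/pos/pos.csv")
#   download_file(s3, "my-data-lake", "curated/pos/part-0000.parquet", "part-0000.parquet")
#   run_glue_job(glue, "clean-pos", {"--run_date": "2020-11-28"})
#   run_workflow(glue, "pos-daily")
```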
Thank you!
Kyle Escosia
kescosia@info-alchemy.net
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
Ad

AWS Glue Technical Enablement Training

  • 1. AWS Glue Technical Enablement Training Kyle Escosia Jr. Data Science Specialist Info Alchemy
  • 2. Agenda AWS Glue Overview AWS Glue Concepts AWS Glue Deep Dive Components AWS Glue Configurations (VPC, Security Groups, VPN, etc.) Reference Architectures Recent innovations Complementary AWS Services (DataBrew, Lake Formation, AWS API)
  • 3. Data at scale Growing exponentially From new sources Increasingly diverse Used by many people Analyzed by many applications
  • 4. Why data preparation? Data preparation is the first mile of Analytics Business Intelligence Machine Learning
  • 5. Data preparation is hard Lots of data! Infrastructure management Data grows fast 10x every 5 years Data is more diverse Most jobs hand-coded Brittle and error prone Machine / instance sizing Cluster lifecycle management Scheduling and monitoring Managing metastores Needs customization
  • 6. AWS Glue has evolved Then Now Fully Managed extract-transform-load (ETL) Service For developers, built by developers Serverless data preparation service ETL developers, data engineers, data scientists, business analysts, and more
  • 8. Amazon S3 data lake storage Building data lakes Break silos, store data in Amazon S3 AWS Glue jobs and workflows to ingest, process, and refine data in stages Access data lakes via a variety of cloud analytic engines Amazon RDS Other databases On-premises data Streaming data AWS Glue crawlers load and maintain the Data Catalog AWS Lake Formation permissions to secure the data lake
  • 10. AWS Glue Fully managed, serverless ETL service for developers and data scientists
  • 11. Serverless review No infrastructure provisioning, no management Automatic scaling Pay for value Highly available and secure
  • 12. Easily de-duplicate your data with ML transforms
  • 13. AWS Glue Usage and Pricing. With AWS Glue, you only pay for the time your ETL job takes to run. ETL Jobs: No resources to manage. Charged hourly based on Data Processing Units (DPUs): $0.44 per DPU-hour; one DPU provides 4 vCPU and 16 GB of memory. Three types: Apache Spark, Python Shell, Spark Streaming. Data Catalog: Free for the first million objects stored (table, table version, partition, or database); $1.00 per 100,000 objects stored above 1M, per month. Crawlers: Charged hourly based on Data Processing Units (DPUs): $0.44 per DPU-hour, billed per second, with a 10-minute minimum per crawler run.
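The crawler and Data Catalog rates on the pricing slide can be turned into a quick estimate. The following is a minimal sketch using only the figures quoted on the slide; the function names are hypothetical, and pro-rating the catalog charge within each 100,000-object block is an assumption, not something the slide states.

```python
# Rates as quoted on the pricing slide; current AWS prices may differ.
DPU_HOUR_RATE = 0.44              # USD per DPU-hour
CRAWLER_MIN_SECONDS = 10 * 60     # 10-minute minimum per crawler run
FREE_CATALOG_OBJECTS = 1_000_000  # first million objects are free
CATALOG_RATE_PER_100K = 1.00      # USD per 100,000 objects above 1M, per month


def crawler_run_cost(dpus: float, run_seconds: int) -> float:
    """Cost of one crawler run: billed per second with a 10-minute floor."""
    billed_seconds = max(run_seconds, CRAWLER_MIN_SECONDS)
    return round(dpus * billed_seconds / 3600 * DPU_HOUR_RATE, 4)


def catalog_storage_cost(objects_stored: int) -> float:
    """Monthly Data Catalog storage cost.

    Assumption: the charge is pro-rated rather than rounded up to the
    next full block of 100,000 objects.
    """
    billable = max(objects_stored - FREE_CATALOG_OBJECTS, 0)
    return round(billable / 100_000 * CATALOG_RATE_PER_100K, 2)


if __name__ == "__main__":
    # A 2-minute crawl on 2 DPUs is still billed as 10 minutes.
    print(crawler_run_cost(dpus=2, run_seconds=120))
    print(catalog_storage_cost(1_500_000))
```

The 10-minute floor means very short crawler runs cost the same as a 10-minute run, which is one reason to avoid scheduling many tiny crawls.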
  • 14. AWS Glue Deep Dive Components
  • 15. Security: IAM Permissions – A refresher IAM Users: consist of a username and a password. IAM Groups: a collection of users. IAM Role: an identity used to delegate access to AWS resources. IAM Service Role: a role that a service assumes to perform actions in your account on your behalf. IAM Policy: an entity that, when attached to an identity, defines its permissions.
  • 16. AWS Glue Permissions Follow the least privilege access principle Requires an IAM Role AWS Managed Policy: AWSGlueServiceRole Custom Policy – fine-grained access Some related services Amazon S3, Amazon Redshift, Amazon CloudWatch
  • 17. AWS Glue Components Crawlers Load and maintain Data Catalog Infer metadata: schema, table structure Supports schema evolution AWS Glue Data Catalog Apache Hive Metastore compatible Many integrated analytic services Extract, transform, and load Serverless execution Apache Spark / Python shell jobs Interactive development Auto-generate ETL code Orchestrate triggers, crawlers, and jobs Build and monitor complex flows Reliable execution Workflow management
  • 18. AWS Glue is used to cleanse, prep, and catalog AWS Glue Data Catalog Workflows orchestrate data flows Process data in stages Crawlers populate/maintain catalog Jobs execute ETL transforms
  • 19. What are crawlers? Automatically discover new data and extract schema definitions detect schema changes and maintain tables detect Apache Hive style partitions on Amazon S3 Built-in classifiers for popular data types create your own custom classifier using Grok expressions Run on demand, on a schedule, or as parts of workflows
  • 20. Crawlers discover structure Handles complex, nested fields Detects Hive-style partitions
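To make "Hive-style partitions" concrete: partition columns are encoded in the object key as `name=value` folder segments. The sketch below is only an illustration of that naming convention, not how the crawler is implemented; the function name and the example key are hypothetical.

```python
def parse_hive_partitions(s3_key: str) -> dict:
    """Extract name=value partition segments from an S3 object key.

    The last path segment is treated as the file name and skipped, so a
    file called 'a=b.csv' is not mistaken for a partition.
    """
    partitions = {}
    folders = s3_key.split("/")[:-1]
    for segment in folders:
        name, sep, value = segment.partition("=")
        if sep and name:
            partitions[name] = value
    return partitions


if __name__ == "__main__":
    # Hypothetical key laid out in Hive style.
    key = "sales/year=2017/month=11/day=27/part-0000.parquet"
    print(parse_hive_partitions(key))
```

A crawler that detects this layout exposes `year`, `month`, and `day` as partition columns of the table, so query engines can skip folders that a filter rules out.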
  • 21. What can crawlers classify?
  • 22. Use exclude patterns to remove unnecessary files Example: to ignore all METADATA.txt files under the folder year=2017 for the location s3://mydatasets, use the exclude pattern year=2017/**/METADATA.txt
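The effect of an exclude pattern such as `year=2017/**/METADATA.txt` can be approximated locally. This is a rough sketch that assumes glob semantics where `*` stays within one folder level and `**` crosses folder boundaries; it is not the crawler's own matcher, and the helper names are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple glob into a regex.

    Assumed semantics: '**' matches across '/', '*' and '?' do not.
    Character classes and brace groups are not handled in this sketch.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key: str, exclude_patterns: list) -> bool:
    """True if the key (relative to the include path) matches any pattern."""
    return any(glob_to_regex(p).fullmatch(relative_key) for p in exclude_patterns)


if __name__ == "__main__":
    patterns = ["year=2017/**/METADATA.txt"]
    for key in ("year=2017/month=01/METADATA.txt",
                "year=2017/month=01/data.csv",
                "year=2018/month=01/METADATA.txt"):
        print(key, is_excluded(key, patterns))
```

Keys are matched relative to the include path (here `s3://mydatasets`), so the bucket prefix is not part of the pattern.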
  • 23. Improve performance with multiple crawlers Periodically audit long running crawlers to balance workloads Often crawlers are processing multiple datasets / tables Improve performance by using multiple crawlers Crawler granularity is table or dataset
  • 24. What is an AWS Glue job? An AWS Glue job encapsulates the business logic that performs extract, transform, and load (ETL) work • A core building block in your production ETL pipeline • Provide your PySpark ETL script or have one automatically generated • Supports a rich set of built-in AWS Glue transformations • Jobs can be started, stopped, monitored
  • 25. Under the hood: Apache Spark and AWS Glue ETL • Apache Spark is a distributed data processing engine with rich support for complex analytics • AWS Glue builds on the Apache Spark runtime to offer ETL-specific functionality SparkSQL AWS Glue ETL Spark DataFrames AWS Glue DynamicFrames Spark Core: RDDs
  • 26. Apache Spark – What is it? HDFS YARN MapReduce Spark Cassandra NoSQL Mesos Tez Distributed Storage Layer Cluster Resource Management Processing Framework Layer
  • 27. Let’s try that again... Think of a beehive as your distributed storage. A beehive needs a queen. The queen serves as your Spark driver. The worker bees serve as your worker nodes.
  • 28. Putting it together.. Generates the Spark Context Main Method Access to the Resource Manager Spark Driver Resource Manager Executor Cache Executor Cache Executor Cache Executor Cache The Queen The Worker Bees
  • 29. DataFrames and DynamicFrames DataFrames Core data structure for SparkSQL Like structured tables Need schema upfront Each row has same structure Suited for SQL-like analytics DynamicFrames Like DataFrames for ETL Designed for processing semi-structured data, e.g., JSON, Avro, Apache logs
  • 30. Dynamic Frame internals Schema per record, no upfront schema needed Easy to restructure, tag, modify Can be more compact than DataFrame rows Many flows can be done in single pass Dynamic records: {"id":"2489", "type":"CreateEvent", "payload": {"creator":…}, …} {"id":4391, "type":"PullEvent", "payload": {"assets":…}, …} {"id":"6510", "type":"PushEvent", "payload": {"pusher":…}, …} Dynamic Frame schema: id, type
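The three sample records on the slide carry `id` as a string in two records and as a number in one, which is exactly the case a per-record schema tolerates. The plain-Python sketch below imitates the idea by collecting, for each field, every type seen across records. It is an illustration only, not Glue's implementation; the function name is hypothetical and the payload values are placeholders because the slide elides them.

```python
from collections import defaultdict


def infer_frame_schema(records: list) -> dict:
    """Union the per-record schemas: field name -> sorted list of type names seen."""
    field_types = defaultdict(set)
    for record in records:
        for field, value in record.items():
            field_types[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in field_types.items()}


# Records modelled on the slide; "..." stands in for the elided payload contents.
records = [
    {"id": "2489", "type": "CreateEvent", "payload": {"creator": "..."}},
    {"id": 4391, "type": "PullEvent", "payload": {"assets": "..."}},
    {"id": "6510", "type": "PushEvent", "payload": {"pusher": "..."}},
]

if __name__ == "__main__":
    for field, types in infer_frame_schema(records).items():
        print(field, types)
```

In a real job, a field that shows up with more than one type surfaces as a choice type on the DynamicFrame, which you would typically settle with `resolveChoice` before writing to a sink that needs a single type per column.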
  • 31. AWS Glue execution model: jobs and stages Filter Read Read Stage 1 Repartition Write Stage 2 Job 1 Stage 1 Job 2 Apply Mapping Filter Show Apply Mapping
  • 32. AWS Glue execution model: jobs and stages Filter Read Repartition Write Read Job 1 Stage 1 Stage 2 Stage 1 Job 2 Apply Mapping Filter Show Apply Mapping Actions
  • 33. AWS Glue execution model: jobs and stages Filter Read Read Job 1 Stage 1 Repartition Write Stage 2 Stage 1 Job 2 Apply Mapping Filter Show Apply Mapping Jobs
  • 34. AWS Glue execution model: data partitions • Apache Spark and AWS Glue are data parallel. • Data is divided into partitions that are processed concurrently. • 1 stage x 1 partition = 1 task Driver Executors Overall throughput is limited by the number of partitions
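The rule "1 stage x 1 partition = 1 task" can be worked through with simple arithmetic. The sketch below assumes each executor core runs one task at a time (the usual Spark default); the function names and the example cluster sizes are hypothetical.

```python
import math


def task_count(stages: int, partitions: int) -> int:
    """Each stage runs one task per partition."""
    return stages * partitions


def task_slots(executors: int, cores_per_executor: int) -> int:
    """Tasks that can run at once, assuming one task per core."""
    return executors * cores_per_executor


def waves_per_stage(partitions: int, executors: int, cores_per_executor: int) -> int:
    """Rounds of tasks needed to finish one stage."""
    return math.ceil(partitions / task_slots(executors, cores_per_executor))


def busy_slots(partitions: int, executors: int, cores_per_executor: int) -> int:
    """Slots that actually have work: never more than the number of partitions."""
    return min(partitions, task_slots(executors, cores_per_executor))


if __name__ == "__main__":
    # 100 partitions on 10 executors x 4 cores: 40 slots, so 3 waves per stage.
    print(task_count(2, 100), waves_per_stage(100, 10, 4))
    # Only 8 partitions on the same cluster: 32 of the 40 slots sit idle.
    print(busy_slots(8, 10, 4))
```

The second case shows why throughput is limited by the number of partitions: adding workers does not help once there are fewer partitions than task slots.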
  • 35. Performance best practices • Avoid unnecessary jobs and stages where possible • Ensure your data can be partitioned to utilize the entire cluster • Identify resource bottlenecks and pick the best worker type
  • 36. Performance best practices • Avoid unnecessary jobs and stages where possible • Ensure your data can be partitioned to utilize the entire cluster • Identify resource bottlenecks and pick the best worker type Jobs Filter Read Job 1 Stage 1 Repartition Write Stage 2 Apply Mapping Read Filter Apply Mapping Job 2 Show
  • 37. Performance best practices • Avoid unnecessary jobs and stages where possible • Ensure your data can be partitioned to utilize the entire cluster • Identify resource bottlenecks and pick the best worker type
  • 38. File formats • Text – xSV, JSON • May or may not be compressed • Human readable when uncompressed • Not optimized for analytics • Columnar – Parquet & ORC • Compressed in a binary format • Integrated indexes and stats • Optimized read performance when selecting only a subset of columns • Row – Avro • Compressed in a binary format • Optimized read performance when selecting all columns of a subset of rows
  • 39. Partitioning guidance • Choose columns that have low cardinality (uniqueness) • Partitioning on day/month/year has 365 unique values per year • Partitioning on seconds has millions of values per year • You can partition on any column, not just date • For example, s3://abc-corp-sales-data/country=xx/state=xx/bu=xx • Look at your query patterns – what data do you want to query, and what do you want to filter out?
  • 40. Performance best practices • Avoid unnecessary jobs and stages where possible • Ensure your data can be partitioned to utilize the entire cluster • Identify resource bottlenecks and pick the best worker type
  • 41. Standard Provide the maximum capacity of DPUs (max. 100) 4 vCPUs of compute capacity and 16 GB of memory, 50 GB disk and 2 executors G.1X Provide the number of workers (max. 299) A Worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk) and 1 executor per worker Recommended for memory-intensive jobs G.2X Provide the number of workers (max. 149) A Worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB disk) and 1 executor per worker Recommended for memory-intensive jobs that run ML Transforms Worker Types
  • 42. Performance best practices • Avoid unnecessary jobs and stages where possible • Ensure your data can be partitioned to utilize the entire cluster • Identify resource bottlenecks and pick the best worker type • Use G.1X and G.2X instances when your jobs need lots of memory • Executor memory issues happen most often during sort and shuffle operations • The driver most often runs out of memory when processing a very large number of input partitions
  • 43. What is an AWS Glue trigger? Triggers are the “glue” in your AWS Glue ETL pipeline Triggers • Can be used to chain multiple AWS Glue jobs in a series • Can start multiple jobs at once • Can be scheduled, on-demand, or based on job events • Can pass unique parameters to customize AWS Glue job runs
  • 44. Three ways to set up an AWS Glue ETL pipeline • Schedule-driven • Event-driven • State machine–driven
  • 45. Schedule-driven AWS Glue ETL pipeline We work our way backward from a daily SLA deadline
  • 46. Event-driven AWS Glue ETL pipeline Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
  • 47. Example ETL flow Goal: prepare and analyze POS data Create and run a job that will • Consume data in S3 • Join the data • Select only the required columns • Fill null values • Write the results to a data lake on Amazon Simple Storage Service (Amazon S3) Monitor the running job Analyze the resulting dataset
  • 48. What are workflows and how do they work? DAGs with triggers, jobs, and crawlers Graphical canvas for authoring workflows Run / rerun and monitor workflow executions Share parameters across entities in the workflow
  • 50. Building workflows Build workflows with: Graphical canvas APIs AWS CloudFormation templates
  • 51. Monitoring workflows Easily monitor / see: workflows running now completed workflows status / errors
  • 52. Incremental data processing with job bookmarks Track previously processed data Enable | disable | pause bookmarks on sources Rollback to a previous state if necessary
  • 53. Incremental data processing with job bookmarks Bookmarks are per-job checkpoints that track the work done in previous runs. They persist the state of sources, transforms, and sinks on each run. Example uses: Process POS data files daily Process log files hourly Track timestamps or primary keys in DBs Track generated foreign keys for normalization run 1 run 2 run 3
  • 54. Job bookmark options Option Behavior Enable Pick up from where you left off Disable Ignore and process the entire dataset every time Pause Temporarily disable advancing the bookmark run 1 run 2 enable disable pause run 3 Examples: Enable: Process the newest githubarchive partition Disable: Process the entire githubarchive table Pause: Process the previous githubarchive partition
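The three bookmark options can be modelled with a small simulation. This is a toy sketch, not the Glue bookmark mechanism: the bookmark is reduced to a count of input partitions already processed, the function name is hypothetical, and leaving the bookmark untouched under "disable" is an assumption made for the example.

```python
def run_job(partitions: list, bookmark: int, option: str) -> tuple:
    """Simulate one job run.

    Returns (partitions processed in this run, bookmark after the run).
    'bookmark' is the number of partitions earlier runs have already handled.
    """
    if option == "enable":
        # Pick up where the last run left off, then advance the bookmark.
        return partitions[bookmark:], len(partitions)
    if option == "disable":
        # Ignore the bookmark and reprocess everything (bookmark left as-is: assumption).
        return list(partitions), bookmark
    if option == "pause":
        # Process the new data but do not advance the bookmark.
        return partitions[bookmark:], bookmark
    raise ValueError(f"unknown bookmark option: {option}")


if __name__ == "__main__":
    table = ["day=26", "day=27"]
    done, mark = run_job(table, 0, "enable")       # run 1: both partitions
    table.append("day=28")                         # new data arrives
    print(run_job(table, mark, "enable"))          # only day=28, bookmark moves to 3
    print(run_job(table, mark, "pause"))           # only day=28, bookmark stays at 2
    print(run_job(table, mark, "disable"))         # all three partitions
```

With "enable", repeated runs neither reprocess earlier input nor write duplicate output; "pause" is useful for re-running the latest increment while debugging.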
  • 55. Job bookmark example year … … 2017 11 12 28 month day 27 hour … year … … 2017 11 12 28 month day 27 hour … Input table … … run 1 run 2 … Output table Periodically run a job avoid reprocessing previous input avoid generating duplicate output
  • 57. Agenda AWS Glue Overview AWS Glue Concepts AWS Glue Deep Dive Components AWS Glue Configurations (VPC, Security Groups, VPN, etc.) Reference Architectures Recent innovations Complementary AWS Services (DataBrew, Lake Formation, AWS API)
  • 59. Key Concepts Virtual Private Cloud (VPC): allows you to specify an IP address range for the VPC, add subnets, associate security groups, and configure route tables. Subnet: a range of IP addresses in your VPC. Public Subnet: Internet. Private Subnet: No Internet. VPN connection: Virtual Private Gateway (VGW) on the Amazon side; Customer Gateway (CGW), a physical device on your corporate network. Security Groups: control inbound and outbound traffic for your instances.
  • 61. 10.10.10.0/24 Detailed Architecture AWS VPC (10.10.0.0/16) 10.10.11.0/24 NAT-GW IGW AWS Glue ENIs: 10.10.10.x Amazon RDS VGW Amazon S3 VPCe VPN Tunnel CGW Destination Target 10.10.0.0/16 local 0.0.0.0 NAT-GW-id Destination Target 10.10.0.0/16 local 0.0.0.0 IGW-id JDBC Connection Internet Destination Target 10.10.0.0/16 local 0.0.0.0 NAT-GW-id 172.31.0.0/16 VGW-id
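How a route table in this architecture picks a target can be sketched with Python's standard `ipaddress` module. The example uses the three-entry table from the slide (local, NAT gateway, VGW) and assumes that the bare `0.0.0.0` entries denote the default route `0.0.0.0/0` and that this table belongs to the subnet holding the AWS Glue ENIs. The function name and the sample destination addresses are hypothetical.

```python
import ipaddress

# Route table as printed on the slide; "0.0.0.0" is read as 0.0.0.0/0 (assumption).
GLUE_SUBNET_ROUTES = [
    ("10.10.0.0/16", "local"),
    ("0.0.0.0/0", "NAT-GW-id"),
    ("172.31.0.0/16", "VGW-id"),
]


def resolve_target(destination_ip: str, routes=GLUE_SUBNET_ROUTES) -> str:
    """Return the target of the most specific (longest-prefix) matching route."""
    address = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if address in network:
            matches.append((network.prefixlen, target))
    if not matches:
        raise LookupError(f"no route to {destination_ip}")
    return max(matches)[1]


if __name__ == "__main__":
    print(resolve_target("10.10.11.25"))   # inside the VPC
    print(resolve_target("172.31.4.20"))   # on-premises range, via the VPN
    print(resolve_target("52.94.0.10"))    # anything else, via the NAT gateway
```

Traffic to the on-premises range takes the VGW route rather than the NAT gateway because `/16` is more specific than the default route.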
  • 63. Agenda AWS Glue Overview AWS Glue Concepts AWS Glue Deep Dive Components AWS Glue Configurations (VPC, Security Groups, VPN, etc.) Reference Architectures Recent innovations Complementary AWS Services (DataBrew, Lake Formation, AWS API)
  • 65. CPFI Data lake Architecture
  • 66. Agenda AWS Glue Overview AWS Glue Concepts AWS Glue Deep Dive Components AWS Glue Configurations (VPC, Security Groups, VPN, etc.) Reference Architectures Recent innovations Complementary AWS Services (DataBrew, Lake Formation, AWS API)
  • 68. Recent AWS Glue innovations 50+ new features and regions: Merge / transition / purge, SageMaker notebooks, AWS Glue streaming, Vertical scaling, Partition Index, Pause and resume workflows, Bahrain, Spark UI, Crawler performance, Sao Paulo, Custom JDBC certificates, Milan, AWS Glue VPC sharing, AWS Glue 2.0, C-based libraries, MongoDB, Amazon DocumentDB, Self-managed Kafka support, AWS Glue Studio, Spark 2.4.3, AVRO support, Continuous logging, Hong Kong, Resource tags, Python shell jobs, GovCloud, AWS Glue workflows, Python 3.7 on Spark, Stockholm, Wheel dependency, Job bookmarks, FindMatches ML transforms, China Regions, AWS Glue ETL binaries
  • 69. AWS Glue 2.0: New engine for real-time workloads Cost effective New job execution engine with a new scheduler 10x faster job start times Predictable job latencies Enables micro-batching Latency-sensitive workloads Fast and predictable Diverse workloads 1-minute minimum billing 45% cost savings on average
  • 70. AWS Glue Studio: New visual ETL interface Makes it easy to author, run, and monitor AWS Glue ETL jobs Author AWS Glue jobs visually without coding Monitor 1000s of jobs through a single pane of glass Distributed processing without the learning curve Advanced transforms through code snippets
  • 71. Agenda AWS Glue Overview AWS Glue Concepts AWS Glue Deep Dive Components AWS Glue Configurations (VPC, Security Groups, VPN, etc.) Reference Architectures Recent innovations Complementary AWS Services
  • 73. AWS Glue DataBrew Visual data preparation for analytics and machine learning Generally Available!
  • 74. Amazon Managed Workflows for Apache Airflow Highly available, secure, and managed workflow orchestration for Apache Airflow Preview
  • 75. AWS Lake Formation Build a secure data lake in days Simplify security management Centrally define security, governance and auditing policies Enforce policies consistently across multiple services Integrates with IAM and KMS Provide self-service access to data Build a data catalog that describes your data Enable analysts and data scientists to easily find relevant data Analyze with multiple analytics services without moving data Build data lakes quickly Move, store, catalog, and clean your data faster Transform to open formats like Parquet and ORC ML-based deduplication and record matching
  • 76. AWS API Boto3 for Python https://boto3.amazonaws.com/v1/documentation/api/latest/guide/index.html Examples: Upload files to S3 Download files from S3 Run a Glue Job Run a Workflow
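The "Run a Glue Job" and "Run a Workflow" examples might look like the sketch below. It uses the Boto3 Glue client calls `start_job_run` and `start_workflow_run`; the wrapper function names, the job and workflow names, the region, and the job argument are all hypothetical, and running it for real requires AWS credentials and existing Glue resources.

```python
def make_glue_client(region_name: str):
    """Create a Boto3 Glue client (imported lazily so the helpers below
    can be exercised without Boto3 installed)."""
    import boto3
    return boto3.client("glue", region_name=region_name)


def start_glue_job(glue_client, job_name: str, arguments: dict = None) -> str:
    """Start a Glue job run and return its JobRunId."""
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments=arguments or {},  # job parameters, keys prefixed with "--"
    )
    return response["JobRunId"]


def start_glue_workflow(glue_client, workflow_name: str) -> str:
    """Start a Glue workflow run and return its RunId."""
    response = glue_client.start_workflow_run(Name=workflow_name)
    return response["RunId"]


# Example usage (hypothetical names; uncomment with valid credentials):
# glue = make_glue_client("ap-southeast-1")
# run_id = start_glue_job(glue, "pos-daily-etl", {"--process_date": "2017-11-28"})
# wf_run_id = start_glue_workflow(glue, "pos-daily-workflow")
```

Passing the client in as a parameter keeps the helpers easy to test with a stub and lets the same code run against different regions or accounts.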