58076778-Node Javier Ramirez - AWS PDF
on AWS
Javier Ramirez
AWS Tech Evangelist
@supercoco9
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
To Become a Leader, Data is Your Differentiator
Leaders Followers
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
• Streaming is hard
• Duplicating batch/stream is inefficient
• My schemas have evolved
• I need to cleanse my source data
• I cannot query old and new data together
• My data doesn’t fit in one machine
• My data is very fast
• Hadoop ecosystem is hard to manage
• My data scientists don’t like JAVA
• My cluster is running old versions. Upgrading is hard
• My reports make my database server very slow
• And it’s not only transactional
• Map/Reduce is hard to use
• I am not sure which data we are already processing
• I want to use ML
• My team spends more time maintaining the cluster than adding functionality
Before 2009: The DBA years
2009-2011: The Hadoop epiphany
2012-2014: The Message Broker and NoSQL Age
2015-2017: The Spark kingdom and the spreadsheet wars
2017-2018: The myth of DataOps
Some simple things that scare me (and eat my productivity)
• Text encodings
• Date and time formats: which date would you say this is 1/4/19? And this? 1553589297
• The same JSON file when row 176.543 has a column never seen before
• The same JSON file when all the numbers are strings
• XML
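A minimal Python sketch of the date and JSON pitfalls listed above. The sample rows, the `coupon` column, and the `coerce` helper are hypothetical, and reading 1553589297 as Unix epoch seconds in UTC is an assumption, not something the value itself tells you.

```python
import json
from datetime import datetime, timezone

# "1/4/19" parses without error under both conventions,
# so the format has to be agreed on before ingestion.
as_us = datetime.strptime("1/4/19", "%m/%d/%y").date()  # month/day/year
as_eu = datetime.strptime("1/4/19", "%d/%m/%y").date()  # day/month/year

# Assumption: the integer is Unix epoch seconds, interpreted in UTC.
as_epoch = datetime.fromtimestamp(1553589297, tz=timezone.utc)

# Hypothetical JSON lines: every number is a string, and the second row
# introduces a column ("coupon") that the first row never had.
rows = [json.loads(line) for line in (
    '{"id": "1", "price": "9.99"}',
    '{"id": "2", "price": "12.50", "coupon": "SPRING"}',
)]


def coerce(value):
    """Turn numeric-looking strings into int or float; leave the rest alone."""
    if isinstance(value, str):
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                pass
    return value


# Use the union of all keys so a late-appearing column does not break
# earlier rows; missing values become None.
columns = sorted({key for row in rows for key in row})
clean = [{col: coerce(row.get(col)) for col in columns} for row in rows]
```

The same string yields January 4 or April 1 depending on the convention, which is why the format needs to travel with the data.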
The downfall of the data engineer
… memory. This leads to systemic, stupid errors that waste hours.”
Maxime Beauchemin, Data engineer extraordinaire at Lyft, creator of Apache Airflow and Apache Superset.
Ex-Facebook, Ex-Yahoo!, Ex-Airbnb
https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b
Solution
More data lakes & analytics on AWS than anywhere else
A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale
Data Lakes, Analytics, and ML Portfolio from AWS
Broadest, deepest set of analytic services
Machine Learning
• Amazon SageMaker
• AWS Deep Learning AMIs
• Amazon Rekognition
• Amazon Lex
• AWS DeepLens
• Amazon Comprehend
• Amazon Translate
• Amazon Transcribe
• Amazon Polly
Analytics
• Amazon Athena
• Amazon EMR
• Amazon Redshift
• Amazon Elasticsearch Service
• Amazon Kinesis
• Amazon QuickSight
On-premises Data Movement
• AWS Direct Connect
• AWS Snowball
• AWS Snowmobile
• AWS Database Migration Service
• AWS Storage Gateway
Real-time Data Movement
• AWS IoT Core
• Amazon Kinesis Data Firehose
• Amazon Kinesis Data Streams
• Amazon Kinesis Video Streams
Data Movement From On-premises Datacenters
• AWS Direct Connect: establish a dedicated network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections
• AWS Storage Gateway: lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache
• AWS Database Migration Service: migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications
• AWS Snowball and AWS Snowmobile: petabyte- and exabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud
Data Movement From Real-time Sources
• AWS IoT Core: supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
• Amazon Kinesis Data Firehose: capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
• Amazon Kinesis Data Streams: build custom, real-time applications that process data streams using popular stream processing frameworks
• Amazon Kinesis Video Streams: securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
• Fully managed open-source platform for building real-time streaming data pipelines and applications
Amazon S3—Object Storage
• Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
• Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of the data, improving analytics performance by up to 400%
• Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
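To make the S3 Select point concrete, here is a hedged Python sketch that only builds the request parameters for a subset query. The helper name `s3_select_request`, the bucket, key, and column names are hypothetical. The parameter names follow the `select_object_content` call of the boto3 S3 client, which is not invoked here so that the sketch runs without AWS credentials.

```python
def s3_select_request(bucket, key, columns, where):
    """Build keyword arguments for the boto3 S3 client's select_object_content.

    Only the requested columns of the rows matching `where` are returned,
    instead of the whole object.
    """
    projection = ", ".join(f"s.{column}" for column in columns)
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": f"SELECT {projection} FROM S3Object s WHERE {where}",
        # Assumption: the object is a gzip-compressed CSV with a header row.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"CSV": {}},
    }


# Hypothetical bucket, key, and columns, for illustration only.
request = s3_select_request(
    bucket="example-data-lake",
    key="events/2019/03/26/events.csv.gz",
    columns=["player_id", "score"],
    where="s.region = 'eu'",
)
# With credentials configured, this could be passed on as:
#   boto3.client("s3").select_object_content(**request)
```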
Unmatched Durability and Availability
Scalable and durable
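As a rough worked example of what eleven nines means, the Python sketch below treats 99.999999999% as the yearly probability that a single object survives. That reading and the object count of 10,000,000 are assumptions made for illustration.

```python
from decimal import Decimal

# Eleven nines, read as the yearly survival probability of one object (assumption).
DURABILITY = Decimal("0.99999999999")
ANNUAL_LOSS_RATE = 1 - DURABILITY  # 1E-11 per object per year


def expected_annual_losses(object_count):
    """Expected number of objects lost per year at the assumed loss rate."""
    return Decimal(object_count) * ANNUAL_LOSS_RATE


losses = expected_annual_losses(10_000_000)  # objects lost per year, on average
years_per_lost_object = 1 / losses           # average years between single losses
```

Under these assumptions, storing ten million objects gives an expected loss of 0.0001 objects per year, or one object every 10,000 years on average.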
Any Scale
Scalable and durable
Amazon Glacier—Backup and Archive
• Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
• Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements
• Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost
Data Preparation Accounts for ~80% of the Work
Refining algorithms
Other
Storing is Not Enough, Data Needs to Be Discoverable
… direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data
Data sources: CRM, ERP, data warehouse, mainframe data, web, social, log files, machine data, semi-structured, unstructured
AWS Glue—Data Catalog
Make data discoverable
Glue Data Catalog: discover data and extract schema
• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
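The idea of discovering data and storing its schema can be illustrated with a toy Python sketch. It is not the AWS Glue algorithm or API: `infer_schema`, `to_table_definition`, the type mapping, the sample records, and the shape of the resulting table definition are all hypothetical.

```python
# Illustrative mapping from Python type names to catalog-style column types.
TYPE_NAMES = {"int": "bigint", "float": "double", "str": "string", "bool": "boolean"}


def infer_schema(records):
    """Map each field name to the set of Python type names observed for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def to_table_definition(name, schema):
    """Turn the observed types into one column type per field.

    Fields seen with more than one type fall back to "string".
    """
    columns = []
    for field in sorted(schema):
        seen = schema[field]
        if len(seen) == 1:
            column_type = TYPE_NAMES.get(next(iter(seen)), "string")
        else:
            column_type = "string"
        columns.append({"Name": field, "Type": column_type})
    return {"Name": name, "Columns": columns}


# Hypothetical records: "score" changes type and "region" appears late.
records = [
    {"player_id": 1, "score": 10.5},
    {"player_id": 2, "score": "n/a", "region": "eu"},
]
table = to_table_definition("matches", infer_schema(records))
```

Once a table definition like this exists, queries and ETL jobs can refer to the table by name instead of re-reading the raw files to learn their layout.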
Compliance
AWS Glue Crawlers
AWS Glue—ETL Service
Make ETL scripting and deployment easy
• Serverless
Amazon EMR—Big Data Processing
• Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning
• Flexible billing with per-second billing, EC2 spot, reserved instances and auto-scaling to reduce costs 50–80%
• Process data directly in the S3 data lake securely with high performance using the EMRFS connector
• Updated with the latest open source frameworks within 30 days of release
Amazon EMR—More than just managed Hadoop
Amazon Redshift—Data Warehousing
• Columnar storage technology to improve I/O efficiency and scale query performance
• Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
• Audit everything; encrypt data end-to-end; extensive certification and compliance
• As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in S3 data lake
• Exabyte Redshift SQL queries against S3
• Stable query performance and unlimited concurrency
Diagram labels: Redshift Spectrum query engine, Redshift data, S3 data lake
Let’s play a game
https://youtu.be/RpPf38L0HHU?t=3963
Numbers are fun
https://youtu.be/RpPf38L0HHU?t=3963
Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
• Zero setup cost; just point to S3 and start querying
• Pay only for queries run; save 30–90% on per-query costs through compression
• ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
• Serverless: zero infrastructure, zero administration; integrated with QuickSight
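The compression saving follows from paying per query for the data scanned. A minimal Python sketch, assuming cost is strictly proportional to bytes scanned and using a hypothetical 4:1 compression ratio:

```python
def scan_cost_savings(raw_bytes, compressed_bytes):
    """Fraction of per-query cost saved when cost is proportional to bytes scanned."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return 1 - compressed_bytes / raw_bytes


GIB = 1024 ** 3
# Hypothetical dataset: 100 GiB as plain text, 25 GiB after compression.
savings = scan_cost_savings(raw_bytes=100 * GIB, compressed_bytes=25 * GIB)
```

Because the price per byte cancels out, the saved fraction depends only on the compression ratio; the hypothetical 4:1 ratio gives 75%, which sits inside the 30–90% range quoted above.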
Amazon QuickSight
easy
Empower everyone
Seamless connectivity
Data Lakes from AWS
Secure
Data Lake
on AWS
Cost-effective
On-premises Data Movement
Real-time Data Movement
AWS Provides Highest Levels of Security
Secure
Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake
• SOC 2: Security, Availability, & Confidentiality Report
• SOC 3: General Controls Report
• GxP: Quality Guidelines and Regulations
• HIPAA: Protected Health Information
• IRAP [Australia]: Australian Security Standards
• IT-Grundschutz [Germany]: Baseline Protection Methodology
• K-ISMS [Korea]: Korean Information Security
AWS Runs the Largest Global Cloud Infrastructure
Scalable and durable
Chart: customer data over time
… this Magic Quadrant.”
- Gartner Magic Quadrant for Public Cloud Storage Services, Worldwide
Raj Bala, Arun Chandrasekaran, John McArthur, July 24, 2017
Data Lakes from AWS
Secure
Data Lake
on AWS
Lowest cost
On-premises Data Movement
Real-time Data Movement
Pay Only for the Resources You Use as you Scale
Lowest Cost
• Pay-as-you-go for the resources you consume
• As low as $0.05/GB scanned with Athena
• Run on spare compute capacity with EMR and save up to 90% with Spot
Traditional approach leads to wasted capacity (chart: servers vs. demand)
• Unmet demand: upset players, missed revenue
• Excess capacity: wasted $$$
AWS: Elastic (chart: servers vs. demand)
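The gap between fixed and elastic provisioning can be put in numbers with a small Python sketch. The demand series and the fixed capacity of 70 servers are hypothetical, and the elastic case assumes capacity follows demand exactly in every period.

```python
def provisioning_gap(demand, capacity):
    """Return (unmet demand, excess capacity) summed over all periods."""
    unmet = sum(max(needed - capacity, 0) for needed in demand)
    excess = sum(max(capacity - needed, 0) for needed in demand)
    return unmet, excess


# Hypothetical number of servers needed in each period.
demand = [20, 35, 80, 120, 60, 30]

# Traditional approach: one fixed capacity for every period.
fixed_unmet, fixed_excess = provisioning_gap(demand, capacity=70)

# Elastic approach (idealized): capacity equals demand in each period.
elastic = [provisioning_gap([needed], capacity=needed) for needed in demand]
```

With these numbers the fixed fleet leaves 60 server-periods of demand unserved and pays for 135 idle server-periods, while the idealized elastic fleet has neither.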
AWS databases and analytics
Broad and deep portfolio, built for builders
Business Intelligence & Machine Learning: Amazon QuickSight, Amazon SageMaker, Amazon Comprehend, Amazon Rekognition, Amazon Lex, Amazon Transcribe, AWS DeepLens
RDS: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server; RDS on VMware
Operational Analytics: Amazon Elasticsearch Service
Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect
AWS Marketplace: 250+ solutions; 25+ Blockchain solutions; 30+ solutions
Fortnite | 125+ million players
CHALLENGE
Need to create constant feedback loop for designers
Use Amazon EMR for large batch data processing
Game designers use data to inform their decisions
Diagram labels: game services, databases, APIs, other sources; ETL using EMR; S3 (Data Lake); Tableau/BI; ad-hoc SQL
Demo Overview
https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/
Typical steps of building a data lake
1 Set up storage
Building data lakes can still take months
AWS Lake Formation (join the preview)
Build, secure, and manage a data lake in days
• Build and deploy a fully managed data lake with a few clicks
• Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
• Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
How it works: AWS Lake Formation
Build Data Lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats
Simplify security management
• Enforce encryption
• Define access policies
• Implement audit logging
Enable self-service and combined analytics
• Analysts discover all data available for analysis from a single data catalog
• Use multiple analytics tools over the same data
Diagram labels: CRM, LOB, Devices, Sensors, Web, Social; S3, Data Catalog; Athena, Amazon EMR, Amazon Redshift, Kinesis, Amazon QuickSight
Customer interest in AWS Lake Formation
“We are very excited about the launch of AWS Lake
Formation, which provides a central point of control to
easily load, clean, secure, and catalog data from thousands of
clients to our AWS-based data lake, dramatically reducing
our operational load. … Additionally, AWS Lake Formation
will be HIPAA compliant from day one …”
- Aaron Symanski, CTO, Change Healthcare
Thank you!
Javier Ramirez
@supercoco9
Select AWS Glue customers