Building Data Lakes and Analytics on AWS
Javier Ramirez
AWS Tech Evangelist
@supercoco9

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
To Become a Leader, Data is Your Differentiator

Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations that implemented a data lake outperforming similar companies by 9% in organic revenue growth.*

Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence

Problems
• My reports make my database server very slow
• My data doesn’t fit in one machine
• And it’s not only transactional
• My data is very fast
• Map/Reduce is hard to use
• Hadoop ecosystem is hard to manage
• My data scientists don’t like Java
• I am not sure which data we are already processing
• I cannot query old and new data together
• My schemas have evolved
• I need to cleanse my source data
• My cluster is running old versions. Upgrading is hard
• Streaming is hard
• Duplicating batch/stream is inefficient
• I want to use ML

Solution
• Before 2009, the DBA years: overnight DB dump; read-only replica; basic ETL
• 2009-2011, the Hadoop epiphany: Hadoop; Map/Reduce all the things; Hive
• 2012-2014, the Message Broker and NoSQL Age: Kafka/RabbitMQ; Cassandra/HBase/Storm; commercial distributions
• 2015-2017, the Spark kingdom and the spreadsheet wars: Kafka/Spark; complex ETL; create new departments for data governance; spreadsheet all the things
• 2017-2018, the myth of DataOps: Kafka/Flink (Java or Scala required); complex ETL with a pinch of ML; Apache Atlas
Some problems during all periods

• My team spends more time maintaining the cluster than adding functionality

• Security and monitoring are hard

• Most of the time my cluster sits idle; then it becomes a bottleneck

• I don’t have the time to experiment

• Data preparation, cleansing, and basic transformations take a disproportionately high amount of my time. And it’s so frustrating

Some simple things that scare me (and eat my productivity)
• Text encodings

• Empty strings. Literal “NULL” strings

• Uppercase and Lowercase

• Date and time formats: which date would you say 1/4/19 is? And 1553589297?

• CSV, especially if uploaded by end users

• JSON files with a single array and 200,000 records inside

• The same JSON file when row 176,543 has a column never seen before

• The same JSON file when all the numbers are strings

• XML
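
Several of the value-level problems in the list above (literal NULL strings, numbers stored as strings, ambiguous dates, epoch timestamps) can be handled with a few small helpers. The following is a minimal sketch using only the Python standard library. The function names and the set of NULL markers are illustrative assumptions, not something taken from the talk.

```python
from datetime import datetime, timezone

# Assumed spellings of "missing"; extend this set for your own sources.
NULL_MARKERS = {"", "null", "none", "n/a"}


def clean_text(value):
    """Trim whitespace and map empty or literal NULL-like strings to None."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in NULL_MARKERS else text


def parse_number(value):
    """Turn numbers that arrive as strings into int or float; leave the rest alone.

    Note that float() also accepts strings such as "nan" and "inf".
    """
    if not isinstance(value, str):
        return value
    for cast in (int, float):
        try:
            return cast(value.strip())
        except ValueError:
            continue
    return value


def parse_short_date(value, day_first):
    """Parse d/m/yy or m/d/yy; the caller has to state which convention applies."""
    fmt = "%d/%m/%y" if day_first else "%m/%d/%y"
    return datetime.strptime(value, fmt).date()


def parse_epoch_seconds(value):
    """Interpret an integer as seconds since 1970-01-01 UTC."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


print(parse_short_date("1/4/19", day_first=True))   # 2019-04-01
print(parse_short_date("1/4/19", day_first=False))  # 2019-01-04
print(parse_epoch_seconds(1553589297).isoformat())  # 2019-03-26T08:34:57+00:00
```

The date helper deliberately refuses to guess: 1/4/19 is 1 April or 4 January depending on the source, so the convention is passed in explicitly.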
The downfall of the data engineer

“Watching paint dry is exciting in comparison to writing and maintaining Extract Transform and Load (ETL) logic. Most ETL jobs take a long time to execute and errors or issues tend to happen at runtime or are post-runtime assertions. Since the development time to execution time ratio is typically low, being productive means juggling with multiple pipelines at once and inherently doing a lot of context switching. By the time one of your 5 running “big data jobs” has finished, you have to get back in the mind space you were in many hours ago and craft your next iteration. Depending on how caffeinated you are, how long it’s been since the last iteration, and how systematic you are, you may fail at restoring the full context in your short term memory. This leads to systemic, stupid errors that waste hours.”

Maxime Beauchemin, Data engineer extraordinaire at Lyft, creator of Apache Airflow and Apache Superset.
Ex-Facebook, Ex-Yahoo!, Ex-Airbnb

https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b

Solution

More data lakes & analytics on AWS than anywhere else

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale

Data Lakes, Analytics, and ML Portfolio from AWS
Broadest, deepest set of analytic services

Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly

Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight

Data Lake on AWS: Storage | Archival Storage | Data Catalog

On-premises Data Movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service, AWS Storage Gateway

Real-time Data Movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
Data Movement From On-premises Datacenters

• AWS Direct Connect: Establish a dedicated network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections
• AWS Storage Gateway: Lets your on-premises applications use AWS for storage; includes a highly-optimized data transfer mechanism and bandwidth management, along with a local cache
• AWS Database Migration Service: Migrate databases from the most widely-used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications
• AWS Snowball, Snowball Edge and Snowmobile: Petabyte- and exabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud

Data Movement From Real-time Sources

• AWS IoT Core: Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
• Amazon Kinesis Data Firehose: Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
• Amazon Kinesis Data Streams: Build custom, real-time applications that process data streams using popular stream processing frameworks
• Amazon Kinesis Video Streams: Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
• Managed Streaming for Kafka: Fully managed open-source platform for building real-time streaming data pipelines and applications

Amazon S3—Object Storage

• Durability, Availability & Scalability: Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Security and Compliance: Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
• Query in Place: Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
• Flexible Management: Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention

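
The lifecycle policies mentioned above are plain configuration documents. As a hedged illustration, the sketch below builds one rule in the shape that the S3 lifecycle configuration API expects (for example as the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`). The prefix, the day counts, and the helper name `lifecycle_rule` are assumptions made up for this example; nothing is sent to AWS.

```python
import json


def lifecycle_rule(prefix, archive_after_days, expire_after_days):
    """Build one rule: move objects under `prefix` to Glacier, then expire them."""
    if archive_after_days >= expire_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "ID": "tier-and-expire-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Tiering: transition to the Glacier storage class after N days.
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
        # Retention: delete the object after M days.
        "Expiration": {"Days": expire_after_days},
    }


# Hypothetical policy: raw logs go to Glacier after 90 days and are deleted after a year.
config = {"Rules": [lifecycle_rule("raw/logs/", 90, 365)]}
print(json.dumps(config, indent=2))
```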
Unmatched Durability and Availability
Scalable and durable

• Designed to deliver 99.999999999% durability

• Geographic redundancy & automatic replication

• Store data in multiple data centers across 3 AZs in a single region

• Seamlessly replicates data between any regions (but don’t run analytics across regions: latency and cost will not be efficient)

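
To get a feel for what eleven nines means, the arithmetic can be worked through directly. This sketch assumes the durability figure is read as an average annual per-object survival probability; the object count is an arbitrary example.

```python
from fractions import Fraction

# 99.999999999% durability, kept as an exact fraction to avoid rounding noise.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count):
    """Expected number of objects lost per year under the assumption above."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


objects = 10_000_000
print(expected_annual_losses(objects))  # 1/10000
print(years_per_lost_object(objects))   # 10000
```

With ten million objects stored, the expected loss is one object every 10,000 years.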
Any Scale
Scalable and durable

• S3 has trillions of objects and exabytes of data

• Built to store any amount of data

• Runs on the world’s largest global cloud infrastructure

Amazon Glacier—Backup and Archive

• Durability, Availability & Scalability: Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Retrieves data in minutes: Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
• Secure: Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements
• Inexpensive: Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost

Data Preparation Accounts for ~80% of the Work

Chart categories: building training sets; cleaning and organizing data; collecting data sets; mining data for patterns; refining algorithms; other

Storing is Not Enough, Data Needs to Be Discoverable

“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data

CRM | ERP | Data warehouse | Mainframe data | Web | Social | Log files | Machine data | Semi-structured | Unstructured

AWS Glue—Data Catalog
Make data discoverable

• Automatically discovers data and stores schema

• Catalog makes data searchable, and available for ETL

• Catalog contains table and job definitions

• Computes statistics to make queries efficient

Diagram labels: Glue Data Catalog; Discover data and extract schema; Compliance

AWS Glue Crawlers
Automatically catalog your data

Crawlers automatically build your Data Catalog and keep it in sync.

• Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
• Built-in classifiers for popular types; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless: only pay when the crawler runs

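
Hive-style partitions are folders named `column=value` inside an object key. The sketch below shows the idea behind detecting them from a key. It is a simplified illustration, not the crawler's actual implementation, and the key layout and function name are assumptions.

```python
def parse_hive_partitions(key):
    """Split an object key into its Hive-style partitions and the file name.

    Folders shaped like 'name=value' become partition columns; other folders
    (for example the table prefix) are ignored.
    """
    *folders, filename = key.split("/")
    partitions = {}
    for folder in folders:
        name, separator, value = folder.partition("=")
        if separator and name and value:
            partitions[name] = value
    return partitions, filename


partitions, filename = parse_hive_partitions(
    "sales/year=2019/month=03/part-0000.parquet"
)
print(partitions)  # {'year': '2019', 'month': '03'}
print(filename)    # part-0000.parquet
```

Partition values come back as strings; deciding that `year` is an integer is a separate schema-inference step.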
AWS Glue—ETL Service
Make ETL scripting and deployment easy

• Automatically generates ETL code: Spark (Scala/Python) or Python shell script

• Code is customizable (demo later on. Yay!)

• Endpoints provided to edit, debug, and test code

• Jobs are scheduled or event-based

• Serverless

Amazon EMR—Big Data Processing

• Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning
• Low cost: Flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto-scaling to reduce costs 50–80%
• Use S3 storage: Process data directly in the S3 data lake securely with high performance using the EMRFS connector
• Latest versions: Updated with the latest open source frameworks within 30 days of release

Amazon EMR—More than just managed Hadoop

Amazon Redshift—Data Warehousing

• Fast at scale: Columnar storage technology to improve I/O efficiency and scale query performance
• Open file formats: Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
• Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance
• Inexpensive: As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour

Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in S3 data lake

• Exabyte-scale Redshift SQL queries against S3

• Join data across Redshift and S3

• Scale compute and storage separately

• Stable query performance and unlimited concurrency

• CSV, ORC, Avro, & Parquet data formats

• Pay only for the amount of data scanned

Diagram: the Redshift Spectrum query engine sits between Redshift data and the S3 data lake

Let’s play a game

Numbers are fun

Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017

https://youtu.be/RpPf38L0HHU?t=3963

Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)

• Query Instantly: Zero setup cost; just point to S3 and start querying
• Pay per query: Pay only for queries run; save 30–90% on per-query costs through compression
• Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
• Easy: Serverless: zero infrastructure, zero administration; integrated with QuickSight

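
Because the bill is proportional to the bytes a query scans, compressing the data shrinks the cost by roughly the compression ratio. The arithmetic below is a sketch under stated assumptions: the price per terabyte, the decimal definition of a terabyte, and the 4:1 compression ratio are illustrative values, so check current pricing before relying on any figure.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def scan_cost_usd(bytes_scanned, price_per_tb_usd):
    """Cost of one query when billing is proportional to bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb_usd


def compressed_size(bytes_raw, compression_ratio):
    """Size after compression, for a ratio such as 4 (meaning 4:1)."""
    return bytes_raw / compression_ratio


def savings_fraction(cost_before, cost_after):
    """Share of the original cost that is saved."""
    return 1 - cost_after / cost_before


raw_bytes = 2 * BYTES_PER_TB   # hypothetical table size
price = 5.0                    # hypothetical USD per TB scanned

cost_raw = scan_cost_usd(raw_bytes, price)
cost_compressed = scan_cost_usd(compressed_size(raw_bytes, 4), price)
print(cost_raw, cost_compressed, savings_fraction(cost_raw, cost_compressed))
# 10.0 2.5 0.75
```

A 4:1 ratio gives a 75% saving, which sits inside the 30–90% range quoted above. Columnar formats can reduce the scan further because a query reads only the columns it needs.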
Amazon QuickSight
Easy | Empower everyone | Seamless connectivity | Fast analysis | Serverless

Now with ML superpowers!

Data Lakes from AWS

Data Lake on AWS, connected to Machine Learning, Analytics, On-premises Data Movement, and Real-time Data Movement:

• Open and comprehensive
• Secure
• Scalable and durable
• Cost-effective

AWS Provides Highest Levels of Security
Secure

Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake

• Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC
• Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
• Encryption: AWS Certificate Manager, AWS Key Management Service, encryption at rest, encryption in transit, bring your own keys, HSM support
• Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail

Compliance: Virtually Every Regulatory Agency
Global
• CSA: Cloud Security Alliance Controls
• ISO 9001: Global Quality Standard
• ISO 27001: Security Management Controls
• ISO 27017: Cloud Specific Controls
• ISO 27018: Personal Data Protection
• PCI DSS Level 1: Payment Card Standards
• SOC 1: Audit Controls Report
• SOC 2: Security, Availability, & Confidentiality Report
• SOC 3: General Controls Report

United States
• CJIS: Criminal Justice Information Services
• DoD SRG: DoD Data Processing
• FedRAMP: Government Data Standards
• FERPA: Educational Privacy Act
• FFIEC: Financial Institutions Regulation
• FIPS: Government Security Standards
• FISMA: Federal Information Security Management
• GxP: Quality Guidelines and Regulations
• HIPAA: Protected Health Information
• ITAR: International Arms Regulations
• MPAA: Protected Media Content
• NIST: National Institute of Standards and Technology
• SEC Rule 17a-4(f): Financial Data Standards
• VPAT/Section 508: Accountability Standards

Asia Pacific
• FISC [Japan]: Financial Industry Information Systems
• IRAP [Australia]: Australian Security Standards
• K-ISMS [Korea]: Korean Information Security
• MTCS Tier 3 [Singapore]: Multi-Tier Cloud Security Standard
• My Number Act [Japan]: Personal Information Protection

Europe
• C5 [Germany]: Operational Security Attestation
• Cyber Essentials Plus [UK]: Cyber Threat Protection
• G-Cloud [UK]: UK Government Standards
• IT-Grundschutz [Germany]: Baseline Protection Methodology

AWS Runs the Largest Global Cloud Infrastructure
Scalable and durable

“…the scale at which AWS operates its public cloud storage services dwarfs the other vendors in this Magic Quadrant.”
- Gartner Magic Quadrant for Public Cloud Storage Services, Worldwide
Raj Bala, Arun Chandrasekaran, John McArthur, July 24, 2017

For example: Amazon S3 holds trillions of objects and regularly peaks at millions of requests per second

Chart: customer data over time

Pay Only for the Resources You Use as You Scale
Lowest Cost

Traditional approach (rigid) leads to wasted capacity
• Unmet demand: upset players, missed revenue
• Excess capacity: wasted $$$

AWS approach (elastic): pay for the capacity you use
• Pay-as-you-go for the resources you consume
• As low as $0.05/GB scanned with Athena
• EMR and Athena can automatically scale down resources after a job completes, saving you costs
• Commit to a set term and save up to 75% with Reserved Instances
• Run on spare compute capacity with EMR and save up to 90% with Spot

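
The difference between the rigid and the elastic model can be made concrete with a small calculation. This is a toy sketch: the hourly demand figures, the fixed capacity of 8 servers, and the function names are invented for illustration, and real elastic capacity never tracks demand perfectly.

```python
def fixed_capacity_report(demand, capacity):
    """Compare hourly demand (in servers) against a fixed fleet size."""
    excess = sum(max(capacity - d, 0) for d in demand)  # paid for but idle
    unmet = sum(max(d - capacity, 0) for d in demand)   # requested but not served
    return {"paid": capacity * len(demand), "excess": excess, "unmet": unmet}


def elastic_capacity_report(demand):
    """Idealized elastic model: capacity follows demand hour by hour."""
    return {"paid": sum(demand), "excess": 0, "unmet": 0}


hourly_demand = [2, 3, 10, 14, 6, 1]  # hypothetical server-hours needed

print(fixed_capacity_report(hourly_demand, capacity=8))
# {'paid': 48, 'excess': 20, 'unmet': 8}
print(elastic_capacity_report(hourly_demand))
# {'paid': 36, 'excess': 0, 'unmet': 0}
```

In this example the fixed fleet pays for 48 server-hours, wastes 20 of them, and still fails to serve 8, while the elastic model pays for the 36 that were actually needed.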
AWS databases and analytics
Broad and deep portfolio, built for builders
Business Intelligence & Machine Learning
• Amazon QuickSight, Amazon SageMaker, Amazon Comprehend, Amazon Rekognition, Amazon Lex, Amazon Transcribe, AWS DeepLens
• AWS Marketplace: 250+ solutions

Databases
• QLDB: Ledger Database
• Neptune: Graph
• ElastiCache: Redis, Memcached
• DynamoDB: Key value, Document
• Aurora: MySQL, PostgreSQL
• Timestream: Time Series
• RDS: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server
• RDS on VMware
• AWS Marketplace: 730+ Database solutions

Analytics
• Amazon Redshift: Data warehousing
• Athena: Interactive analytics
• Amazon EMR: Hadoop + Spark
• Kinesis Analytics: Real-time
• Amazon Elasticsearch Service: Operational Analytics
• AWS Marketplace: 600+ Analytics solutions

Blockchain
• Managed Blockchain
• Blockchain Templates
• AWS Marketplace: 25+ Blockchain solutions

Data Lakes
• S3/Amazon Glacier
• Lake Formation: Data Lake
• AWS Glue: ETL & Data Catalog
• AWS Marketplace: 20+ Data lake solutions

Data Movement
• Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect
• AWS Marketplace: 30+ solutions

Fortnite | 125+ million players

CHALLENGE
Need to create a constant feedback loop for designers

Gain up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, thus resulting in the most popular game played in the world

Epic Games uses Data Lakes and analytics
• Entire analytics platform running on AWS
• S3 leveraged as a Data Lake
• All telemetry data is collected with Kinesis
• Real-time analytics done through Spark on EMR, with DynamoDB to create scoreboards and real-time queries
• Use Amazon EMR for large batch data processing
• Game designers use data to inform their decisions

Architecture diagram
• Sources: game clients, game servers, launcher, game services, databases, APIs, other sources
• Ingestion: Kinesis
• Near real-time pipeline: Spark on EMR and DynamoDB, feeding Grafana, the Scoreboards API, limited raw data (real-time ad-hoc SQL), and user ETL (metric definition)
• Batch pipeline: S3 (Data Lake) and ETL using EMR, feeding Tableau/BI and ad-hoc SQL

Demo Overview

https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/

Typical steps of building a data lake

1 Set up storage
2 Move data
3 Cleanse, prep, and catalog data
4 Configure and enforce security and compliance policies
5 Make data available for analytics

Building data lakes can still take months

AWS Lake Formation (join the preview)
Build, secure, and manage a data lake in days

• Build a data lake in days, not months: Build and deploy a fully managed data lake with a few clicks
• Enforce security policies across multiple services: Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
• Combine different analytics approaches: Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog

How it works: AWS Lake Formation
Build Data Lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats

Simplify security management
• Enforce encryption
• Define access policies
• Implement audit logging

Enable self-service and combined analytics
• Analysts discover all data available for analysis from a single data catalog
• Use multiple analytics tools over the same data

Diagram
• Sources: OLTP, ERP, CRM, LOB, Devices, Sensors, Web, Social
• Data lake: S3 and the Data Catalog, secured with IAM and KMS
• Consumers: AI Services, Athena, Amazon EMR, Amazon Redshift, Kinesis, Amazon QuickSight
Customer interest in AWS Lake Formation
“We are very excited about the launch of AWS Lake Formation, which provides a central point of control to easily load, clean, secure, and catalog data from thousands of clients to our AWS-based data lake, dramatically reducing our operational load. … Additionally, AWS Lake Formation will be HIPAA compliant from day one …”
- Aaron Symanski, CTO, Change Healthcare

“I can’t wait for my team to get our hands on AWS Lake Formation. With an enterprise-ready option like Lake Formation, we will be able to spend more time deriving value from our data rather than doing the heavy lifting involved in manually setting up and managing our data lake.”
- Joshua Couch, VP Engineering, Fender Digital

Thank you!
Javier Ramirez
@supercoco9

Select AWS Glue customers
