Building Data Lakes and Analytics on AWS
Javier Ramirez
AWS Tech Evangelist
@supercoco9

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
To Become a Leader, Data is Your Differentiator

Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations who implemented a Data Lake outperforming similar companies by 9% in organic revenue growth.*

Organic revenue growth: Leaders 24%, Followers 15%

*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
• Streaming is hard
• Duplicating batch/stream is inefficient
• My schemas have evolved
• I need to cleanse my source data
• I cannot query old and new data together
• My data doesn’t fit in one machine
• My data is very fast
• Hadoop ecosystem is hard to manage
• My data scientists don’t like JAVA
• My cluster is running old versions. Upgrading is hard
• My reports make my database server very slow
• And it’s not only transactional
• Map/Reduce is hard to use
• I am not sure which data we are already processing
• I want to use ML

Solution
Before 2009, The DBA years: Overnight DB dump; Read-only replica
2009-2011, The Hadoop epiphany: Hadoop; Map/Reduce all the things
2012-2014, The Message Broker and NoSQL Age: Kafka/RabbitMQ; Cassandra/HBASE/STORM
2015-2017, The Spark kingdom and the spreadsheet wars: Kafka/Spark; Complex ETL
2017-2018, The myth of DataOps: Kafka/Flink (JAVA or Scala required); Complex ETL with a pinch of ML
Other solutions on the slide: Basic ETL; Hive; Commercial distributions; Create new departments for data governance; Spreadsheet all the things; Apache Atlas
Some problems during all periods

• My team spends more time maintaining the cluster than adding functionality
• Security and monitoring are hard
• Most of the time my cluster is sitting idle; then it’s a bottleneck
• I don’t have the time to experiment
• Data preparation, cleansing, and basic transformations take a disproportionately high amount of my time. And it’s so frustrating
Some simple things that scare me (and eat my productivity)

• Text encodings
• Empty strings. Literal "NULL" strings
• Uppercase and lowercase
• Date and time formats: which date would you say this is: 1/4/19? And this: 1553589297?
• CSV, especially if uploaded by end users
• JSON files with a single array and 200,000 records inside
• The same JSON file when row 176,543 has a column never seen before
• The same JSON file when all the numbers are strings
• XML
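A few of the items above can be made concrete with a small standard-library sketch. It is illustrative only: the set of "null" markers, the helper names, and the sample record are assumptions, and reading 1553589297 as seconds since the Unix epoch is only the most likely interpretation.

```python
import json
from datetime import datetime, timezone

# Assumption: these are the spellings of "missing" we expect in the sample data.
NULL_MARKERS = {"", "null", "none", "n/a"}


def normalize_missing(value):
    """Map empty strings and literal "NULL"-style strings to None."""
    if isinstance(value, str) and value.strip().lower() in NULL_MARKERS:
        return None
    return value


def coerce_number(value):
    """Turn numeric strings such as "7" or "3.5" into int/float; leave the rest alone."""
    if not isinstance(value, str):
        return value
    text = value.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return value


def clean_record(record):
    """Apply both fixes to every field of one flat JSON record."""
    return {key: coerce_number(normalize_missing(val)) for key, val in record.items()}


# The same text read with two conventions gives two different dates.
as_month_first = datetime.strptime("1/4/19", "%m/%d/%y").date()
as_day_first = datetime.strptime("1/4/19", "%d/%m/%y").date()

# Assumption: a bare integer of this size is seconds since the Unix epoch.
as_epoch = datetime.fromtimestamp(1553589297, tz=timezone.utc)

# Hypothetical record in which every number arrived as a string.
raw = json.loads('[{"id": "7", "price": "3.5", "city": "NULL", "note": ""}]')
cleaned = [clean_record(r) for r in raw]

print(as_month_first, as_day_first, as_epoch.isoformat())
print(cleaned)
```

The point is less the helpers themselves than that each rule (what counts as null, which date convention applies) has to be decided explicitly for every source.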
The downfall of the data engineer

“Watching paint dry is exciting in comparison to writing and maintaining Extract Transform and Load (ETL) logic. Most ETL jobs take a long time to execute and errors or issues tend to happen at runtime or are post-runtime assertions. Since the development time to execution time ratio is typically low, being productive means juggling with multiple pipelines at once and inherently doing a lot of context switching. By the time one of your 5 running “big data jobs” has finished, you have to get back in the mind space you were in many hours ago and craft your next iteration. Depending on how caffeinated you are, how long it’s been since the last iteration, and how systematic you are, you may fail at restoring the full context in your short term memory. This leads to systemic, stupid errors that waste hours.”

Maxime Beauchemin, Data engineer extraordinaire at Lyft, creator of Apache Airflow and Apache Superset.
Ex-Facebook, Ex-Yahoo!, Ex-Airbnb

https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b
More data lakes & analytics on AWS than anywhere else

A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale

Data Lakes, Analytics, and ML Portfolio from AWS
Broadest, deepest set of analytic services

Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly

Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight

Data Lake on AWS: Storage | Archival Storage | Data Catalog

On-premises Data Movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service, AWS Storage Gateway

Real-time Data Movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
Data Movement From On-premises Datacenters

AWS Direct Connect: Establishes a dedicated network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections

AWS Storage Gateway: Lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism, bandwidth management, and a local cache

AWS Database Migration Service: Migrates databases from the most widely used commercial and open-source offerings to AWS quickly and securely with minimal downtime to applications

AWS Snowball, Snowball Edge and Snowmobile: Petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud
Data Movement From Real-time Sources

AWS IoT Core: Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely

Amazon Kinesis Data Firehose: Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools

Amazon Kinesis Data Streams: Build custom, real-time applications that process data streams using popular stream processing frameworks

Amazon Kinesis Video Streams: Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing

Managed Streaming For Kafka: Fully managed open-source platform for building real-time streaming data pipelines and applications

Durability, Availability & Scalability: Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be replicated automatically to any other AWS region

Security and Compliance: Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie

Query in Place: Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of the data, improving analytics performance by up to 400%

Flexible Management: Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
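As a rough illustration of a lifecycle policy that automates tiering and retention, the sketch below builds a rule as a plain Python dictionary. The rule ID, prefix, and day counts are hypothetical. The layout follows the structure that S3's lifecycle configuration API accepts (for example through boto3's `put_bucket_lifecycle_configuration`), but nothing here calls AWS, and the sanity check is an added convenience, not part of any API.

```python
import json

# Hypothetical rule: the ID, prefix and day counts are illustrative only.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-then-expire-raw-logs",
            "Filter": {"Prefix": "raw/logs/"},
            "Status": "Enabled",
            # Tiering: move objects to cheaper storage classes as they age.
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Retention: delete objects one year after creation.
            "Expiration": {"Days": 365},
        }
    ]
}


def transitions_are_ordered(rule):
    """Check that tiers are listed by increasing age and end before expiration."""
    days = [transition["Days"] for transition in rule["Transitions"]]
    return days == sorted(days) and days[-1] < rule["Expiration"]["Days"]


for rule in lifecycle_configuration["Rules"]:
    if not transitions_are_ordered(rule):
        raise ValueError(f"rule {rule['ID']} has inconsistent day counts")

print(json.dumps(lifecycle_configuration, indent=2))
```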

Unmatched Durability and Availability
Scalable and durable

• Designed to deliver 99.999999999% durability
• Geographic redundancy & automatic replication
• Store data in multiple data centers across 3 AZs in a single region
• Seamlessly replicates data between any regions (but don’t run analytics across regions: latency and cost will not be efficient)
• S3 has trillions of objects and exabytes of data
• Built to store any amount of data
• Runs on the world’s largest global cloud infrastructure
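To get a feel for what eleven nines means, the durability figure can be read as the probability that a single object survives one year. That reading is a simplifying assumption made for this sketch, and the function names are illustrative. Exact fractions are used so the arithmetic is not blurred by floating-point rounding.

```python
from fractions import Fraction

# 1 - 0.99999999999: chance of losing one given object in one year,
# under the simplifying assumption described above.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_annual_losses(object_count):
    """Average number of objects lost per year for a given object count."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


objects = 10_000_000  # hypothetical data lake size
print(float(expected_annual_losses(objects)))  # expected objects lost per year
print(int(years_per_lost_object(objects)))     # expected years per lost object
```

With ten million objects this works out to an average of one lost object every 10,000 years.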

Durability, Availability & Scalability: Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be replicated automatically to any other AWS region

Retrieves data in minutes: Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes

Secure: Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements

Inexpensive: Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost

[Chart legend: Building training sets; Cleaning and organizing data; Collecting data sets; Mining data for patterns; Refining algorithms; Other]
Storing is Not Enough, Data Needs to Be Discoverable

“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data

CRM | ERP | Data warehouse | Mainframe data | Web | Social | Log files | Machine data | Semi-structured | Unstructured

Glue Data Catalog
Discover data and extract schema

• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient

Compliance

Crawlers
Automatically catalog your data

Crawlers automatically build your Data Catalog and keep it in sync.

• Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
• Built-in classifiers for popular types; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless – only pay when the crawler runs
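"Hive style partitions" are `column=value` folders in the object key. The sketch below shows the naming convention a crawler can pick up, using only the standard library. It is not how Glue itself is implemented, and the bucket name, key, and function name are hypothetical.

```python
from urllib.parse import urlparse


def hive_partitions(s3_uri):
    """Return (bucket, {partition_column: value}) for an s3:// URI."""
    parsed = urlparse(s3_uri)
    partitions = {}
    # Every path segment except the last is a "folder"; the last is the object name.
    for segment in parsed.path.strip("/").split("/")[:-1]:
        if "=" in segment:
            column, _, value = segment.partition("=")
            partitions[column] = value
    return parsed.netloc, partitions


# Hypothetical object key laid out as year=/month=/day= folders.
bucket, parts = hive_partitions(
    "s3://example-data-lake/events/year=2019/month=03/day=26/part-0000.parquet"
)
print(bucket, parts)
```

Laying data out this way lets query engines skip whole folders when a query filters on a partition column.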
AWS Glue—ETL Service
Make ETL scripting and deployment easy

• Automatically generates ETL code: Spark (Scala/Python) or Python shell script
• Code is customizable (demo later on. Yay!)
• Endpoints provided to edit, debug, and test code
• Jobs are scheduled or event-based
• Serverless

Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning

Low cost: Flexible billing with per-second billing, EC2 spot, reserved instances and auto-scaling to reduce costs 50–80%

Use S3 storage: Process data directly in the S3 data lake securely with high performance using the EMRFS connector

Latest versions: Updated with the latest open source frameworks within 30 days of release
Amazon EMR— More than just managed Hadoop

Fast at scale: Columnar storage technology to improve I/O efficiency and scale query performance

Open file formats: Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3

Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance

Inexpensive: As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in S3 data lake

[Diagram: Redshift Spectrum query engine over Redshift data and the S3 data lake]

• Exabyte-scale Redshift SQL queries against S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Avro, & Parquet data formats
• Pay only for the amount of data scanned

Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017

https://youtu.be/RpPf38L0HHU?t=3963
Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)

Query Instantly: Zero setup cost; just point to S3 and start querying

Pay per query: Pay only for queries run; save 30–90% on per-query costs through compression

Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types

Easy: Serverless: zero infrastructure, zero administration; integrated with QuickSight
Amazon QuickSight

Easy | Empower everyone | Seamless connectivity | Fast analysis | Serverless

Now with ML superpowers!

[Diagram: Data Lake on AWS (open and comprehensive, secure, scalable and durable, cost-effective), surrounded by Machine Learning, Analytics, On-premises Data Movement, and Real-time Data Movement]

Customers need to have multiple levels of security, identity and access management, encryption, and compliance to secure their data lake

Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC

Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations

Encryption: AWS Certificate Manager, AWS Key Management Service, Encryption at rest, Encryption in transit, Bring your own keys, HSM support

Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
Compliance: Virtually Every Regulatory Agency

Global
• CSA: Cloud Security Alliance Controls
• ISO 9001: Global Quality Standard
• ISO 27001: Security Management Controls
• ISO 27017: Cloud Specific Controls
• ISO 27018: Personal Data Protection
• PCI DSS Level 1: Payment Card Standards
• SOC 1: Audit Controls Report
• SOC 2: Security, Availability, & Confidentiality Report
• SOC 3: General Controls Report

United States
• CJIS: Criminal Justice Information Services
• DoD SRG: DoD Data Processing
• FedRAMP: Government Data Standards
• FERPA: Educational Privacy Act
• FFIEC: Financial Institutions Regulation
• FIPS: Government Security Standards
• FISMA: Federal Information Security Management
• GxP: Quality Guidelines and Regulations
• HIPAA: Protected Health Information
• ITAR: International Arms Regulations
• MPAA: Protected Media Content
• NIST: National Institute of Standards and Technology
• SEC Rule 17a-4(f): Financial Data Standards
• VPAT/Section 508: Accountability Standards

Asia Pacific
• FISC [Japan]: Financial Industry Information Systems
• IRAP [Australia]: Australian Security Standards
• K-ISMS [Korea]: Korean Information Security
• MTCS Tier 3 [Singapore]: Multi-Tier Cloud Security Standard
• My Number Act [Japan]: Personal Information Protection

Europe
• C5 [Germany]: Operational Security Attestation
• Cyber Essentials Plus [UK]: Cyber Threat Protection
• G-Cloud [UK]: UK Government Standards
• IT-Grundschutz [Germany]: Baseline Protection Methodology
Data Lakes from AWS


“…the scale at which AWS operates its public cloud storage services dwarfs the other vendors in this Magic Quadrant.”
- Gartner Magic Quadrant for Public Cloud Storage Services, Worldwide
Raj Bala, Arun Chandrasekaran, John McArthur, July 24, 2017

For example: Amazon S3 holds trillions of objects and regularly peaks at millions of requests per second

Pay Only for the Resources You Use as you Scale
Lowest Cost

Traditional approach leads to wasted capacity
[Chart, Traditional: Rigid. Servers vs. demand: unmet demand (upset players, missed revenue) and excess capacity (wasted $$$)]

AWS approach: pay for the capacity you use
[Chart, AWS: Elastic. Capacity follows demand]

• Pay-as-you-go for the resources you consume
• As low as $0.05/GB scanned with Athena
• EMR and Athena can automatically scale down resources after a job completes, saving you costs
• Commit to a set term and save up to 75% with Reserved Instances
• Run on spare compute capacity with EMR and save up to 90% with Spot
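The arithmetic behind the rigid versus elastic comparison can be sketched in a few lines. The hourly rate and the job schedule below are hypothetical. The 75% and 90% figures are the "up to" discounts quoted above, so the results are best cases, not a quote of actual prices.

```python
from decimal import Decimal

HOURS_PER_YEAR = 24 * 365  # 8760


def cost(hourly_rate, hours, discount=Decimal("0")):
    """Cost of running one instance for `hours`, after a fractional discount."""
    return hourly_rate * hours * (Decimal("1") - discount)


ON_DEMAND_RATE = Decimal("0.10")  # hypothetical USD per instance-hour
JOB_HOURS = 2 * 365               # hypothetical: one two-hour job per day

# Rigid: the instance runs all year whether or not there is work.
always_on = cost(ON_DEMAND_RATE, HOURS_PER_YEAR)
always_on_reserved = cost(ON_DEMAND_RATE, HOURS_PER_YEAR, Decimal("0.75"))

# Elastic: the instance exists only while the job runs.
transient = cost(ON_DEMAND_RATE, JOB_HOURS)
transient_spot = cost(ON_DEMAND_RATE, JOB_HOURS, Decimal("0.90"))

for label, value in [
    ("always on, on-demand", always_on),
    ("always on, reserved", always_on_reserved),
    ("transient, on-demand", transient),
    ("transient, spot", transient_spot),
]:
    print(f"{label}: ${value:.2f} per year")
```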

AWS databases and analytics
Broad and deep portfolio, built for builders

Business Intelligence & Machine Learning: Amazon QuickSight, Amazon SageMaker, Amazon Comprehend, Amazon Rekognition, Amazon Lex, Amazon Transcribe, AWS DeepLens

Databases: RDS (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server), RDS on VMware, Aurora (MySQL, PostgreSQL), DynamoDB (Key value, Document), ElastiCache (Redis, Memcached), Neptune (Graph), Timestream (Time Series), QLDB (Ledger Database)

Analytics: Amazon Redshift (Data warehousing), Amazon EMR (Hadoop + Spark), Athena (Interactive analytics), Kinesis Analytics (Real-time), Amazon Elasticsearch Service (Operational Analytics)

Blockchain: Managed Blockchain, Blockchain Templates

Data Lakes: S3/Amazon Glacier, Lake Formation (Data Lake), AWS Glue (ETL & Data Catalog)

AWS Marketplace: 250+ solutions; 730+ Database solutions; 600+ Analytics solutions; 25+ Blockchain solutions; 20+ Data lake solutions

CHALLENGE
Need to create constant feedback loop for designers

Gain up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, thus resulting in the most popular game played in the world

Epic Games uses Data Lakes and analytics

Entire analytics platform running on AWS
S3 leveraged as a Data Lake
All telemetry data is collected with Kinesis
Real-time analytics done through Spark on EMR, DynamoDB to create scoreboards and real-time queries
Use Amazon EMR for large batch data processing
Game designers use data to inform their decisions

[Architecture diagram. Labels: Game clients, Game servers, Launcher, Game services, Databases, APIs, Other sources, Kinesis. Near real-time pipelines: Spark on EMR, DynamoDB, Scoreboards API, Limited Raw Data (real time ad-hoc SQL), User ETL (metric definition), Grafana. Batch pipelines: S3 (Data Lake), ETL using EMR, Tableau/BI, Ad-hoc SQL]

https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/
Typical steps of building a data lake

1 Set up storage
2 Move data
3 Cleanse, prep, and catalog data
4 Configure and enforce security and compliance policies
5 Make data available for analytics

Build a data lake in days, not months: Build and deploy a fully managed data lake with a few clicks

Enforce security policies across multiple services: Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications

Combine different analytics approaches: Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
How it works: AWS Lake Formation

Build Data Lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats

Simplify security management
• Enforce encryption
• Define access policies
• Implement audit logging

Enable self-service and combined analytics
• Analysts discover all data available for analysis from a single data catalog
• Use multiple analytics tools over the same data

[Diagram. Sources: OLTP, ERP, CRM, LOB, Devices, Sensors, Web, Social. Data lake: S3, Data Catalog, IAM, KMS. Consumers: AI Services, Athena, Amazon EMR, Amazon Redshift, Kinesis, Amazon QuickSight]
Customer interest in AWS Lake Formation

“We are very excited about the launch of AWS Lake Formation, which provides a central point of control to easily load, clean, secure, and catalog data from thousands of clients to our AWS-based data lake, dramatically reducing our operational load. … Additionally, AWS Lake Formation will be HIPAA compliant from day one …”
- Aaron Symanski, CTO, Change Healthcare

“I can’t wait for my team to get our hands on AWS Lake Formation. With an enterprise-ready option like Lake Formation, we will be able to spend more time deriving value from our data rather than doing the heavy lifting involved in manually setting up and managing our data lake.”
- Joshua Couch, VP Engineering, Fender Digital

Integral As1000
100% (3)
Integral As1000
365 pages
APC Building Data Lakes On AWS SG
No ratings yet
APC Building Data Lakes On AWS SG
187 pages
DataAnalytics AWS PDF
No ratings yet
DataAnalytics AWS PDF
133 pages
Managing Workplace Monitoring and Surveillance
No ratings yet
Managing Workplace Monitoring and Surveillance
10 pages
Modernize Your Analyticsand Data Architecture
No ratings yet
Modernize Your Analyticsand Data Architecture
47 pages
Big Data PDF
No ratings yet
Big Data PDF
18 pages
Data Lake On Aws
No ratings yet
Data Lake On Aws
29 pages
aws-storage-and-edge-processin-43f14047-8b05-4f12-ba0c-9a30775fec9b-1748041998-180511161600
No ratings yet
aws-storage-and-edge-processin-43f14047-8b05-4f12-ba0c-9a30775fec9b-1748041998-180511161600
45 pages
AWS Data-Lake Ebook
No ratings yet
AWS Data-Lake Ebook
9 pages
Building Data Lakes
No ratings yet
Building Data Lakes
40 pages
Aiesec X Aws Workshop
No ratings yet
Aiesec X Aws Workshop
45 pages
Data Lakes For Maximum Flexibility
No ratings yet
Data Lakes For Maximum Flexibility
29 pages
AWS Data Lake
No ratings yet
AWS Data Lake
87 pages
AWS ML Cheat Sheet Nov 2024
No ratings yet
AWS ML Cheat Sheet Nov 2024
100 pages
DocScanner 20 Oct 2024 2-19 PM
No ratings yet
DocScanner 20 Oct 2024 2-19 PM
16 pages
AWS+Data+Lake (1)
No ratings yet
AWS+Data+Lake (1)
118 pages
Lec-AWS-part 2
No ratings yet
Lec-AWS-part 2
76 pages
Awsdataanalyticsonawstechnicaliltinstructordeck2023 230304021823 0674c2bb
No ratings yet
Awsdataanalyticsonawstechnicaliltinstructordeck2023 230304021823 0674c2bb
146 pages
Final project on data lakes with AWS
No ratings yet
Final project on data lakes with AWS
2 pages
Analytics Services v2
No ratings yet
Analytics Services v2
59 pages
Modernserverlessdatalak
No ratings yet
Modernserverlessdatalak
45 pages
AWS - 06 - Best Practice To Secure DataLake
No ratings yet
AWS - 06 - Best Practice To Secure DataLake
75 pages
lect-AWS-part 1
No ratings yet
lect-AWS-part 1
77 pages
1 AWS Analytics and Data Lakes
No ratings yet
1 AWS Analytics and Data Lakes
15 pages
1 +Craig+Stires+-+Modernize+and+Monetize+Your+Data+Platform PDF
No ratings yet
1 +Craig+Stires+-+Modernize+and+Monetize+Your+Data+Platform PDF
20 pages
PSO Data Analytics Day 1
100% (1)
PSO Data Analytics Day 1
106 pages
AWS Data Lake
100% (1)
AWS Data Lake
104 pages
PDF Handout - Opening Keynote
No ratings yet
PDF Handout - Opening Keynote
48 pages
AWS Data Analytics - Technical - Student
No ratings yet
AWS Data Analytics - Technical - Student
160 pages
Aws Data Service Notes
No ratings yet
Aws Data Service Notes
9 pages
Whats New With Amazon S3 STG201
No ratings yet
Whats New With Amazon S3 STG201
35 pages
Migrate Your On-Premise Data Warehouse To Amazon Redshift: Noman Jaffery
100% (1)
Migrate Your On-Premise Data Warehouse To Amazon Redshift: Noman Jaffery
18 pages
AWS Innovate23 Data Agenda
No ratings yet
AWS Innovate23 Data Agenda
1 page
Redshift-DA Handout
No ratings yet
Redshift-DA Handout
121 pages
Handout__Accelerate_your_analytics_journey_with_SAP_data
No ratings yet
Handout__Accelerate_your_analytics_journey_with_SAP_data
14 pages
NorthBays CRISP Artificial Data Lakes
No ratings yet
NorthBays CRISP Artificial Data Lakes
149 pages
Data Lake On The Aws Cloud With Talend Big Data Platform
100% (1)
Data Lake On The Aws Cloud With Talend Big Data Platform
13 pages
AWS Services
No ratings yet
AWS Services
34 pages
Amazon S3: An Storage Service
No ratings yet
Amazon S3: An Storage Service
14 pages
AWS Data Engineering Services
No ratings yet
AWS Data Engineering Services
24 pages
Aws TC q2 2023 Da b2b Ebook Final
No ratings yet
Aws TC q2 2023 Da b2b Ebook Final
17 pages
Databases On AWS: Uriel Ramírez, Solutions Architect Armando Barrales, Solutions Architect
No ratings yet
Databases On AWS: Uriel Ramírez, Solutions Architect Armando Barrales, Solutions Architect
48 pages
AWS Data Analytics Specialty Exam Cram Notes
No ratings yet
AWS Data Analytics Specialty Exam Cram Notes
43 pages
AWS Quick Start - AWS Purpose-Built Database Strategy - Final
No ratings yet
AWS Quick Start - AWS Purpose-Built Database Strategy - Final
32 pages
BDC Output 10
No ratings yet
BDC Output 10
7 pages
Big Data Analytics Options On AWS
100% (1)
Big Data Analytics Options On AWS
50 pages
Module 3 - Databases_on_AWS
No ratings yet
Module 3 - Databases_on_AWS
59 pages
Aws 101 Presentation Deck August 2014 1
No ratings yet
Aws 101 Presentation Deck August 2014 1
47 pages
Modern Data Architectures Using The AWS WellArchitected Data Analytics Lens REPEAT ARC321-R2
100% (1)
Modern Data Architectures Using The AWS WellArchitected Data Analytics Lens REPEAT ARC321-R2
19 pages
Reinvent Online Recap 2018 v5 425675166 190103174030 PDF
No ratings yet
Reinvent Online Recap 2018 v5 425675166 190103174030 PDF
50 pages
AWS 05 DataLake
No ratings yet
AWS 05 DataLake
78 pages
ANT205 R Achieving Your Modern Data Architecture
No ratings yet
ANT205 R Achieving Your Modern Data Architecture
71 pages
Harness Data To Reinvent Your Organization
No ratings yet
Harness Data To Reinvent Your Organization
20 pages
Event Streaming With Modern Data Pipelines in A SaaS Architecture ISV201
No ratings yet
Event Streaming With Modern Data Pipelines in A SaaS Architecture ISV201
22 pages
Alex Casalboni Advanced Serverless Architectural Patterns On AWS
No ratings yet
Alex Casalboni Advanced Serverless Architectural Patterns On AWS
48 pages
Ppb1 Workshop Batch v2
No ratings yet
Ppb1 Workshop Batch v2
43 pages
Big Data Architectural Patterns and Best Practices On AWS Presentation
100% (1)
Big Data Architectural Patterns and Best Practices On AWS Presentation
56 pages
Summer Internship Report On: Aws Data Engineering (Topic)
No ratings yet
Summer Internship Report On: Aws Data Engineering (Topic)
21 pages
Aws 101 Presentation Deck August 2014 1
No ratings yet
Aws 101 Presentation Deck August 2014 1
47 pages
Data Storage and AWS
No ratings yet
Data Storage and AWS
24 pages
Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala
From Everand
Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala
Eric Tome
No ratings yet
Learning Cascading
From Everand
Learning Cascading
Michael Covert
No ratings yet
Data Classification Template
No ratings yet
Data Classification Template
4 pages
CIS Controls v8 Privacy Guide.22.01
100% (1)
CIS Controls v8 Privacy Guide.22.01
84 pages
Greenway Health, LLC v. Se. Ala. Rural Health Assocs
No ratings yet
Greenway Health, LLC v. Se. Ala. Rural Health Assocs
10 pages
CIS Guide To Enterprise Assets and Software ONLINE 2022 0330 1
100% (1)
CIS Guide To Enterprise Assets and Software ONLINE 2022 0330 1
5 pages
How Many California Counties Use 'Glitchy' Dominion Voting System? - California Globe
No ratings yet
How Many California Counties Use 'Glitchy' Dominion Voting System? - California Globe
32 pages
(2022) CIS Controls Cloud Companion Guide - CIS
No ratings yet
(2022) CIS Controls Cloud Companion Guide - CIS
55 pages
The DELIAN Project: Democracy Through Technology - Clinton Foundation
No ratings yet
The DELIAN Project: Democracy Through Technology - Clinton Foundation
4 pages
The Influence of Employee Personality On Information Security Joa Eng 1007
No ratings yet
The Influence of Employee Personality On Information Security Joa Eng 1007
7 pages
FTC 2020 0045 0001 - Content
No ratings yet
FTC 2020 0045 0001 - Content
3 pages
Theft Perception - Democracy Fund Voter Study Group
No ratings yet