AWS Data Lake
"A data lake is a vast pool of raw data, the purpose for
which is not yet defined. A data warehouse is a repository
for structured, filtered data that has already been
processed for a specific purpose."
Source: https://www.talend.com/resources/data-lake-vs-data-warehouse/,
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Copyright © 2019 ChandraMohan Lingam. All Rights Reserved.
Data Lake
Reference: AWS,
https://aws.amazon.com/products/storage/data-lake-storage/infographic/
1. Storage
2. Governance
3. Analytics
Redshift
Elasticsearch
Splunk
Storage Gateway (File Gateway, Volume Gateway, Tape Gateway) to S3
Snowball Appliance to S3
Snowmobile Container to S3
Image Credit: AWS, https://aws.amazon.com/snow/
AWS SDK
Redshift EMR
Data Lake
Variety of formats
Compressed Storage
Parquet Columnar
Extensive Tool Support
List of tools:
https://aws.amazon.com/emr/features/
Generate ETL Script
Run in Spark (EMR)
Data Transformation
Thousands of sources
Small Payloads
EMR
Diagram labels: Ingest, Storage, Process, Analysis, Business Intelligence, Machine Learning, Response
Kinesis Video Stream
Diagram labels: Video Playback, Monitoring, Rekognition, ML
Kinesis Data Stream
Custom real-time applications
Diagram labels: Kinesis Data Analytics, EMR, EC2, Lambda
Kinesis Firehose
Use existing tools to analyze streaming data
Diagram labels: S3, Redshift, Elasticsearch, Splunk
Kinesis Data Analytics
Diagram labels: Data Streams, Firehose, SQL, Kinesis Firehose, Kinesis
AWS Data Stores
Athena
Diagram labels: SQL, S3, Data Lake
“This makes vast amount of unstructured data accessible to any data lake user
who can use SQL.”
Reference: Data Lake on AWS,
https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html
Redshift Spectrum In-place Query
Athena
• Ad-hoc data discovery and SQL querying
Redshift Spectrum
• More complex queries
• Large number of users
Kinesis Data Analytics
Diagram labels: Data Streams, Firehose, SQL, Result Destination
Artificial Intelligence: Video, Image, Natural language processing
Analyze audio, video, image, text data in S3
CloudWatch Logs: Logs, Metric
• Consolidate Logs
• Monitor
AWS CloudTrail: API Calls, Log; SQL with Athena
S3 Lifecycle Management
Object Age
S3 Object Tags
Reference: S3,
https://docs.aws.amazon.com/AmazonS3/latest/dev/analytics-storage-class.html
Object Accessed
Intelligent Tiering
Data Formats
Security and Protection
• Data Lake is centralized
• Consolidates all data in one place
• Protecting and managing data is very important
Resource-based Policy
User-based Policy
Diagram labels: Bucket, Object
Diagram labels: Bucket, Object, Default Key, KMS
With the default key, S3 automatically decrypts the object for any user who is
allowed access to the bucket or object.
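A minimal sketch of how default encryption with KMS might be expressed as a bucket configuration. The helper name `kms_default_encryption`, the bucket name, and the key ARN are hypothetical; the dictionary follows the shape that boto3's `put_bucket_encryption` accepts. Leaving out the key id corresponds to the default key case; passing one corresponds to a customer master key.

```python
def kms_default_encryption(kms_key_id=None):
    """Return a default-encryption configuration that uses SSE-KMS.

    With no key id, S3 uses the AWS managed key for S3 (the default key).
    With a key id, the given customer master key is used instead.
    """
    settings = {"SSEAlgorithm": "aws:kms"}
    if kms_key_id is not None:
        settings["KMSMasterKeyID"] = kms_key_id
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": settings}]}


# Hypothetical key ARN, for illustration only.
EXAMPLE_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"

default_key_config = kms_default_encryption()
cmk_config = kms_default_encryption(EXAMPLE_KEY_ARN)

# To apply (needs boto3 and credentials, so it is left commented out):
# boto3.client("s3").put_bucket_encryption(
#     Bucket="example-data-lake-bucket",
#     ServerSideEncryptionConfiguration=cmk_config,
# )
```

With a customer master key, a user needs permission on the key as well as on the bucket or object, which is the main difference from the default key case.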
S3 Data Encryption – Customer Master Key (CMK)
Diagram labels: Default Key, KMS, Bucket, Customer Master Key, Object, Client, Encrypted Object
Diagram labels: 1 to 6; AZ 1, AZ 2, AZ 3; A.1
Configure Lifecycle rules for current and previous versions
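A lifecycle rule covering both current and previous versions could look roughly like the sketch below. The helper name, rule ID, prefix, day counts, and storage classes are assumptions chosen for illustration; the dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts.

```python
def versioned_lifecycle_config(prefix, current_days, noncurrent_days, expire_days):
    """Build a lifecycle configuration for a versioned bucket.

    Current versions move to STANDARD_IA after `current_days`.
    Previous (noncurrent) versions move to GLACIER after `noncurrent_days`
    and are deleted after `expire_days`.
    """
    if expire_days <= noncurrent_days:
        raise ValueError("expiration must come after the noncurrent transition")
    rule = {
        "ID": "tier-current-and-previous-versions",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": current_days, "StorageClass": "STANDARD_IA"}
        ],
        "NoncurrentVersionTransitions": [
            {"NoncurrentDays": noncurrent_days, "StorageClass": "GLACIER"}
        ],
        "NoncurrentVersionExpiration": {"NoncurrentDays": expire_days},
    }
    return {"Rules": [rule]}


lifecycle = versioned_lifecycle_config("logs/", 30, 30, 365)

# To apply (needs boto3 and credentials, so it is left commented out):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-data-lake-bucket",
#     LifecycleConfiguration=lifecycle,
# )
```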
Source Bucket (Region 1), Replicated Bucket (Region 2)
Classification=PHI
ALLOW Classification=PHI Object
DENY Classification=PHI
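Tag-based access control like the ALLOW/DENY example above can be sketched as a user-based policy document. The helper name and bucket name are hypothetical; `s3:ExistingObjectTag/<key>` is the S3 condition key that matches a tag already on the object.

```python
import json


def object_tag_policy(effect, bucket, tag_key, tag_value):
    """Build a user-based policy that allows or denies reading tagged objects."""
    if effect not in ("Allow", "Deny"):
        raise ValueError("effect must be 'Allow' or 'Deny'")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": effect,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringEquals": {
                        f"s3:ExistingObjectTag/{tag_key}": tag_value
                    }
                },
            }
        ],
    }


# Hypothetical bucket name, for illustration only.
deny_phi = object_tag_policy("Deny", "example-data-lake-bucket", "Classification", "PHI")
allow_phi = object_tag_policy("Allow", "example-data-lake-bucket", "Classification", "PHI")

print(json.dumps(deny_phi, indent=2))
```

An explicit Deny takes precedence over any Allow, so attaching the Deny policy blocks PHI objects even for users who can otherwise read the bucket.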
Security and Protection
Chandra Lingam
Cloud Wave LLC
https://aws.amazon.com
• Always Free
• 12 months free
• Trials
https://aws.amazon.com/free
https://aws.amazon.com/free/free-tier-faqs/
Billing Alerts
Dashboard
• Free-Tier usage
• Monthly Charge Summary
• Itemized charges
• Past Bills and Usage
• Account Issues
• Billing Inquiries
• Service Limit Changes
Sign-in Link
https://<AccountId>.signin.aws.amazon.com/console
https://<Alias>.signin.aws.amazon.com/console
Account Setup
Support
• Storage Class
• S3 Versioning
• Age Based Retention
• Storage Tiering
• Replication
• Encryption with SSE-S3 and KMS
• Query by wildcard
SELECT * FROM "demo_db"."iris_csv"
WHERE class LIKE '%setosa%';
• Get a count
SELECT count(*) AS COUNT FROM "demo_db"."iris_csv";
Reference:
https://s3.amazonaws.com/amazon-reviews-pds/readme.html
https://registry.opendata.aws/
Example Queries – Customer Review
• Highly Rated Books
SELECT product_title, star_rating, review_body
FROM "demo_db"."amazon_reviews_parquet"
WHERE product_category = 'Books'
AND star_rating > 3
LIMIT 10;
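The highly rated books query can also be submitted from code. The sketch below only builds the request; the helper names and the results location are hypothetical, and the dictionary follows the parameters that boto3's Athena `start_query_execution` accepts.

```python
def highly_rated_query(category, min_rating, limit=10):
    """Build the highly rated products query for one category."""
    min_rating = int(min_rating)
    limit = int(limit)
    if not 1 <= min_rating <= 5:
        raise ValueError("min_rating must be between 1 and 5")
    safe_category = category.replace("'", "''")  # escape single quotes
    return (
        "SELECT product_title, star_rating, review_body\n"
        'FROM "demo_db"."amazon_reviews_parquet"\n'
        f"WHERE product_category = '{safe_category}'\n"
        f"AND star_rating > {min_rating}\n"
        f"LIMIT {limit}"
    )


def athena_request(sql, database, output_location):
    """Build keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical results bucket, for illustration only.
request = athena_request(
    highly_rated_query("Books", 3),
    "demo_db",
    "s3://example-athena-results/",
)

# To run (needs boto3 and credentials, so it is left commented out):
# boto3.client("athena").start_query_execution(**request)
```

Athena writes the result set to the output location in S3, so that bucket must exist and be writable by the caller.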
Ingest: Kinesis Firehose
Store: S3
Query: Athena
Assess Sentiment: Lambda, Comprehend
Assess Sentiment
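A rough sketch of the sentiment step, as it might run inside a Lambda function. The function name and the stub client are hypothetical; the call mirrors Comprehend's `detect_sentiment`, which takes `Text` and `LanguageCode` and returns a `Sentiment` label. The client is passed in so the logic can be tried without AWS credentials.

```python
def assess_sentiment(reviews, comprehend, language_code="en"):
    """Label each review with the sentiment returned by the client."""
    results = []
    for text in reviews:
        response = comprehend.detect_sentiment(
            Text=text, LanguageCode=language_code
        )
        results.append({"text": text, "sentiment": response["Sentiment"]})
    return results


class StubComprehend:
    """Offline stand-in for the Comprehend client (illustration only)."""

    def detect_sentiment(self, Text, LanguageCode):
        # Very crude rule so the example runs without calling AWS.
        label = "POSITIVE" if "great" in Text.lower() else "NEUTRAL"
        return {"Sentiment": label}


labeled = assess_sentiment(
    ["Great book, could not put it down", "It arrived on Tuesday"],
    StubComprehend(),
)

# In a real Lambda, pass boto3.client("comprehend") instead of the stub.
```

Keeping the client as a parameter also makes it easy to swap in a batch call later if review volume grows.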