
Data Lake

Streamline Data Management


Chandra Lingam
Cloud Wave LLC

Copyright © 2019 ChandraMohan Lingam. All Rights Reserved.


Data Lake Vs Data Warehouse

"A data lake is a vast pool of raw data, the purpose for
which is not yet defined. A data warehouse is a repository
for structured, filtered data that has already been
processed for a specific purpose."

Reference: Talend, https://www.talend.com/resources/data-lake-vs-data-warehouse/


Data Lake Vs Data Warehouse

                 Data Lake                        Data Warehouse
Data Structure   Raw                              Processed, Structured and Well understood
Tools            In-place Querying, ad hoc        SQL, BI
Purpose of Data  Not yet determined               Currently In Use
Users            Data Scientists, Data Explorers  Business Professionals
Accessibility    Highly accessible and quick      More complicated and costly
                 to update                        to make changes

Source: https://www.talend.com/resources/data-lake-vs-data-warehouse/,
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

Data Lake

“A data lake is a centralized repository that allows you to
migrate and store all structured and unstructured data at
unlimited scale…”

Reference: AWS,
https://aws.amazon.com/products/storage/data-lake-storage/infographic/


AWS - Whitepaper

1. Storage
2. Governance
3. Analytics

Data Lake on AWS:
https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html


Storage

Service  Purpose               Use
S3       Storage               Object storage to store and retrieve any amount of data.
                               Cost effective with 99.999999999% (11 9s) of durability.
                               Object lifecycle management.
Glacier  Backup and Archiving  Backup and long-term archival (multi-year).
                               Extremely low cost and 11 9s durability.


Data Lake Storage

                        Hot                  Warm                      Cold
Storage class           S3 Standard          S3 Infrequent Access      Glacier
Cost - 500GB per month  USD 11.50            USD 6.25                  USD 2.00
Durability              99.999999999% (11 9’s) for all three classes
Suitable for            Frequently Accessed  Less Frequently Accessed  Long term archival
First byte latency      Immediate            Immediate                 Restore can take minutes to hours
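The 500 GB figures in the table can be scaled to other data volumes. The sketch below is only a rough illustration: it assumes price grows linearly with size and ignores request, retrieval and minimum-duration charges. The function name and the 10,000 GB example are hypothetical.

```python
# Prices for 500 GB per month, copied from the table above (USD).
PRICE_PER_500_GB = {
    "S3 Standard": 11.50,
    "S3 Infrequent Access": 6.25,
    "Glacier": 2.00,
}


def monthly_storage_cost(storage_class: str, size_gb: float) -> float:
    """Scale the 500 GB price linearly to size_gb (storage charge only)."""
    price_per_gb = PRICE_PER_500_GB[storage_class] / 500
    return round(price_per_gb * size_gb, 2)


# Compare the three tiers for a hypothetical 10,000 GB data lake.
for storage_class in PRICE_PER_500_GB:
    cost = monthly_storage_cost(storage_class, 10_000)
    print(f"{storage_class}: USD {cost:.2f} per month for 10,000 GB")
```

Under these assumptions the same data costs roughly half as much in Infrequent Access and under a fifth as much in Glacier, which is the motivation for tiering data by access pattern.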



Ingestion
Service               Purpose                             Use
Kinesis Firehose      Real-time Streaming Data Ingestion  Capture and deliver real-time streaming data
                                                          directly to S3, Redshift, Elasticsearch, Splunk
Storage Gateway       Hybrid Cloud Storage                Integrate legacy on-premises data processing
                                                          platforms to S3 Data Lake
Snowball, Snowmobile  Migration (Large scale)             Physically move petabytes to exabytes of data
                                                          to AWS cloud at 1/5th the cost of internet transfer
SDK, CLI and more     Custom Ingestion                    Easy to integrate with a variety of tools


Realtime Streaming Data
[Diagram: Kinesis Firehose -> Destination: S3, Redshift, Elasticsearch, Splunk]


Storage Gateway
[Diagram: On-premises -> Storage Gateway (File Gateway, Volume Gateway, Tape Gateway) -> S3 in the Cloud]


Snowball
[Diagram: On-premises -> Snowball Appliance -> S3 in the Cloud]

Image Credit: AWS, https://aws.amazon.com/snow/


Snowmobile
[Diagram: On-premises -> Snowmobile Container -> S3 in the Cloud]

Image Credit: AWS, https://aws.amazon.com/snow/


Ingestion

AWS SDK

AWS Command Line Tool

Third party tools



Data Catalog

Make data discoverable and usable

Track versions of changes

Queryable interface for all data assets



Data Catalog
Service         Purpose                     Use
Do-it-yourself  Comprehensive Data Catalog  Make data discoverable and usable.
                                            Use services like Lambda, Elasticsearch, DynamoDB
                                            to collect and maintain metadata.
Glue            Managed Data Catalog        Make data discoverable and usable.
                                            Automatically crawl and collect metadata from S3,
                                            DynamoDB and any other database that supports
                                            JDBC connectivity.


Catalog
Image Credit: HomeDepot
Image Credit: webhamster, flickr



Data Swamp

“A data swamp is a deteriorated and unmanaged data lake
that is either inaccessible to its intended users or is
providing little value”

Reference: Data Swamp, https://en.wikipedia.org/wiki/Data_lake


Do-it-yourself Data Catalog
[Diagram: App, Data Lake, Database -> Lambda -> Data Catalog (DynamoDB, Elasticsearch)]


Glue Data Catalog

[Diagram: Data Lake, DynamoDB, Relational Database -> Glue Crawler -> Glue Data Catalog; the catalog is used by Redshift, EMR, Glue ETL, Athena]


Data Formats
Popular Formats, Tools for Conversion



Data Formats

Variety of formats

With an optimal format, you can:
• Lower storage cost
• Improve query performance

Question: When and where to do the format conversion?


Data Formats

“One of the core values of a data lake is that it is the
collection point and repository for all of an organization’s
data assets, in whatever their native formats are”

Reference: Data Lake on AWS,
https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html


Data Formats

Collect data in native format

Transform data in data lake

Organize by row or column
• Row Store – Optimized for reading an entire row
• Column Store – Optimized for reading a subset of columns
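The difference between the two layouts can be shown with plain Python lists and dictionaries. This is a toy model with hypothetical sample data, not how Parquet or ORC are implemented: it only counts how many values each layout has to touch to total one column.

```python
# Three records with four fields each (hypothetical sample data).
rows = [
    {"id": 1, "name": "Ann", "dept": "Sales", "salary": 100},
    {"id": 2, "name": "Bob", "dept": "Sales", "salary": 90},
    {"id": 3, "name": "Cho", "dept": "IT", "salary": 120},
]


def total_salary_row_store(records):
    """Row store: records are kept whole, so a scan reads every field."""
    values_read = 0
    total = 0
    for record in records:
        values_read += len(record)  # the whole record is read
        total += record["salary"]
    return total, values_read


# Column store: each column is kept together as its own list.
columns = {field: [record[field] for record in rows] for field in rows[0]}


def total_salary_column_store(cols):
    """Column store: only the requested column is read."""
    salaries = cols["salary"]
    return sum(salaries), len(salaries)


print(total_salary_row_store(rows))        # (310, 12)
print(total_salary_column_store(columns))  # (310, 3)
```

Both layouts return the same total, but the column layout reads 3 values instead of 12. Reading one complete record is the opposite case: the row layout has it in one place, while the column layout must visit every column.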



Text Data Formats
Format            Organization  Use
CSV, TSV          Row           Easy to use
                                No data type support
                                Duplication with hierarchies - For example, in an
                                employee-department CSV file, department information
                                is duplicated for every employee
                                Not optimized for reading only specific columns
JSON, JSON Lines  Row           Format of choice for communication between web services
                                Supports data types
                                Efficiently represent hierarchical data
                                JSON Lines – A record is stored in a line
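The points in the table about data types and duplication can be checked with the Python standard library. The employee-department data below is hypothetical, built only to mirror the example in the table.

```python
import csv
import io
import json

# Employee-department data as CSV: the department columns repeat on every
# employee row, and every value is read back as a string.
csv_text = (
    "employee,salary,dept_name,dept_location\n"
    "Ann,100,Sales,New York\n"
    "Bob,90,Sales,New York\n"
)
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# The same data as JSON Lines: one record per line, the department stored
# once with its employees nested inside, and numbers kept as numbers.
json_lines_text = (
    '{"dept_name": "Sales", "dept_location": "New York", '
    '"employees": [{"name": "Ann", "salary": 100}, {"name": "Bob", "salary": 90}]}\n'
)
json_records = [json.loads(line) for line in json_lines_text.splitlines()]

print(type(csv_rows[0]["salary"]).__name__)                      # str
print(type(json_records[0]["employees"][0]["salary"]).__name__)  # int
```

Because each JSON Lines record sits on its own line, a file can be split at line boundaries and processed in parallel, which a single large JSON document does not allow.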


Binary Data Formats
Format   Organization  Use
Parquet  Columnar      Ideal for use cases that require only a subset of columns
                       Efficiently query large amounts of data
                       Write Once Read Many (WORM)
                       Compressed Storage
                       Extensive Tool Support
                       Data Type Support
                       Reduce storage footprint, improve query performance and
                       lower query cost

Parquet Performance:
https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/monitoring-optimizing-data-lake-environment.html

Binary Data Formats
Format  Organization  Use
ORC     Columnar      Like Parquet
Avro    Row           Ideal for write-heavy use cases
                      Efficiently read the entire record
                      Data Type Support


Data Transformation

Collect in native format

Transform in data lake



Data Transformation

Service     Purpose          Use
Amazon EMR  Data Processing  Managed Hadoop environment
                             Support for tools like Spark, Hive, HBase
                             Support for ML tools like TensorFlow and MXNet

List of tools:
https://aws.amazon.com/emr/features/


Amazon EMR – Format Conversion

Source Format -> Hive (EMR) -> Target Format

Source Format -> Spark (EMR) -> Target Format


Data Transformation

Service  Purpose      Use
Glue     Managed ETL  Automatically Generate ETL Scripts
                      Schedule and Run in Spark
                      Support for Scala and Python


Glue ETL – Generate and Run Script

[Diagram: Source -> Glue -> Target; Glue generates the ETL script and runs it in Spark (EMR)]

Data Transformation

Service           Purpose                        Use
Kinesis Firehose  Streaming Data Transformation  Transform streaming data to Parquet, ORC formats
                                                 Deliver transformed data to AWS Data Stores
                                                 Optionally, back up original data to S3


Stream and Batch Processing



Streaming Data
Generated Continuously

Thousands of sources

Small Payloads



Batch Processing

[Diagram: Storage -> EMR -> Business Intelligence, Machine Learning, Analysis]


Batch Processing Use Cases

Utility bill generation

Daily, monthly manufacturing reports



Stream Processing

[Diagram: Ingest -> Process -> Response]

Latency in Seconds or Minutes


Amazon Kinesis
Collect, Process, Analyze Streaming Data



Amazon Kinesis

“Amazon Kinesis enables you to ingest, buffer and process
streaming data in real-time”

“you can derive insights in seconds or minutes.”

“Handle any amount of streaming data from hundreds of
thousands of sources with very low latencies”

Reference: Amazon Kinesis, https://aws.amazon.com/kinesis/
Kinesis Family
Service         Purpose                           Use
Video Streams   Capture and Analyze Video Stream  Security Monitoring, Video Playback, Face detection
Data Streams    Capture and Analyze Data Stream   Custom real-time application
Firehose        Capture and Deliver Data Stream   Use Existing BI tools for Streaming Data:
                to AWS Data Stores                S3, Redshift, Elasticsearch, Splunk
Data Analytics  Analyze Data Stream with SQL      Real-time analytics, Anomaly detection
                and Java

Kinesis Video Streams

[Diagram: Kinesis Video Stream -> Video Playback, Monitoring, Rekognition, ML]


Kinesis Data Streams

Custom real-time applications

[Diagram: Kinesis Data Stream -> Kinesis Data Analytics, EMR, EC2, Lambda]


Kinesis Data Firehose

Use existing tools to analyze streaming data

[Diagram: Kinesis Firehose -> S3, Redshift, Elasticsearch, Splunk]


Kinesis Data Analytics

Use SQL to analyze data streams

[Diagram: Kinesis Data Streams, Kinesis Firehose -> Kinesis Data Analytics (SQL) -> Kinesis Firehose -> AWS Data Stores]


In-place Querying
• Directly query data in S3 using SQL
• Athena, Redshift Spectrum



Athena In-place Query

[Diagram: SQL -> Athena -> S3 Data Lake]

• Directly run SQL query against files in S3
• No need to provision servers (serverless)
• Charges based on amount of data scanned
• Support for popular file formats: CSV, JSON, Parquet, ORC, Avro
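Because charges follow the amount of data scanned, the file format affects the bill. The sketch below is a back-of-the-envelope estimate only. The price per terabyte is an assumed placeholder (check current Athena pricing), the 1 TB table with 10 equally sized columns is hypothetical, and compression is ignored.

```python
# ASSUMPTION: illustrative price per terabyte scanned, not an official figure.
ASSUMED_PRICE_PER_TB_USD = 5.00


def query_cost(tb_scanned: float, price_per_tb: float = ASSUMED_PRICE_PER_TB_USD) -> float:
    """Estimated charge for one query, based only on data scanned."""
    return round(tb_scanned * price_per_tb, 2)


# Hypothetical table: 1 TB in total, 10 equally sized columns,
# and a query that references 2 of them.
table_size_tb = 1.0
row_format_scan_tb = table_size_tb         # CSV/JSON: whole files are read
columnar_scan_tb = table_size_tb * 2 / 10  # Parquet/ORC: only referenced columns

print(query_cost(row_format_scan_tb))  # cost when every byte is scanned
print(query_cost(columnar_scan_tb))    # cost when only 2 of 10 columns are scanned
```

Under these assumptions the columnar query scans one fifth of the data and costs one fifth as much, before any savings from compression.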

“This makes vast amount of unstructured data accessible to any data lake user
who can use SQL.”

Reference: Data Lake on AWS,
https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html

Redshift Spectrum In-place Query

[Diagram: BI Tools -> Redshift Cluster -> Redshift Spectrum -> S3 Data Lake]

• Sophisticated Query Optimization
• Distribute query across multiple nodes
• Redshift Data Warehouse SQL Syntax
• Use with existing BI tools
• Query can span Redshift Tables and S3 Data Lake


AWS Recommendations

Athena
• Ad-hoc data discovery and SQL querying

Redshift Spectrum
• More complex queries
• Large number of users



Streaming Query

[Diagram: Kinesis Data Streams, Kinesis Firehose -> Kinesis Data Analytics (SQL) -> Result Destination]

• Use SQL to query streaming data
• Continuously running query
• Sends matching results to configured destination
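The idea of a continuously running query can be sketched in plain Python. This is a conceptual model only: it uses a generator and a predicate in place of Kinesis Data Analytics SQL, and the sensor records, threshold and destination list are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def continuous_query(
    stream: Iterable[dict], predicate: Callable[[dict], bool]
) -> Iterator[dict]:
    """Evaluate the predicate on each record as it arrives and emit matches."""
    for record in stream:
        if predicate(record):
            yield record


def deliver(results: Iterable[dict], destination: List[dict]) -> None:
    """Stand-in for the configured destination: append each result to a list."""
    for record in results:
        destination.append(record)


# Hypothetical sensor readings arriving one at a time.
readings = iter([
    {"sensor": "a", "temperature": 72},
    {"sensor": "b", "temperature": 91},
    {"sensor": "a", "temperature": 85},
])

alerts: List[dict] = []
deliver(continuous_query(readings, lambda r: r["temperature"] > 80), alerts)
print(alerts)
```

The query never sees the whole data set at once. It handles one record at a time, which is what lets the same logic run against a stream that has no end.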



Analytics Tools
• Data Lake needs to support current and future tools
• S3 is a popular cloud service
• Several third-party tools natively support S3



Broader Analytics Portfolio
Service                  Purpose                 Use
Amazon EMR               Hadoop Ecosystem tools  Process data in S3 using Spark, Hive, Pig, HBase,
                                                 TensorFlow, MXNet and so forth
SageMaker                Machine Learning        Train models with data in S3
                                                 Generate real-time and batch predictions
Artificial Intelligence  Video, Image, Natural   Analyze audio, video, image, text data in S3
                         language processing


Broader Analytics Portfolio
Service     Purpose                  Use
QuickSight  Business Intelligence    Create Interactive Dashboard
                                     Supports Athena, Redshift, Relational database
Redshift    Data warehouse           Petabyte Scale
            (Columnar Storage)       Load data to tables from S3 - local querying
                                     Query S3 directly using Redshift Spectrum
Lambda      Business Logic           Serverless code execution
            (Function as a service)  Trigger-based function invocation


Monitoring and Optimization



Monitoring
Service          Purpose         Use
CloudWatch       Monitoring      Monitor your resources
                                 Configure Alarms to alert
                                 Take automated action
CloudWatch Logs  Log Monitoring  Consolidate log files and monitor
CloudTrail       Audit Trail     Log all activities and who performed those actions
                                 Useful for investigation, compliance monitoring


CloudWatch Logs

[Diagram: Logs -> CloudWatch Logs -> CloudWatch Metric]

• Consolidate Logs
• Monitor


CloudTrail

[Diagram: API Calls -> AWS CloudTrail -> Log -> Athena (SQL)]

• Log all activities and who performed those actions
• Useful for investigation, compliance monitoring


Optimization

“Data storage is often a significant portion of the costs
associated with a data lake.”

Reference: Data Lake on AWS,
https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html


Cost Optimization
S3 Lifecycle Management

S3 Storage Class Analysis

Intelligent Tiering

Glacier and Glacier Deep Archive

Data Formats

S3 Lifecycle Management

                        Hot                  Warm                  Cold
Storage class           S3 Standard          S3 Infrequent Access  Glacier
Cost - 500GB per month  USD 11.50            USD 6.25              USD 2.00
Retrieval Fee           -                    Per GB                Per GB
Suitable for            Frequently Accessed  Rarely Accessed       Archival and Backup
First byte latency      Immediate            Immediate             Restore can take minutes to hours

Lifecycle Storage Tiering and Expiration

Object Age

Name and Folder Structure

S3 Object Tags
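A lifecycle rule combines these three selectors: object age drives the transitions and expiration, while a key prefix (folder structure) and object tags decide which objects the rule applies to. The sketch below builds such a configuration as a plain dictionary, in the shape the S3 lifecycle configuration API accepts (for example through boto3's `put_bucket_lifecycle_configuration`). No AWS call is made here. The rule IDs, prefixes, tag and day counts are hypothetical.

```python
import json


def lifecycle_rule(rule_id, prefix, tag=None, ia_after_days=30,
                   glacier_after_days=90, expire_after_days=365):
    """Build one lifecycle rule: tier by object age, select by prefix and tag."""
    if tag is None:
        rule_filter = {"Prefix": prefix}
    else:
        key, value = tag
        # A filter that combines a prefix with tags must be wrapped in "And".
        rule_filter = {"And": {"Prefix": prefix, "Tags": [{"Key": key, "Value": value}]}}
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": rule_filter,
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


lifecycle_configuration = {
    "Rules": [
        lifecycle_rule("tier-logs", "logs/"),
        lifecycle_rule("tier-tagged-raw", "raw/", tag=("retention", "short"),
                       expire_after_days=180),
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

Keeping the rule in a function makes the age thresholds explicit and easy to adjust once Storage Class Analysis shows how the data is actually accessed.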



Storage Class Analysis

“One of the challenges of developing and configuring
lifecycle rules for the data lake is gaining an understanding
of how data assets are accessed over time.”

Reference: Data Lake on AWS,
https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html


Storage Class Analysis

“This new Amazon S3 analytics feature observes data access patterns
to help you determine when to transition less frequently accessed
STANDARD storage to the STANDARD_IA storage class”

Reference: S3,
https://docs.aws.amazon.com/AmazonS3/latest/dev/analytics-storage-class.html


S3 Intelligent Tiering

Objects are automatically moved between the frequent access
and infrequent access storage classes:
• Frequent Access -> Infrequent Access: object not accessed for 30 days
• Infrequent Access -> Frequent Access: object accessed
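The two transitions can be written as a small simulation. This is a simplified model of the behaviour described on the slide, not the service's actual implementation: it treats the upload on day 0 as an access and assumes an object is demoted once it has gone 30 full days without one.

```python
FREQUENT = "Frequent Access"
INFREQUENT = "Infrequent Access"
IDLE_DAYS_BEFORE_DEMOTION = 30  # threshold taken from the slide


def access_tier(days_since_last_access: int) -> str:
    """Tier of an object given how long it has gone without being accessed."""
    if days_since_last_access >= IDLE_DAYS_BEFORE_DEMOTION:
        return INFREQUENT
    return FREQUENT


def simulate(access_days, horizon_days):
    """Return the tier on each day, given the set of days with an access."""
    tiers = []
    last_access = 0  # the upload on day 0 counts as an access
    for day in range(horizon_days):
        if day in access_days:
            last_access = day
        tiers.append(access_tier(day - last_access))
    return tiers


# Hypothetical object: uploaded on day 0 and read once, on day 45.
tiers = simulate({45}, 50)
print(tiers[29], "|", tiers[30], "|", tiers[45])
```

The object stays in the frequent tier through day 29, sits in the infrequent tier from day 30, and returns to the frequent tier when it is read on day 45.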



Glacier, Glacier Deep Archive
Service               Purpose             Use
Glacier               Archive and Backup  Cost: USD 2.00 for 500 GB/Month
                                          Durability: 11 9’s
                                          Retrieval Time: Minutes to Hours
                                          Vault Lock to prevent future edits
Glacier Deep Archive  Archive and Backup  Cost: USD 0.50 for 500 GB/Month
                                          Durability: 11 9’s
                                          Retrieval Time: 12 to 48 hours
                                          Vault Lock to prevent future edits


Security and Protection
• Data Lake is centralized
• Consolidates all data in one place
• Protecting and managing data is very important



S3 Access Management

Resource-based Policy

User-based Policy



S3 Resource Based Policy

Bucket

Object

Permissions are embedded as part of Bucket and Object


S3 User Based Policy

Bucket

Object

Permissions are granted to Users and Groups


S3 User and Resource Based Policy

Bucket

Object

Deny all access that does not originate from on-premises
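A rule like this is usually written as a resource-based bucket policy with an explicit deny on source IP address. The sketch below builds such a policy document as a Python dictionary. The bucket name and the CIDR range (a reserved documentation range) are hypothetical, and a real policy should be tested carefully because an explicit deny also applies to administrators.

```python
import json


def deny_outside_network_policy(bucket_name, allowed_cidrs):
    """Bucket policy that denies every request not coming from allowed_cidrs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessFromOutsideOnPremises",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself
                    f"arn:aws:s3:::{bucket_name}/*",  # every object in it
                ],
                # The deny applies when the caller's IP is NOT in the list.
                "Condition": {"NotIpAddress": {"aws:SourceIp": allowed_cidrs}},
            }
        ],
    }


policy = deny_outside_network_policy("example-data-lake", ["203.0.113.0/24"])
print(json.dumps(policy, indent=2))
```

User-based policies still decide who may do what. The bucket policy adds a network boundary on top, and an explicit deny takes precedence over any allow.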


S3 Data Encryption – Default Key

[Diagram: Bucket, Object, KMS Default Key]

With the default key, S3 automatically decrypts an object for any user who is allowed
access to the bucket or object

S3 Data Encryption – Customer Master Key (CMK)
[Diagram: Bucket, Object, KMS Default Key, KMS Customer Master Key]

A Customer Master Key provides an additional layer of security and control


S3 Client-Side Encryption – Customer Master Key (CMK)
Object encryption and decryption is the client's responsibility

[Diagram: Client, KMS Customer Master Key, Object, Encrypted Object, Bucket]
Protection

“A data lake must protect data against corruption, loss,
accidental or malicious overwrites, modifications, and
deletions.”

Reference: Data Lake on AWS,
https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html


S3 Durability

S3 Durability 99.999999999% (11 9’s)

Measure of protection against data loss and corruption

[Diagram: items 1 to 6 distributed across AZ 1, AZ 2 and AZ 3]
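To get a feel for what 11 9's means, the figure can be turned into an expected loss rate. The sketch below assumes the durability figure is read as the probability that a single object survives a given year, and that objects are lost independently. The object count of 10 million is a hypothetical example.

```python
from fractions import Fraction

# 99.999999999% written as an exact fraction to avoid floating-point rounding.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)
ANNUAL_LOSS_PROBABILITY = 1 - DURABILITY  # 1 in 100 billion per object per year


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Average number of objects lost per year for a store of this size."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count)


print(years_per_lost_object(10_000_000))  # 10000
```

Under these assumptions, a data lake holding 10 million objects would expect to lose one object about once every 10,000 years. Durability covers loss and corruption inside the service. It does not cover accidental deletes or overwrites, which is what versioning and replication address.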



S3 Versioning

Protection against accidental and malicious deletes

S3 maintains versions of objects (for example A.1, A.2, A.3)

Configure Lifecycle rules for current and previous versions

Multi-Factor Authentication (MFA) for additional layer of
authentication


S3 Cross Region Replication (CRR)

[Diagram: Source Bucket (Region 1) -> Replicated Bucket (Region 2)]

Replicate S3 bucket in another region for Disaster Recovery

Automatic and continuous replication

Deletes are not replicated


S3 Object Tagging
Tags are additional metadata that you can add to an Object
Define access control policies based on tags

[Diagram: Object tagged Classification=PHI; one policy ALLOWs Classification=PHI, another DENYs Classification=PHI]
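Tag-based access control is expressed with a policy condition on the object's existing tags. The sketch below builds an allow policy and a deny policy for objects tagged Classification=PHI, using the `s3:ExistingObjectTag/<key>` condition key. The bucket name is hypothetical, and which users or groups each policy is attached to is left open.

```python
import json


def tag_based_statement(effect, bucket_name, tag_key, tag_value):
    """Policy statement that matches objects carrying the given tag."""
    return {
        "Effect": effect,
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket_name}/*",
        "Condition": {
            "StringEquals": {f"s3:ExistingObjectTag/{tag_key}": tag_value}
        },
    }


def policy_document(statement):
    """Wrap a single statement in a policy document."""
    return {"Version": "2012-10-17", "Statement": [statement]}


# Policy for users who may read PHI objects.
allow_phi = policy_document(
    tag_based_statement("Allow", "example-data-lake", "Classification", "PHI")
)
# Policy for users who must never read PHI objects.
deny_phi = policy_document(
    tag_based_statement("Deny", "example-data-lake", "Classification", "PHI")
)

print(json.dumps(allow_phi, indent=2))
print(json.dumps(deny_phi, indent=2))
```

Since the decision depends on the tag and not the object key, newly classified data is covered as soon as it is tagged, with no change to the folder structure.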

Security and Protection

AWS and S3 provide several features to secure and
protect your data

As part of the Shared Responsibility Model, customers are
responsible for configuring these security features
according to their organization's needs



Summary
S3 Data Lake Architecture provides a template on how to
design and run a data lake for your organization
• Ingest and Store Data
• Discover and Make data usable
• Transform data
• Analyze data in-place
• Future proofing
• Monitor
• Optimize
• Security and Protection



AWS Housekeeping
Account Setup, Support

Chandra Lingam
Cloud Wave LLC



Hands-on Experience

“Gain free, hands-on experience with the AWS platform,
products, and services.”

https://aws.amazon.com


Three Types of Offers

• Always Free
• 12 months free
• Trials

https://aws.amazon.com/free
https://aws.amazon.com/free/free-tier-faqs/


Billing

You are billed standard pay-as-you-go rates when:
• Usage exceeds free tier limits or
• Term expires

AWS requires a credit or debit card to sign up for an
account


Billing

Billing Alerts

Dashboard
• Free-Tier usage
• Monthly Charge Summary
• Itemized charges
• Past Bills and Usage



Free Support Center

• Account Issues
• Billing Enquiries
• Service Limit Changes

• Technical Support – Part of paid plans


https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html
https://docs.aws.amazon.com/general/latest/gr/sagemaker.html


Billing Alert - Best Practices

• Enable billing access to authorized users in your
account
• Configure Free Tier Alerts
• Enable billing data collection for CloudWatch monitoring
• Configure Billing Alarms with CloudWatch
• Configure AWS Budget


User Accounts
Account/User         Purpose
Root Account         Responsible for paying bills. Sign-in at
(Highest Privilege)  https://aws.amazon.com/
                     Enable MFA
my_admin             IAM User with administrative access
                     Sign-in Link
                     https://<AccountId>.signin.aws.amazon.com/console
                     https://<Alias>.signin.aws.amazon.com/console


MFA Setup

Recommended for root account

Login credentials + one-time passwords


• Google Authenticator App or similar
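Authenticator apps of this kind generate time-based one-time passwords (TOTP, RFC 6238): the app and the server share a secret and each derives a short code from it and the current 30-second time window. The sketch below is a minimal standard-library implementation for illustration only. The secret is a hypothetical placeholder, and a real MFA device is registered through the AWS console and never needs to be coded by hand.

```python
import hashlib
import hmac
import struct
import time


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238, HMAC-SHA1 variant)."""
    counter = unix_time // step  # index of the current time window
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F   # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# Hypothetical shared secret; real secrets are provisioned when MFA is set up.
shared_secret = b"12345678901234567890"
print(totp(shared_secret, int(time.time())))
```

A stolen password alone is not enough to sign in, because the code changes every 30 seconds and can only be produced by a device that holds the shared secret.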



Summary

Account Setup

Types of free offers

Billing Dashboard and Alerts

Support



Lab – AWS Account Setup

Create AWS Account

Configure user and permission



Lab – S3

• Storage Class
• S3 Versioning
• Age Based Retention
• Storage Tiering
• Replication
• Encryption with SSE-S3 and KMS



Lab – Glue Data Catalog and Athena

In-place Querying of files stored in S3


• Store file in S3
• Collect metadata with Glue Crawler
• Run Query using Athena



Example Queries (Lab)
• Query first 10 rows
SELECT * FROM "demo_db"."iris_csv" limit 10;

• Query for a specific class
SELECT * FROM "demo_db"."iris_csv"
WHERE class = 'Iris-setosa';

• Query by wildcard
SELECT * FROM "demo_db"."iris_csv"
where class like '%setosa%';

• Get a count
SELECT count(*) AS COUNT FROM "demo_db"."iris_csv";

• Compute new columns
SELECT sepal_length, sepal_width,
sepal_length * sepal_width as sepal_area
FROM "demo_db"."iris_csv";


Lab – Glue ETL

Use Glue ETL to convert files to Parquet format
• Glue automates the process of ETL script generation,
scheduling and execution
• Glue ETL provisions the required Apache Spark
infrastructure to run the job


Example Queries - Parquet (Lab)
• Query Iris Parquet Table
SELECT sepal_length, sepal_width,
sepal_length * sepal_width as sepal_area
FROM "demo_db"."iris_parquet" limit 10;



Lab – Customer Review

Query Amazon Customer Reviews Public Dataset using Athena
• Create table definition (instead of using Glue Crawler)
• Update catalog with partition
• Query using Athena

Reference:
https://s3.amazonaws.com/amazon-reviews-pds/readme.html
https://registry.opendata.aws/

Example Queries – Customer Review
• Highly Rated Books
SELECT product_title, star_rating, review_body
FROM "demo_db"."amazon_reviews_parquet"
WHERE product_category = 'Books'
and star_rating > 3
limit 10;

• Book Reviews for specified book title pattern
SELECT product_title, star_rating, review_body
FROM "demo_db"."amazon_reviews_parquet"
WHERE product_category = 'Books'
and product_title like 'Harry Potter%'
and star_rating > 3
limit 100;


Lab – Sentiment of the Customer Review

Find Sentiment of the customer review using Comprehend AI Service

With Athena, Query the reviews using sentiment


Lab – Serverless Customer Review Solution

[Diagram: Ingest -> Store -> Assess Sentiment -> Query]


Lab – Serverless Customer Review Solution

[Diagram: Ingest (Kinesis Firehose) -> Store (S3) -> Query (Athena); Assess Sentiment (Lambda, Comprehend)]
