Modernize Your Analytics and Data Architecture

Sriram Kuravi
Partner Solutions Architect

© 2019, Amazon Web Services, Inc. or its Affiliates.


Agenda

• Why a Data Lake Architecture
• Components of Data Lake on AWS
  • Ingestion
  • Storage
  • Transformation
  • Analytics & Querying
• Best practices and Common Architectures
• Modern Data Architecture


To create value, companies must derive real-time insights
from a variety of data sources that are
producing data at high volume and velocity

… but realizing value from data is challenging

What’s holding you back from using data?

• Unable to link data together: 96% of organizations state data is not used
• Data collected too infrequently: 39% of organizations do not regularly collect data
• Data difficult to access: 66% of organizations find data is difficult to access

Without the right platform, data insights remain elusive


Why Data Lakes?

Data loses value over time

[Chart: value of data to decision-making versus age of data (real time, seconds, minutes, hours, days, months). Value declines from preventive/predictive and actionable (time-critical decisions) to reactive and historical (traditional “batch” business intelligence).]
Source: Mike Gualtieri, Forrester, Perishable insights

Why Data Lakes (cont’d)
Increase the speed at which information is curated, added to the platform, and made accessible
to derive business value.

Data Storage: Collects everything at scale and at low cost
Data Catalog: Quickly search and find the relevant data
Access Controls: Enforce security to protect data stored in the central repository

• Democratize data access to accelerate more insights
• Easily perform new types of data analysis and data science
• Query the data by defining the data’s structure at the time of use


Challenges to making a secure data lake
(Data lake infrastructure & management)

Typical steps of building a data lake

1 Set up storage
2 Move data
3 Cleanse, prep, and catalog data
4 Configure and enforce security and compliance policies
5 Make data available for analytics

ROLE | PRIORITIES | NEEDS

Data Scientist | Makes sense of data, generates and communicates insights to improve or create business processes, creates predictive ML models to support them | Ad hoc querying, robust ML tools

Data Engineer | Builds scalable pipelines, transforms and loads data into structures complete with metadata that can be readily consumed by data scientists | Ad hoc querying, quick visualization

Data Product Manager | Manages data as a product. Ensures freshness and consistency of data; understands lineage and compliance needs; treats data scientists as customers | Reports (data quality, errors)

DevOps Engineer | Monitoring for reliability, quickly diagnose deployment or availability issues | Ad hoc querying, dashboards

Data Visualizer | Creating engaging visual and narrative journeys for analytical solutions | Visualization, dashboards, reporting

Business Sponsor | Vetting the prioritization and ROI, funding projects, providing ongoing feedback | Reporting, dashboards


Why choose AWS for data lakes and analytics?

• Easiest to build data lakes and analytics
• Most secure infrastructure for analytics
• Most comprehensive and open
• Most scalable and cost effective

Data lakes and analytics components

Data, visualization, engagement, & machine learning
Data | Dashboards | Digital User Engagement | Predictive Analytics

Analytics
Data Warehousing | Big Data Processing | Serverless Data processing | Interactive Query | Operational Analytics | Real time Analytics

Data lake infrastructure & management
Infrastructure | Security & Management | Data Catalog & ETL

Data movement
Migration & Streaming Services


The AWS analytics portfolio

Data, visualization, engagement, & machine learning
Data Exchange (NEW) | QuickSight | Pinpoint | SageMaker | Comprehend | Lex | Polly | Rekognition | Translate | + many more

Analytics
Redshift | EMR (Spark & Hadoop) | AWS Glue (Spark & Python) | Athena | Elasticsearch Service | Kinesis Data Analytics

Data lake infrastructure & management
S3/Glacier | Lake Formation | AWS Glue

Data movement
Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Managed Streaming for Apache Kafka

Most ways to move data to the data lake
(Data movement)

Data movement from your on-premises datacenters
• Dedicated network connection
• Secure appliances
• Ruggedized shipping containers
• Database migration
• Gateway that lets applications write to the cloud
• Synchronizing data across environments

Data movement from real-time sources
• Connect devices to AWS
• Real-time data streams
• Real-time video streams

Data lake: Amazon S3 | Amazon Glacier | AWS Glue

Professional services and partners to help migration

Real-time streaming on AWS
Easily collect, process, and analyze data and video streams in real time

• Amazon Kinesis Data Streams: Collect and store data streams for analytics
• Amazon Kinesis Data Firehose: Load data streams into AWS data stores
• Amazon Kinesis Data Analytics: Analyze data streams with SQL or Java
• Amazon Managed Streaming for Apache Kafka: Collect and store data streams for analytics
• Amazon Kinesis Video Streams: Capture, process, and store media streams for playback and analytics


Data streams are for real time

Producer writes to
a partition

Consumer reads
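
As a rough illustration of "producer writes to a partition": Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash-key range contains that value. The sketch below assumes the hash space is split evenly across shards, which is a simplification; the function name and device keys are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is a 128-bit integer


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: the hash space is divided evenly across shards. Real
    streams can have uneven ranges after resharding.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


if __name__ == "__main__":
    # Records with the same partition key always land on the same shard,
    # which is what preserves per-key ordering for consumers.
    for device in ("device-1", "device-2", "device-3"):
        print(device, "-> shard", shard_for_key(device, shard_count=4))
```

Choosing a high-cardinality partition key (for example a device ID) spreads writes across shards, while a constant key would send every record to a single shard.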



Streaming data architecture

Data streaming technology enables customers to ingest, process, and analyze
high volumes of high-velocity data from a variety of sources in real time

• Data sources: Devices and/or applications that produce real-time data at high velocity
• Stream ingestion: Data from tens of thousands of data sources can be written to a single stream
• Stream storage: Data is stored in the order in which it was received for a set duration, and it can be replayed indefinitely during this time
• Stream processing: Records are read in the order in which they are produced, enabling real-time analytics or streaming ETL
• Destination: Data lakes, databases, and analytics services


Stream ingestion
Data from tens of thousands of data sources can be written to a single stream

AWS toolkits/libraries
• AWS SDK
• Amazon Kinesis Producer Library (KPL)
• AWS Mobile SDK
• Amazon Kinesis Agent

AWS service integrations
• AWS IoT Core
• Amazon CloudWatch Logs
• Amazon CloudWatch Events
• AWS Database Migration Service (AWS DMS)*

Third-party offerings
• Log4j
• Flume
• Fluentd

* AWS DMS sources include eight on-premises databases, one Azure database, five Amazon
RDS/Amazon Aurora database types, and Amazon Simple Storage Service (Amazon S3)


Thomson Reuters: Real-time dashboards

Thomson Reuters provides professionals with the intelligence, technology, and human
expertise they need to find trusted answers


Why Amazon S3 for a Data Lake?

Durable
• Designed for 11 9s of durability

Available
• Designed for 99.99% availability

High performance
• Multipart upload
• Range GET

Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies

Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments

Integrated
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB
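
To make the Range GET point concrete, the sketch below splits an object into byte ranges that could be fetched in parallel. It only builds the HTTP `Range` header values; the helper name and the part size are assumptions chosen for illustration.

```python
from typing import List


def byte_ranges(object_size: int, part_size: int) -> List[str]:
    """Split an object into HTTP Range header values of at most part_size bytes."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, object_size, part_size):
        # HTTP byte ranges are inclusive on both ends.
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


if __name__ == "__main__":
    print(byte_ranges(10, 4))  # ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```

Each value can be passed as the `Range` argument of the boto3 S3 client's `get_object` call, so several workers can read different parts of one large object concurrently.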


Sunflower Genomics Research
Botany department at UBC and UBC Data Science
Institute

Challenge
2048 core SGI mainframe pain points:
• Complex - required intricate job orchestration for
replication and distribution of processes
• Reliability issues – 12% jobs timed out/failed
• Timeliness of analytics – 40 core-years for upgrades;
jobs could take 2 weeks before execution begins

Solution
Deploying a data lake on Amazon S3 with 100TB of data.
Using containers, serverless, and Amazon EventBridge
for monitoring and usage report

Benefit
Improved insight into their research, accuracy in
predicting costs for resource requirements, reduction in
time to science and cost

Processing and Querying In Place

User-Defined Functions (Lambda Function)
• Bring your own functions & code
• Execute without provisioning servers

Fully Managed Process & Query
• Catalog, Transform, & Query Data in Amazon S3
• No physical instances to manage


Amazon S3 Select and Amazon Glacier Select

Select subset of data from an object based on a SQL expression



Amazon S3 Select: Serverless MapReduce

Before: 200 seconds and 11.2 cents

    import boto3

    s3_client = boto3.client('s3')
    line_count = 0

    # Download and process all keys
    for key in src_keys:
        response = s3_client.get_object(Bucket=src_bucket, Key=key)
        # The body is bytes; decode before splitting into lines
        contents = response['Body'].read().decode('utf-8')
        for line in contents.split('\n')[:-1]:
            line_count += 1
            try:
                data = line.split(',')
                srcIp = data[0][:8]
                ….

After: 95 seconds and 2.8 cents

    # Select IP address and keys
    for key in src_keys:
        response = s3_client.select_object_content(
            Bucket=src_bucket, Key=key,
            ExpressionType='SQL',
            Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object AS obj",
            InputSerialization={'CSV': {}},
            OutputSerialization={'CSV': {}})
        # Results arrive as an event stream, not as a single body
        for event in response['Payload']:
            if 'Records' in event:
                contents = event['Records']['Payload'].decode('utf-8')
                for line in contents.split('\n')[:-1]:
                    line_count += 1
                    try:
                        ….

About 2x faster at 1/4 of the cost


Choosing the Right Data Formats
There is no such thing as the “best” data format
• All involve tradeoffs, depending on workload & tools
• CSV, TSV, JSON are easy, but not efficient
  • Compress & store/archive as raw input
• Columnar compressed formats are generally preferred
  • Parquet or ORC
  • Smaller storage footprint = lower cost
  • More efficient scan & query
• Row-oriented formats (Avro) are good for full data scans

Key considerations are cost, performance & support


Choosing the Right Data Formats (cont’d)
Pay by the amount of data scanned per query

Use Compressed Columnar Formats
• Parquet
• ORC

Easy to integrate with wide variety of tools

Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75
Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
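
The cost column follows directly from paying per amount of data scanned. The sketch below reproduces it, assuming a rate of $5 per TB scanned (the rate implied by $5.75 for 1.15 TB in the comparison above) and 1 TB = 1000 GB; both are assumptions, and actual pricing may differ.

```python
PRICE_PER_TB_USD = 5.00  # assumed rate, implied by $5.75 for 1.15 TB scanned


def scan_cost_usd(tb_scanned: float, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost of a query that scans the given number of terabytes."""
    return tb_scanned * price_per_tb


text_cost = scan_cost_usd(1.15)            # logs stored as text files
parquet_cost = scan_cost_usd(2.69 / 1000)  # 2.69 GB scanned in Parquet

if __name__ == "__main__":
    print(f"text files: ${text_cost:.2f}")   # $5.75
    print(f"Parquet:    ${parquet_cost:.3f}")  # $0.013
```

The saving comes from two effects that compound: compression shrinks the stored bytes, and the columnar layout lets the engine read only the columns a query references.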


Data Prep is ~80% of Data Lake Work

[Chart: share of work by task]
• Building training sets
• Cleaning and organizing data
• Collecting data sets
• Mining data for patterns
• Refining algorithms
• Other


Transforming Data

Over 90% of ETL jobs in the cloud are hand-coded

Which is good …
• Flexible
• Powerful
• Unit Tests
• CI/CD
• Developer Tools …

… but also bad!
• Brittle
• Error-Prone
• Laborious
• Sources Change
• Schemas Change
• Volume Changes
• EVERYTHING KEEPS CHANGING !!!


AWS Glue: Serverless Data Catalog & ETL

Data Catalog: Discover data and extract schema
• Automatically discovers data and stores schema
• Data searchable, and available for ETL

ETL Job authoring: Auto-generates customizable ETL code in Python and Spark
• Generates customizable code

Serverless
• Schedules and runs your ETL jobs


AWS Glue: Components

Data Catalog
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Amazon Athena, Amazon Redshift Spectrum, Amazon EMR

Job Authoring
• Auto-generates ETL code
• Built on open frameworks – Python/Scala and Apache Spark
• Developer-centric – editing, debugging, sharing

Job Execution
• Run jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring and alerting


Glue Workflow
1. Use a crawler to infer the schema of your data.
2. Use Classifiers to populate the AWS Glue Data Catalog with table definitions.
3. Crawler connects to the data store.
4. Schema inference occurs.
5. Define a job that describes the transformation of data from source to target.
6. Run your job to transform your data.
7. Monitor

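
The workflow steps can be sketched with the boto3 Glue client. This is a minimal outline rather than a production script: every name, ARN, and S3 path passed in is a hypothetical placeholder, and the run function assumes valid AWS credentials and that the ETL script already exists in S3.

```python
def crawler_params(name, role_arn, database, s3_path):
    """Arguments for glue.create_crawler: which data store to crawl and
    which Data Catalog database receives the inferred table definitions."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def job_params(name, role_arn, script_location):
    """Arguments for glue.create_job: a Spark ETL job ('glueetl') whose
    script describes the source-to-target transformation."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
    }


def run_workflow(crawler, job):
    """Crawl, define the job, run it, and return the job run state."""
    import boto3  # imported here so the helpers above work without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**crawler)
    glue.start_crawler(Name=crawler["Name"])
    # In practice, wait for the crawler to finish before starting the job.
    glue.create_job(**job)
    run_id = glue.start_job_run(JobName=job["Name"])["JobRunId"]
    # Monitoring: poll the run state (for example RUNNING or SUCCEEDED).
    return glue.get_job_run(JobName=job["Name"], RunId=run_id)["JobRun"]["JobRunState"]
```

Keeping the parameter builders separate from the AWS calls makes the configuration easy to review and unit test before anything is created in an account.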
Amazon Athena: Interactive Analysis

Interactive query service to analyze data in Amazon S3 using standard SQL

No infrastructure to set up or manage and no data to load
Supports multiple data formats; define schema on demand

Query Instantly | Pay per query | Open | Easy


Amazon QuickSight
(Data, visualization, engagement, & ML)

First BI service built for the cloud with pay-per-session pricing & ML insights

Elastic Scaling
• Auto-scale 10 to 10K+ users in minutes
• Pay-as-you-go

Serverless
• Create dashboards in minutes
• Deploy globally without provisioning a single server

Deeply integrated with AWS services
• Secure, private access to AWS data
• Integrated S3 data lake permissions through AWS IAM

API Support
• Programmatically onboard users and manage content
• Easily embed in your apps


Custom Cost Analytics Service

Challenge
The UK Home Office needed to build a Cost Analytics service
that internal customers use to consume reports around team
utilization of their shared Kubernetes infrastructure on a
pod level

Solution
Home Office implemented a custom-built Cost Analytics
solution using AWS Lambda, Amazon CloudWatch, Amazon
S3, AWS Glue, Amazon Athena, and Amazon QuickSight

Benefit
Reporting has driven behavioral changes for teams to reduce
costs by right-sizing the storage and compute, using
reserved instances, and scheduling. They are also working
on a Cost Efficiency Rating report that scores teams based
on various savings and efficiency techniques per service.
The solution is driving down costs for the Home Office and,
hence, for taxpayers.

Predictive insights with AWS ML & AI services
Broadest and most complete set of Machine Learning capabilities

AI SERVICES
• Vision: Amazon Rekognition
• Speech: Amazon Polly | Amazon Transcribe (+Medical)
• Text: Amazon Comprehend (+Medical) | Amazon Translate | Amazon Textract
• Search: Amazon Kendra
• Chatbots: Amazon Lex
• Personalization: Amazon Personalize
• Forecasting: Amazon Forecast
• Fraud: Amazon Fraud Detector
• Development: Amazon CodeGuru
• Contact centers: Contact Lens for Amazon Connect

ML SERVICES
Amazon SageMaker: SageMaker Studio IDE | Ground Truth | AWS Marketplace for ML | Notebooks | Experiments | Processing | Built-in algorithms | Model training & tuning | Debugger | Autopilot | Model hosting | Model Monitor | Neo | Augmented AI

ML FRAMEWORKS & INFRASTRUCTURE
Deep Learning AMIs & Containers | GPUs & CPUs | Elastic Inference | Inferentia | FPGA | DeepGraphLibrary


Serverless data lake & analytics with AWS

[Diagram: Amazon S3 → (1) AWS Glue Crawler → (2) AWS Glue Data Catalog → Amazon Athena | Amazon EMR | Amazon Redshift Spectrum | Amazon QuickSight]

1 Crawlers scan your data sets and populate the Glue Data Catalog
2 The Glue Data Catalog serves as a central metadata repository
3 Once catalogued in Glue, your data is immediately available for analytics

https://github.com/aws-samples/serverless-data-analytics


Serverless analytics
Deliver cost-effective analytic solutions faster

• Serverless: no infrastructure, no administration
• Never pay for idle resources
• Automatically scales resources with usage
• Availability and fault tolerance built in

[Diagram: Devices | Web | Sensors | Social → AWS IoT → Amazon S3 data lake, with AWS Glue Data Catalog, Amazon Athena, Amazon AI/ML, and Amazon QuickSight]

Serverless analytics
Proof-of-concept estimation

• AWS IoT: 10,000 devices, 8 KB/device/hr: ~$35/month
• Amazon S3 data lake: 100 GB/month: ~$23/month
• AWS Glue Data Catalog: 30 partitions/month: $0/month
• Amazon Athena: 5,000 queries/month at $0.005/query
• Amazon QuickSight: ~$5/user/month

Ingest = ~$35
Storage = ~$23
Query = ~$25
5 BI users = ~$25

• Total POC cost = ~$108/month
• That’s $3.60/day

* This is a hypothetical example. Costs will vary based on actual workload.
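
The proof-of-concept total can be checked with a few lines of arithmetic. The sketch below uses only the figures from the estimate above, with the query and BI-user lines derived from their unit prices; the 30-day month is an assumption.

```python
# Monthly cost components in USD, taken from the hypothetical POC estimate.
monthly_costs_usd = {
    "ingest (AWS IoT)": 35.0,
    "storage (Amazon S3)": 23.0,
    "query (Amazon Athena)": 5_000 * 0.005,   # 5,000 queries at $0.005 each
    "BI users (Amazon QuickSight)": 5 * 5.0,  # 5 users at ~$5/user/month
}

total_per_month = sum(monthly_costs_usd.values())
total_per_day = total_per_month / 30  # assuming a 30-day month

if __name__ == "__main__":
    for item, cost in monthly_costs_usd.items():
        print(f"{item:32s} ${cost:7.2f}")
    print(f"{'total per month':32s} ${total_per_month:7.2f}")
    print(f"{'total per day':32s} ${total_per_day:7.2f}")
```

Laying the estimate out this way makes it easy to swap in a different query volume or user count and see which component dominates.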

Serverless analytics
Production workload estimation

• ETL: 20 x m4.xlarge, 10 hr/day: ~$1,900/month on demand
• Amazon S3 data lake: 100 TB: ~$2,200/month
• AWS Glue Data Catalog
• Amazon Athena: 100,000 queries/month at $0.005 (1 GB)/query
• Amazon QuickSight: ~$5/user/month
• Amazon SageMaker Notebooks: 5 notebooks (m4.xl), 8 hr: ~$360/month

ETL = ~$1,000 (on-demand price reduced by Spot discount; up to 90% discount with Spot Instances)
Storage = ~$2,200
Query = ~$500
100 BI users = ~$500
Notebooks = ~$360

* Total cost = ~$4,560/month

Common Streaming Analytics Architecture

[Diagram: Applications → Amazon Kinesis Firehose → S3 Data Lake → Amazon Athena | Presto/Spark on EMR | Amazon Redshift Data Warehouse → Machine Learning | Data science | Reporting]

Potential problem:
1. Too many small files
2. Not necessarily optimized for analytics


Log analytics, ClickStream analytics, IoT sensor data

[Diagram: Applications → Kinesis Firehose → Tier 1 S3 Datalake: Raw Data → Glue ETL (hourly compactions to Parquet/ORC) → Tier 2 S3 Datalake: Analytics → Athena | Presto/Spark on EMR | Amazon Redshift Data Warehouse → Machine Learning | Data science | Reporting]
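
To illustrate how hourly compaction addresses the small-files problem, the sketch below plans a compaction pass: it groups raw objects by hour prefix and estimates how many output files each hour needs. The key layout `raw/YYYY/MM/DD/HH/<file>` and the 128 MiB target size are assumptions for illustration; the actual rewrite to Parquet/ORC would be done by the ETL job.

```python
from collections import defaultdict
from math import ceil

TARGET_FILE_BYTES = 128 * 1024 * 1024  # assumed target size per compacted file


def plan_compaction(objects, target_bytes=TARGET_FILE_BYTES):
    """Group raw objects by hour prefix and size the compacted output.

    `objects` is an iterable of (key, size_in_bytes) pairs whose keys look
    like 'raw/YYYY/MM/DD/HH/<file>' (a hypothetical layout).
    Returns {hour_prefix: (input_file_count, output_file_count)}.
    """
    by_hour = defaultdict(list)
    for key, size in objects:
        hour_prefix = key.rsplit("/", 1)[0]  # drop the file name
        by_hour[hour_prefix].append(size)
    return {
        prefix: (len(sizes), max(1, ceil(sum(sizes) / target_bytes)))
        for prefix, sizes in by_hour.items()
    }


if __name__ == "__main__":
    one_mib = 1024 * 1024
    raw = [(f"raw/2019/01/01/00/part-{i}", one_mib) for i in range(300)]
    raw += [(f"raw/2019/01/01/01/part-{i}", one_mib) for i in range(2)]
    for prefix, (n_in, n_out) in plan_compaction(raw).items():
        print(f"{prefix}: {n_in} small files -> {n_out} compacted file(s)")
```

Fewer, larger files reduce per-object request overhead for query engines, which is why the Tier 2 analytics layer is kept separate from the raw Tier 1 data.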


Replacing a database replica

[Diagram: Databases → AWS DMS → S3 Data Lake → Athena | Presto/Spark on EMR | Amazon Redshift Data Warehouse → Machine Learning | Data science | Reporting]

Potential problem:
Updates and deletes create new versions of records


Replacing a database replica

[Diagram: Databases → DMS → Tier 1 S3 Datalake: Raw Data → Glue ETL → Tier 2 S3 Datalake: Analytics → Athena | Presto/Spark on EMR | Amazon Redshift Data Warehouse]

Create Views to preserve the database view of records

Potential problem:
Grouping records in Views can be expensive over time
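
A view that preserves the database view of records has to collapse every change record down to the latest surviving version per key. The sketch below shows that logic in plain Python; the record fields (`op` with values I/U/D, `id`, `ts`) are hypothetical names chosen for illustration, not a documented schema.

```python
def current_view(change_records):
    """Collapse change records to the latest surviving row per primary key.

    Each record is a dict with hypothetical fields: 'op' ('I' insert,
    'U' update, 'D' delete), 'id' (primary key), 'ts' (change timestamp),
    plus the row's own columns.
    """
    latest = {}
    for record in sorted(change_records, key=lambda r: r["ts"]):
        if record["op"] == "D":
            latest.pop(record["id"], None)  # deleted rows leave the view
        else:
            latest[record["id"]] = record   # inserts and updates overwrite
    return list(latest.values())


if __name__ == "__main__":
    changes = [
        {"op": "I", "id": 1, "ts": 1, "name": "a"},
        {"op": "U", "id": 1, "ts": 2, "name": "b"},
        {"op": "I", "id": 2, "ts": 3, "name": "c"},
        {"op": "D", "id": 2, "ts": 4, "name": "c"},
    ]
    print(current_view(changes))
```

Every query against such a view re-reads the full change history, so its cost grows as changes accumulate, which is the expense the slide warns about.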


Replacing a database replica

[Diagram: Databases → DMS → Tier 1 S3 Datalake: Raw Data → Glue ETL → Tier 2 S3 Datalake: Analytics → Snapshot Analytics → Athena | Presto/Spark on EMR | Amazon Redshift Data Warehouse]

Create daily snapshots to preserve the database view of records


Machine Learning: Predictions on streaming data

[Diagram: Databases → Kinesis Firehose, with Lambda calling Amazon SageMaker Endpoints → Tier 1 S3 Datalake: Raw Data → Glue ETL → Tier 2 S3 Datalake: Analytics → Athena | Presto/Spark on EMR | Amazon Redshift Data Warehouse]

Potential problem:
Tier 1 raw data should have the least transformations


Machine Learning: Predictions on streaming data

[Diagram: Databases → Kinesis Firehose → Tier 1 S3 Datalake: Raw Data → Glue ETL calling Amazon SageMaker Endpoints → Tier 2 S3 Datalake: Analytics → Athena | Presto/Spark on EMR | Amazon Redshift Data Warehouse]


Modern Data Architecture

Insights to enhance business applications, new digital services

[Diagram, by layer]
• Security and monitoring: AWS IAM | AWS KMS | AWS CloudTrail | Amazon CloudWatch
• Data sources: Internet Interfaces | Network xDRs | IoT | Social Media | Legacy Apps | Web / Logs | Internet of Things | OSS/BSS
• Ingest: AWS Direct Connect | AWS Database Migration | Event Capture (Amazon Kinesis)
• Scale (Batch): Raw Data (Amazon S3; schemaless, semi/unstructured) | Hadoop (Amazon EMR) | ETL (AWS Glue) | Staged Data (Data Lake) (Amazon S3)
• Speed (Real-Time): Stream Analysis (Amazon EMR, Amazon Kinesis) | Near-Zero Latency Events
• Serving: Direct Query (Amazon Athena) | Amazon ElasticSearch | Advanced Analytics (MLlib, Amazon EMR) | Data Warehouse (Amazon Redshift) | Amazon Machine Learning | Amazon S3 | Amazon Athena | Amazon RDS | Amazon DynamoDB
• Consumers: Data Scientists | Data Analysts | Business Users | Engagement Platforms | Network Automation / Machine Learning / Auditing

Additional resources

Landing Page and Resource Hub
Data Lakes and Analytics on AWS – https://aws.amazon.com/big-data/datalakes-and-analytics/
Data, Analytics, and Machine Learning Resource Hub – https://resources.awscloud.com/aws-data-analytics-machinelearning
Modernize your Analytics and Data Architecture – https://aws.amazon.com/events/data-analytics-series/tech/

e-Books and Blogs
Data Lifecycle and Analytics in the AWS Cloud Reference Guide for Public Sector – https://pages.awscloud.com/data-lifecycle-reference-guide.html
Creating a Modern Analytics Architecture – https://resources.awscloud.com/aws-data-analytics-machinelearning/creating-a-modern-analytics-architecture-e-book
Harness the Power of Data – https://resources.awscloud.com/aws-data-analytics-machinelearning/harness-the-power-of-data-e-book
Importance of data in today’s digital transformation – https://resources.awscloud.com/aws-data-analytics-machinelearning/harness-the-power-of-data-e-book
Becoming a Data-driven Organization – https://d1.awsstatic.com/executive-insights/en_US/ebook-becoming-data-driven.pdf
Building a Data Lake on AWS – https://s3.amazonaws.com/big-data-ipc/AWS_Data-Lake_eBook.pdf
How to Create a Data-Driven Culture – https://aws.amazon.com/blogs/enterprise-strategy/how-to-create-a-data-driven-culture/
How to Build Data Capabilities – https://aws.amazon.com/blogs/enterprise-strategy/how-to-build-data-capabilities/
Enter the Purpose-Built Database Era – https://pages.awscloud.com/rs/112-TZM-766/images/Enter_the_Purpose-Built-Database-Era.pdf

Whitepapers
Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility – https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf
Big Data Analytics Options on AWS – http://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf

Thank You
