AWS Data Lake

A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. The Data Catalog extracts metadata about the data in the data lake and exposes it to users so they know what data is available, its format, and its sensitivity. It extracts metadata when new data is added to the data lake and stores it in DynamoDB.

A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis.

Data analysts and scientists ask: "Which of these tables actually contain the data that I need for my analysis?" This question is addressed by the Data Catalog.

Data is collected and stored in S3; you then extract the metadata via a Data Catalog process to give information about what lives in the data lake.

Data Catalog:- How do people know what is in my data lake?


Extract metadata about your data and expose it to users.
For ex 1> What format the data is kept in, i.e. JSON, CSV
2> Whether the data is sensitive or classified
3> Enable data discovery based on search tags, e.g. search based on tweets or department, so that people can find data through a search mechanism and process it
4> Build an API layer on top of it so that different teams can consume it

DATA CATALOG:-
1> Data is collected in S3. S3 will generate an event
2> When a new object arrives, it can trigger an event, i.e. an AWS Lambda function which runs a piece of code. You can have your own logic to extract metadata
3> Extract the metadata and store it in a NoSQL database, i.e. Amazon DynamoDB (persistent storage for your metadata)
4> Once the metadata comes in, it can trigger another AWS Lambda function which picks up the data and pushes it to Amazon Elasticsearch Service, which the team can use to search what is in the catalog
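
A minimal sketch of the metadata-extraction Lambda from steps 1-3, assuming a hypothetical DynamoDB table named data-lake-catalog and a simple file-extension heuristic for the format (none of these names come from the notes):

# Hypothetical sketch: triggered by an S3 "object created" event, extracts basic
# metadata and writes it to DynamoDB. Table name, attribute names, and the
# format/sensitivity heuristics are assumptions, not a standard layout.
import os
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("data-lake-catalog")  # assumed table name

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]
        # Infer the format from the file extension (ex 1: JSON, CSV, ...)
        fmt = os.path.splitext(key)[1].lstrip(".").lower() or "unknown"
        table.put_item(
            Item={
                "object_key": key,              # partition key (assumed)
                "bucket": bucket,
                "size_bytes": size,
                "format": fmt,
                "sensitivity": "unclassified",  # placeholder for ex 2
            }
        )

Step 4 would then typically hang off the same table, for example via a DynamoDB Stream that invokes a second Lambda to index each new item into Amazon Elasticsearch Service.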

AWS Glue offers the Data Catalog service and a managed ETL service. AWS Glue's ETL engine is built on Apache Spark.

Crawlers can crawl different data sources, extract metadata, and populate your catalog.
For ex:- crawl through your S3 buckets, and when there are new objects it can extract
metadata and populate your catalog.

Similarly, crawl your databases and data warehouses.
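
As a sketch of wiring this up, a crawler over an S3 prefix could be registered with boto3 along these lines (the crawler name, IAM role, database, path, and schedule are placeholders, not values from the notes):

import boto3

glue = boto3.client("glue")

# Register a crawler that scans an S3 prefix on a schedule and writes the
# discovered table definitions into a Glue Data Catalog database.
glue.create_crawler(
    Name="raw-zone-crawler",                                  # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",    # placeholder role
    DatabaseName="datalake_raw",                              # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/raw/"}]},
    Schedule="cron(0 * * * ? *)",                             # hourly, optional
)
glue.start_crawler(Name="raw-zone-crawler")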


Define your ETL pipeline manually as below.

For ex, specify where your data sources live, the transformations to apply, and the
destination to push the result to. You can define all of this.

Glue will generate Python code from your definition.

Then run the ETL job on a particular frequency or schedule on a Spark cluster that Glue
manages on your behalf. You are charged only when your job runs.
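
The script Glue generates is PySpark built on Glue's own libraries; a stripped-down sketch of that shape, with an assumed catalog database, table, and output path, would look roughly like:

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: a table a crawler registered in the Data Catalog (assumed names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="tweets_csv"
)

# Transformation: rename a couple of columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "tweet_id", "string"),
              ("text", "string", "body", "string")],
)

# Destination: write Parquet back to a curated S3 prefix (assumed path).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/curated/tweets/"},
    format="parquet",
)
job.commit()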

People with different tools who want to access data:-

Devs write Python/Scala code
Business users go to a dashboard
Data scientists/data analysts write SQL

Amazon EMR:- EMRFS exposes your S3 data as an HDFS filesystem for your Hadoop cluster. Your
application still talks to S3, but EMRFS makes it look like HDFS.
EMR is the processing engine used to aggregate the data and dump it out for users to
do analysis.
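
For example, a PySpark job running on EMR can read straight from an s3:// path via EMRFS as if it were HDFS, aggregate, and write the result back to S3 for analysts (bucket, path, and column names below are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("emr-aggregation").getOrCreate()

# EMRFS lets Spark treat the S3 path like an HDFS path.
events = spark.read.json("s3://my-datalake-bucket/raw/events/")  # placeholder path

# Aggregate the raw events and write the result back to S3 for analysis.
daily = events.groupBy("event_date", "department").agg(F.count("*").alias("events"))
daily.write.mode("overwrite").parquet("s3://my-datalake-bucket/curated/daily_events/")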

Amazon Redshift is used for all BI dashboard use cases.

When you go to Redshift, you spin up a cluster. The cluster has a leader node which exposes
a JDBC/ODBC endpoint. Any SQL client can connect to the leader node using that JDBC/ODBC
endpoint. Behind the leader node are the compute nodes, which store the data. When you
submit a query, it goes to the leader node, which computes the query execution plan and
generates C++ code that is sent to all the compute nodes; the compute nodes work in
parallel to execute the query and send their results back to the leader node, which does
the final aggregation and sends the data back to the SQL client.
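
Redshift speaks the PostgreSQL wire protocol, so any standard client or driver can hit the leader node's endpoint; a small sketch with psycopg2, using a placeholder endpoint, credentials, and table:

import psycopg2

# Connect to the leader node's endpoint (host/credentials are placeholders).
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="********",
)

with conn, conn.cursor() as cur:
    # The leader node plans this query and fans it out to the compute nodes.
    cur.execute("SELECT department, COUNT(*) FROM events GROUP BY department;")
    for department, total in cur.fetchall():
        print(department, total)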

A compute node is partitioned into slices. Each slice is allocated a portion of the node's
memory and disk space, where it processes a portion of the workload assigned to the
node. The slices then work in parallel to complete the operation

All the data is distributed in parallel across the cluster.

As data grows, you can add more compute nodes to the cluster to store data.
Start with a single-node cluster and then move to a multi-node cluster, or vice versa.

Increase compute nodes to scale up
Decrease compute nodes to scale down
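
Resizing is an API call; a hedged sketch with boto3 (cluster identifier and node count are placeholders):

import boto3

redshift = boto3.client("redshift")

# Elastic resize: change the number of compute nodes to scale out or in.
redshift.resize_cluster(
    ClusterIdentifier="my-cluster",   # placeholder identifier
    ClusterType="multi-node",
    NumberOfNodes=4,
)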

Benefits of columnar storage:-

1> Better data compression
2> More data can be stored, because of which I/O decreases
3> Better analysis on similar kinds of data, as they are stored in one block
4> Zone maps: these store the min and max value of all the blocks in memory, which helps to
skip unwanted blocks
5> Large block size
6> Direct-attached storage, so no network issues

Amazon Redshift Spectrum can directly query the data lake without having to
move data into Redshift.

With data sitting in S3, you can write SQL queries to analyze it. Amazon
Athena runs SQL queries against your S3 data.
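
A sketch of submitting such a query through the Athena API with boto3, assuming the table is already registered in the Glue Data Catalog and using placeholder database, table, and result-bucket names:

import time
import boto3

athena = boto3.client("athena")

# Submit a SQL query that scans data directly in S3 (names are placeholders).
query = athena.start_query_execution(
    QueryString="SELECT department, COUNT(*) AS n FROM tweets GROUP BY department",
    QueryExecutionContext={"Database": "datalake_raw"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # first row is the column header
        print([col.get("VarCharValue") for col in row["Data"]])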
Amazon QuickSight:-

SPICE is the in-memory engine. You can load data into this in-memory engine, which
speeds up queries for all the dashboards.
Security and governance services around the data lake:-
VPC
IAM
KMS
AWS CloudTrail

Building an Amazon S3 Data Lake Using AWS Database Migration Service:
https://www.youtube.com/watch?v=VRhKO71mEvs
