AWS Data Lake
:-
Data is collected and stored in S3; you then extract the metadata via a Data Catalogue
process to describe what lives in the data lake
DATA CATALOGUE:-
1> Data is collected in S3. S3 generates an event when a new object arrives
2> That event can trigger an AWS Lambda function, which runs a piece of code. You can add
your own logic there to extract metadata
3> Extract the metadata and store it in a NoSQL database, i.e. Amazon DynamoDB
(persistent storage for your metadata); see the Lambda sketch after this list
4> Once the metadata comes in, it can trigger another AWS Lambda function which picks up
the data and pushes it to Amazon Elasticsearch Service, which the team can use to search
what is in the catalogue
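A minimal sketch of steps 1-3 in Python with boto3, assuming an S3 "ObjectCreated" event
notification is wired to the Lambda function and a hypothetical DynamoDB table named
DataCatalog already exists:

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("DataCatalog")  # hypothetical metadata table

def handler(event, context):
    # Triggered by an S3 "ObjectCreated" event notification
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Fetch basic metadata about the new object
        head = s3.head_object(Bucket=bucket, Key=key)
        # Persist the metadata in DynamoDB so the catalogue can be searched later
        table.put_item(Item={
            "objectKey": f"s3://{bucket}/{key}",
            "sizeBytes": head["ContentLength"],
            "contentType": head.get("ContentType", "unknown"),
            "lastModified": head["LastModified"].isoformat(),
        })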
AWS GLUE offers the Data Catalog service and runs ETL jobs. AWS Glue is built on Apache
Spark
Crawlers can crawl different data sources, extract metadata, and populate your catalog
For ex:- crawl through your S3 buckets; when there are new objects, the crawler extracts
metadata and populates your catalog
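For example, a crawler over an S3 path can be created and started with boto3; the crawler
name, IAM role ARN, database, and bucket path below are hypothetical:

import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix and populates the Data Catalog
glue.create_crawler(
    Name="raw-orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/orders/"}]},
)

# Run it; new or changed objects are catalogued as tables and partitions
glue.start_crawler(Name="raw-orders-crawler")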
For example, you define where your data sources live, what transformations to apply, and
the destination to push the results to.
Glue generates Python code from your definition
You then run the ETL job on a particular frequency/schedule on a Spark cluster that Glue
manages on your behalf. You are charged only while your job runs
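A rough sketch of what such a generated Glue ETL script looks like in Python (PySpark),
assuming a hypothetical catalog database sales_db, source table raw_orders, and output
bucket:

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a table that a crawler populated in the Data Catalog
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Simple transformation: keep and rename a few columns
mapped = ApplyMapping.apply(frame=source, mappings=[
    ("order_id", "string", "order_id", "string"),
    ("amount", "double", "order_amount", "double"),
])

# Write the result back to S3 in a columnar format
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet")

job.commit()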
:-
Amazon EMR: EMRFS exposes your S3 data as an HDFS filesystem to your Hadoop cluster. Your
data still lives in S3, but EMRFS makes your application think it is talking to HDFS
EMR is the processing engine used to aggregate the data and dump it somewhere for users to
do analysis
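For example, a PySpark job running on EMR can read and write s3:// paths directly thanks
to EMRFS; the bucket, prefix, and column names below are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emr-aggregation").getOrCreate()

# On EMR, EMRFS lets Spark read "s3://" paths as if they were HDFS
events = spark.read.json("s3://my-bucket/raw/events/")

# Aggregate and write the result back to the data lake for analysis
daily = events.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_counts/")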
When you go to Redshift, you spin up a cluster. The cluster has a leader node which exposes
a JDBC/ODBC endpoint, and any SQL client can connect to the leader node through that
endpoint. Behind the leader node are the compute nodes, which store the data. When you
submit a query, it goes to the leader node, which computes the query execution plan and
generates C++ code that is sent to all the compute nodes. The compute nodes work in
parallel to execute the query and send their results back to the leader node, which does
the final aggregation and sends the data back to the SQL client.
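A minimal sketch of a SQL client talking to the leader node, here in Python with psycopg2;
the endpoint, database, credentials, and table are hypothetical placeholders:

import psycopg2

# Connect to the cluster's leader node endpoint
conn = psycopg2.connect(
    host="analytics-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="awsuser",
    password="...",  # placeholder credential
)

with conn, conn.cursor() as cur:
    # The leader node plans this query and distributes the work to the compute nodes
    cur.execute("SELECT event_date, COUNT(*) FROM events GROUP BY event_date;")
    for row in cur.fetchall():
        print(row)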
A compute node is partitioned into slices. Each slice is allocated a portion of the node's
memory and disk space, where it processes a portion of the workload assigned to the
node. The slices then work in parallel to complete the operation
As data grows you can add more compute nodes to the cluster to store the data
Start with a single-node cluster and then move to a multi-node cluster, or vice versa
Increase the number of compute nodes to scale up
Decrease the number of compute nodes to scale down
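For example, with boto3 you can resize a cluster by changing its node count; the cluster
identifier and node count below are hypothetical:

import boto3

redshift = boto3.client("redshift")

# Resize an existing cluster by changing the number of compute nodes
redshift.modify_cluster(
    ClusterIdentifier="analytics-cluster",
    ClusterType="multi-node",
    NumberOfNodes=4,
)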
Amazon Redshift Spectrum can query data directly from the data lake without having to
move the data into Redshift
Amazon Athena: data sits in S3 and you can write SQL queries to analyze it. Athena runs
SQL queries directly against your S3 data
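A minimal sketch of running an Athena query from Python with boto3, assuming a
hypothetical catalog database, table, and S3 output location for the results:

import time
import boto3

athena = boto3.client("athena")

# Run a SQL query directly against data sitting in S3
qid = athena.start_query_execution(
    QueryString="SELECT event_date, COUNT(*) FROM events GROUP BY event_date",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the results
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])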
Amazon QuickSight:-
SPICE is QuickSight's in-memory engine. You can load data into this in-memory engine,
which speeds up queries for all the dashboards
VPC
IAM
KMS
AWS CloudTrail