AWS Data Lake

A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. The Data Catalog extracts metadata about the data in the data lake and exposes it to users so they know what data is available, its format, and its sensitivity. It extracts metadata when new data is added to the data lake and stores it in DynamoDB.

A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis.

Data analysts and scientists ask: "Which of these tables actually contain the data that I need for my analysis?" This question is addressed by the Data Catalog.

Data is collected and stored in S3; you then extract the metadata via a Data Catalog process to give information about what lives in the data lake.

Data Catalog:- How do people know what is in my data lake?


Extract metadata about your data and expose it to users.
For ex 1> What format the data is kept in, i.e. JSON, CSV
2> Whether the data is sensitive or classified
3> Enable data discovery based on search tags, e.g. search based on tweets or department, so that people can find data through a search mechanism and process it
4> Build an API layer on top of it so that different teams can consume it

DATA CATALOG:-
1> Data is collected in S3. S3 will generate an event
2> When a new object arrives, it can trigger an event, i.e. an AWS Lambda function which runs a piece of code. You can have your own logic to extract metadata
3> Extract the metadata and store it in a NoSQL database, i.e. Amazon DynamoDB (persistent storage for your metadata)
4> Once the metadata comes in, it can trigger another AWS Lambda function which picks up the data and pushes it to Amazon Elasticsearch Service, which the team can use to search what is in the catalog
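
A minimal sketch of the metadata-extraction Lambda from steps 1-3, assuming a hypothetical DynamoDB table named data-lake-catalog and a simple file-extension heuristic for the format (none of these names come from the notes):

# Hypothetical sketch: triggered by an S3 "object created" event, extracts basic
# metadata and writes it to DynamoDB. Table name, attribute names, and the
# format/sensitivity heuristics are assumptions, not a standard layout.
import os
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("data-lake-catalog")  # assumed table name

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]
        # Infer the format from the file extension (ex 1: JSON, CSV, ...)
        fmt = os.path.splitext(key)[1].lstrip(".").lower() or "unknown"
        table.put_item(
            Item={
                "object_key": key,              # partition key (assumed)
                "bucket": bucket,
                "size_bytes": size,
                "format": fmt,
                "sensitivity": "unclassified",  # placeholder for ex 2
            }
        )

Step 4 would then typically hang off the same table, for example via a DynamoDB Stream that invokes a second Lambda to index each new item into Amazon Elasticsearch Service.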

AWS Glue offers the Data Catalog service and a managed ETL service. AWS Glue's ETL engine is built on Apache Spark.

Crawlers can crawl different data sources, extract metadata, and populate your catalog.
For ex:- crawl through your S3 buckets, and when there are new objects it can extract
metadata and populate your catalog.

Similarly, crawl your databases and data warehouses.
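
As a sketch of wiring this up, a crawler over an S3 prefix could be registered with boto3 along these lines (the crawler name, IAM role, database, path, and schedule are placeholders, not values from the notes):

import boto3

glue = boto3.client("glue")

# Register a crawler that scans an S3 prefix on a schedule and writes the
# discovered table definitions into a Glue Data Catalog database.
glue.create_crawler(
    Name="raw-zone-crawler",                                  # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",    # placeholder role
    DatabaseName="datalake_raw",                              # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/raw/"}]},
    Schedule="cron(0 * * * ? *)",                             # hourly, optional
)
glue.start_crawler(Name="raw-zone-crawler")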


Define your ETL pipeline manually as below.

For ex, specify where your data sources live, the transformations to apply, and the
destination to push the result to. You can define all of this.

Glue will generate Python code from your definition.

Then run the ETL job on a particular frequency or schedule on a Spark cluster that Glue
manages on your behalf. You are charged only when your job runs.
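
The script Glue generates is PySpark built on Glue's own libraries; a stripped-down sketch of that shape, with an assumed catalog database, table, and output path, would look roughly like:

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: a table a crawler registered in the Data Catalog (assumed names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="tweets_csv"
)

# Transformation: rename a couple of columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "tweet_id", "string"),
              ("text", "string", "body", "string")],
)

# Destination: write Parquet back to a curated S3 prefix (assumed path).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/curated/tweets/"},
    format="parquet",
)
job.commit()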

People with different tools who want to access data:-

Devs write Python/Scala code
Business users go to a dashboard
Data scientists/data analysts write SQL

Amazon EMR:- EMRFS exposes your S3 data as an HDFS filesystem for your Hadoop cluster. Your
application still talks to S3, but EMRFS makes it look like HDFS.
EMR is the processing engine used to aggregate the data and dump it out for users to
do analysis.
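
For example, a PySpark job running on EMR can read straight from an s3:// path via EMRFS as if it were HDFS, aggregate, and write the result back to S3 for analysts (bucket, path, and column names below are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("emr-aggregation").getOrCreate()

# EMRFS lets Spark treat the S3 path like an HDFS path.
events = spark.read.json("s3://my-datalake-bucket/raw/events/")  # placeholder path

# Aggregate the raw events and write the result back to S3 for analysis.
daily = events.groupBy("event_date", "department").agg(F.count("*").alias("events"))
daily.write.mode("overwrite").parquet("s3://my-datalake-bucket/curated/daily_events/")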

Amazon Redshift is used for all BI dashboard use cases.

When you go to Redshift, you spin up a cluster. The cluster has a leader node which exposes
a JDBC/ODBC endpoint. Any SQL client can connect to the leader node using that JDBC/ODBC
endpoint. Behind the leader node are the compute nodes, which store the data. When you
submit a query, it goes to the leader node, which computes the query execution plan and
generates C++ code that is sent to all the compute nodes; the compute nodes work in
parallel to execute the query and send their results back to the leader node, which does
the final aggregation and sends the data back to the SQL client.
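
Redshift speaks the PostgreSQL wire protocol, so any standard client or driver can hit the leader node's endpoint; a small sketch with psycopg2, using a placeholder endpoint, credentials, and table:

import psycopg2

# Connect to the leader node's endpoint (host/credentials are placeholders).
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="********",
)

with conn, conn.cursor() as cur:
    # The leader node plans this query and fans it out to the compute nodes.
    cur.execute("SELECT department, COUNT(*) FROM events GROUP BY department;")
    for department, total in cur.fetchall():
        print(department, total)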

A compute node is partitioned into slices. Each slice is allocated a portion of the node's
memory and disk space, where it processes a portion of the workload assigned to the
node. The slices then work in parallel to complete the operation

All the data is distributed in parallel across the cluster.

As data grows, you can add more compute nodes to the cluster to store data.
Start with a single-node cluster and then move to a multi-node cluster, or vice versa.

Increase compute nodes to scale up
Decrease compute nodes to scale down
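
Resizing is an API call; a hedged sketch with boto3 (cluster identifier and node count are placeholders):

import boto3

redshift = boto3.client("redshift")

# Elastic resize: change the number of compute nodes to scale out or in.
redshift.resize_cluster(
    ClusterIdentifier="my-cluster",   # placeholder identifier
    ClusterType="multi-node",
    NumberOfNodes=4,
)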

Benefits of columnar storage:-

1> Better data compression
2> More data can be stored, because of which I/O decreases
3> Better analysis on similar kinds of data, as they are stored in one block
4> Zone maps: these store the min and max value of all the blocks in memory, which helps to
skip unwanted blocks
5> Large block size
6> Direct-attached storage, so no network issues

Amazon Redshift Spectrum can directly query the data lake without having to
move data into Redshift.

With data sitting in S3, you can write SQL queries to analyze it. Amazon
Athena runs SQL queries against your S3 data.
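
A sketch of submitting such a query through the Athena API with boto3, assuming the table is already registered in the Glue Data Catalog and using placeholder database, table, and result-bucket names:

import time
import boto3

athena = boto3.client("athena")

# Submit a SQL query that scans data directly in S3 (names are placeholders).
query = athena.start_query_execution(
    QueryString="SELECT department, COUNT(*) AS n FROM tweets GROUP BY department",
    QueryExecutionContext={"Database": "datalake_raw"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # first row is the column header
        print([col.get("VarCharValue") for col in row["Data"]])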
Amazon QuickSight:-

SPICE is the in-memory engine. You can load data into this in-memory engine, which
speeds up queries for all the dashboards.
Security and governance services around the data lake:-
VPC
IAM
KMS
AWS CloudTrail

Building an Amazon S3 Data Lake Using AWS Database Migration Service:
https://www.youtube.com/watch?v=VRhKO71mEvs
