SlideShare a Scribd company logo
Data Lake Demonstration
Building a Simple Data Lake on AWS
Gary A. Stafford
Twitter/LinkedIn
GaryStafford
Blog
garystafford.medium.com
What is a Data Lake?
Agenda
What is a Data Lake?
Dataset
Source Code
Architecture
Demonstration
What is a Data Lake?
“A data lake is a central location that holds a large amount of data in its native, raw
format. Compared to a hierarchical data warehouse, which stores data in files or
folders, a data lake uses a flat architecture and object storage to store the data.” -
Databricks
“A centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to
first structure the data, and run different types of analytics—from dashboards and
visualizations to big data processing, real-time analytics, and machine learning to
guide better decisions.” - AWS
What is a Data Lake?
Dataset
Dataset
TICKIT database
E-commerce platform
Bringing together buyers and sellers of tickets to entertainments events
Designed for demonstrating Amazon Redshift
Small database consists of seven tables: two fact tables and five dimensions
Tables: Categories, Events, Venues, Users, Listings, Sales, Dates
https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html
Dataset
Dataset
Table Simulated Data Source Demo Data Source
Category Software as a Service (SaaS) Amazon RDS for PostgreSQL
Event Software as a Service (SaaS) Amazon RDS for PostgreSQL
Venue Software as a Service (SaaS) Amazon RDS for PostgreSQL
Listing Ecommerce Platform Amazon RDS for MySQL
Sales Ecommerce Platform Amazon RDS for MySQL
Date Ecommerce Platform Amazon RDS for MySQL
Users Customer Relationship Management (CRM) Microsoft SQL Server
Dataset
Source Code
github.com/garystafford/tickit-data-lake-demo
Architecture
Architecture: AWS Services Used
AWS Glue Studio (alt. AWS Glue DataBrew)
AWS Glue Data Catalog (alt. Apache Hive on EMR)
AWS Glue Crawlers (alt. CDC with Kafka Connect or DMS)
AWS Glue Jobs (alt. AWS Glue DataBrew, or Spark or Presto on EMR)
Amazon Athena (alt. Presto on EMR)
Amazon S3
Building a Data Lake on AWS
Architecture: Out of Scope (but critically important)
Change Data Capture (CDC): handling changes to systems of record
Transactional Storage Layer: Apache Hudi, Apache Iceberg, Delta Lake
Streaming Data: Spark Structured Streaming, Kinesis, Flink
Fine-grained Authorization: database-, table-, column-, and row-level access
Data Lineage: Tracking data as it flows from sources to consumption
Architecture: Out of Scope (but critically important)
Data Inspection: Scanning incoming data for sensitive info such as PII
DevOps/DataOps: Automating testing, deployment, job execution
Data Warehouse / Lake House Architecture
Data Lake Storage Tiering
Demonstration
Ad

More Related Content

What's hot (20)

Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
Dmitry Kan
 
Best Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar DatabasesBest Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar Databases
DATAVERSITY
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
Vikram Shinde
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
Divij Sehgal
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
Snowflake Computing
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
 
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDeploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
Databricks
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
confluent
 
Modern Data Architecture
Modern Data ArchitectureModern Data Architecture
Modern Data Architecture
Alexey Grishchenko
 
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Azure Cosmos DB
Azure Cosmos DBAzure Cosmos DB
Azure Cosmos DB
Mohamed Tawfik
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)
Adrien Blind
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
Jurriaan Persyn
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
CalvinSim10
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher   Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher
Tamir Dresher
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
Dmitry Kan
 
Best Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar DatabasesBest Practices in the Use of Columnar Databases
Best Practices in the Use of Columnar Databases
DATAVERSITY
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
Vikram Shinde
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
Snowflake Computing
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
 
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDeploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
Databricks
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
confluent
 
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)
Adrien Blind
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
Jurriaan Persyn
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
CalvinSim10
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher   Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher
Tamir Dresher
 

Similar to Building a Data Lake on AWS (9)

Building Data Lakes with Apache Airflow
Building Data Lakes with Apache AirflowBuilding Data Lakes with Apache Airflow
Building Data Lakes with Apache Airflow
Gary Stafford
 
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
Sungmin Kim
 
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon MeichtryAWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
Amazon Web Services Korea
 
AWS Big Data Landscape
AWS Big Data LandscapeAWS Big Data Landscape
AWS Big Data Landscape
Crishantha Nanayakkara
 
Your First Data Lake on AWS_Simon Elisha
Your First Data Lake on AWS_Simon ElishaYour First Data Lake on AWS_Simon Elisha
Your First Data Lake on AWS_Simon Elisha
Helen Rogers
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Amazon Web Services LATAM
 
Building Serverless Data Infrastructure in the AWS Cloud
Building Serverless Data Infrastructure in the AWS CloudBuilding Serverless Data Infrastructure in the AWS Cloud
Building Serverless Data Infrastructure in the AWS Cloud
Ryan Plant
 
DBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data ApplicationsDBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data Applications
decode2016
 
AWS Tech Talks - Data Lake Analytics
AWS Tech Talks - Data Lake AnalyticsAWS Tech Talks - Data Lake Analytics
AWS Tech Talks - Data Lake Analytics
Amazon Web Services LATAM
 
Building Data Lakes with Apache Airflow
Building Data Lakes with Apache AirflowBuilding Data Lakes with Apache Airflow
Building Data Lakes with Apache Airflow
Gary Stafford
 
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
Sungmin Kim
 
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon MeichtryAWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
Amazon Web Services Korea
 
Your First Data Lake on AWS_Simon Elisha
Your First Data Lake on AWS_Simon ElishaYour First Data Lake on AWS_Simon Elisha
Your First Data Lake on AWS_Simon Elisha
Helen Rogers
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Amazon Web Services LATAM
 
Building Serverless Data Infrastructure in the AWS Cloud
Building Serverless Data Infrastructure in the AWS CloudBuilding Serverless Data Infrastructure in the AWS Cloud
Building Serverless Data Infrastructure in the AWS Cloud
Ryan Plant
 
DBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data ApplicationsDBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data Applications
decode2016
 
Ad

More from Gary Stafford (6)

Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW...
Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW...Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW...
Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW...
Gary Stafford
 
Building Open Data Lakes on AWS with Debezium and Apache Hudi
Building Open Data Lakes on AWS with Debezium and Apache HudiBuilding Open Data Lakes on AWS with Debezium and Apache Hudi
Building Open Data Lakes on AWS with Debezium and Apache Hudi
Gary Stafford
 
How Mature is Your Infrastructure?
How Mature is Your Infrastructure?How Mature is Your Infrastructure?
How Mature is Your Infrastructure?
Gary Stafford
 
Infrastructure as Code Maturity Model v1
Infrastructure as Code Maturity Model v1Infrastructure as Code Maturity Model v1
Infrastructure as Code Maturity Model v1
Gary Stafford
 
Enterprise DevOps Adoption LinkedIn
Enterprise DevOps Adoption LinkedInEnterprise DevOps Adoption LinkedIn
Enterprise DevOps Adoption LinkedIn
Gary Stafford
 
From Zurich to the Cosmos, by Artist Steve Carpenter
From Zurich to the Cosmos, by Artist Steve CarpenterFrom Zurich to the Cosmos, by Artist Steve Carpenter
From Zurich to the Cosmos, by Artist Steve Carpenter
Gary Stafford
 
Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW...
Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW...Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW...
Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW...
Gary Stafford
 
Building Open Data Lakes on AWS with Debezium and Apache Hudi
Building Open Data Lakes on AWS with Debezium and Apache HudiBuilding Open Data Lakes on AWS with Debezium and Apache Hudi
Building Open Data Lakes on AWS with Debezium and Apache Hudi
Gary Stafford
 
How Mature is Your Infrastructure?
How Mature is Your Infrastructure?How Mature is Your Infrastructure?
How Mature is Your Infrastructure?
Gary Stafford
 
Infrastructure as Code Maturity Model v1
Infrastructure as Code Maturity Model v1Infrastructure as Code Maturity Model v1
Infrastructure as Code Maturity Model v1
Gary Stafford
 
Enterprise DevOps Adoption LinkedIn
Enterprise DevOps Adoption LinkedInEnterprise DevOps Adoption LinkedIn
Enterprise DevOps Adoption LinkedIn
Gary Stafford
 
From Zurich to the Cosmos, by Artist Steve Carpenter
From Zurich to the Cosmos, by Artist Steve CarpenterFrom Zurich to the Cosmos, by Artist Steve Carpenter
From Zurich to the Cosmos, by Artist Steve Carpenter
Gary Stafford
 
Ad

Recently uploaded (20)

md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 

Building a Data Lake on AWS

  • 1. Data Lake Demonstration Building a Simple Data Lake on AWS Gary A. Stafford
  • 3. What is a Data Lake?
  • 4. Agenda What is a Data Lake? Dataset Source Code Architecture Demonstration
  • 5. What is a Data Lake? “A data lake is a central location that holds a large amount of data in its native, raw format. Compared to a hierarchical data warehouse, which stores data in files or folders, a data lake uses a flat architecture and object storage to store the data.” - Databricks “A centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.” - AWS
  • 6. What is a Data Lake?
  • 8. Dataset TICKIT database E-commerce platform Bringing together buyers and sellers of tickets to entertainments events Designed for demonstrating Amazon Redshift Small database consists of seven tables: two fact tables and five dimensions Tables: Categories, Events, Venues, Users, Listings, Sales, Dates https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html
  • 10. Dataset Table Simulated Data Source Demo Data Source Category Software as a Service (SaaS) Amazon RDS for PostgreSQL Event Software as a Service (SaaS) Amazon RDS for PostgreSQL Venue Software as a Service (SaaS) Amazon RDS for PostgreSQL Listing Ecommerce Platform Amazon RDS for MySQL Sales Ecommerce Platform Amazon RDS for MySQL Date Ecommerce Platform Amazon RDS for MySQL Users Customer Relationship Management (CRM) Microsoft SQL Server
  • 15. Architecture: AWS Services Used AWS Glue Studio (alt. AWS Glue DataBrew) AWS Glue Data Catalog (alt. Apache Hive on EMR) AWS Glue Crawlers (alt. CDC with Kafka Connect or DMS) AWS Glue Jobs (alt. AWS Glue DataBrew, or Spark or Presto on EMR) Amazon Athena (alt. Presto on EMR) Amazon S3
  • 17. Architecture: Out of Scope (but critically important) Change Data Capture (CDC): handling changes to systems of record Transactional Storage Layer: Apache Hudi, Apache Iceberg, Delta Lake Streaming Data: Spark Structured Streaming, Kinesis, Flink Fine-grained Authorization: database-, table-, column-, and row-level access Data Lineage: Tracking data as it flows from sources to consumption
  • 18. Architecture: Out of Scope (but critically important) Data Inspection: Scanning incoming data for sensitive info such as PII DevOps/DataOps: Automating testing, deployment, job execution Data Warehouse / Lake House Architecture Data Lake Storage Tiering