DATA VAULT 2.0:
Big Data Meets Data Warehousing
DEAN HALLMAN
WIRESOFT, LLC
DATA WAREHOUSING VS BIG DATA
• Does Big Data replace Data Warehousing? Or do I need both?
• What’s the difference:
• Between the data flowing into a data warehouse vs big data tools?
• Between the ingestion processes and infrastructure?
• Data Lakes arrived with Big Data, so are they useful in Data Warehousing?
• How should I model my data in EDW?
• 3NF, Star Schema, same as my operational data stores?
• Data Vault 2.0
• Graph Databases
• What is an architecture that allows both to coexist effectively?
[Architecture diagram, labels recovered from the slide: Clients (clickstream, SerDe); Internet; Core Business Services (x3) with Operational Data Stores (CDC, snapshot); External Data Sources; Data Lake with Data Source Landing; Big Data Toolchain (Schema-on-Read): Impressions (Big Data), Batch (SerDe), Streaming (Kafka), Streaming Analytics, Batch Analytics (Hadoop); Enterprise Data Warehouse (Schema-on-Write): Staging Vault, Raw Vault, Business Vault, Information Mart; ETL and ELT steps; BI Tools, Monitoring, Discovery, Audit]
THE DATA MODEL
DATA VAULT 2.0
COMMON FOUNDATIONAL WAREHOUSE ARCHITECTURE
• “The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise” -- Dan Linstedt, Creator of Data Vault
• Data loaded as-is from sources, no edits or cleanup
• Append-only to afford highest performance
• Agile & agnostic to changes in the operational store’s data model
• Essentially, a prescription for Layered Graph to Relational Mapping
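The "as-is" and "append-only" points above can be sketched as follows. This is a minimal illustration, not the deck's implementation: the table name sat_flight, its columns, the hash_diff helper, and the sample rows are all assumptions made for the example. The idea is that a satellite row is only ever inserted, never updated, and a new row is appended only when the incoming attributes differ from the latest stored ones.

```python
import hashlib
import sqlite3


def hash_diff(attrs: dict) -> str:
    """Hash the attribute values so a change can be detected with one comparison."""
    payload = "|".join(f"{name}={attrs[name]}" for name in sorted(attrs))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def load_satellite(conn, hub_key, load_date, record_source, attrs) -> bool:
    """Append a satellite row if the attributes changed; never UPDATE or DELETE."""
    new_diff = hash_diff(attrs)
    latest = conn.execute(
        "SELECT hash_diff FROM sat_flight WHERE hub_key = ? "
        "ORDER BY load_date DESC LIMIT 1",
        (hub_key,),
    ).fetchone()
    if latest is not None and latest[0] == new_diff:
        return False  # nothing changed, so nothing is written
    conn.execute(
        "INSERT INTO sat_flight VALUES (?, ?, ?, ?, ?, ?)",
        (hub_key, load_date, record_source, new_diff, attrs["depart"], attrs["gate"]),
    )
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE sat_flight (
           hub_key TEXT, load_date TEXT, record_source TEXT, hash_diff TEXT,
           depart TEXT, gate TEXT,
           PRIMARY KEY (hub_key, load_date))"""
)

# Hypothetical loads for one flight: unchanged on day two, gate change on day three.
results = [
    load_satellite(conn, "F1", "2018-10-11", "LGA", {"depart": "1:25PM", "gate": "B27"}),
    load_satellite(conn, "F1", "2018-10-12", "LGA", {"depart": "1:25PM", "gate": "B27"}),
    load_satellite(conn, "F1", "2018-10-13", "LGA", {"depart": "1:25PM", "gate": "B30"}),
]
row_count = conn.execute("SELECT COUNT(*) FROM sat_flight").fetchone()[0]
```

Because earlier rows are never touched, the full history stays queryable and the load reduces to reads plus inserts.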
DATA WAREHOUSING & DATA VAULT 2.0
• 60’s, 70’s, 80’s:
• E.F. Codd => 3NF
• Bill Inmon invents Data Warehousing concept
• Dr. Ralph Kimball popularizes Star Schema design
• 90’s, 00’s:
• Dan Linstedt creates Data Vault Model @ DOD
• 2014:
• Dan Linstedt introduces Data Vault 2.0
Source: “What are Graph Databases and Why should I care?“, by Dave Bechberger of Expero
SOLVE BY STAR SCHEMA ?
RELATIONAL VS GRAPH DATABASES
• Enterprise Grade
• Well-worn path
• SQL has been relatively stagnant vs programming languages
GRAPH DATA MODEL
Source: https://neo4j.com/developer/graph-database/
GRAPH DATABASE VS DATA VAULT
GRAPH DATABASE VS DATA VAULT
SERVICED_BY
Flight
Record Source: Airport CAE
Load Date: 2018-11-17
Source Id: 20181117-32-983
Tabs: Base, Dest, Forecast
Record Source   LoadDate     Depart   Gate
LGA             2018-10-11   1:25PM   B27
CAE             2018-10-24   3:30PM   A14
SFO             2018-09-06   8:55PM   G19
RDU             2018-08-12   4:45PM   C22

Aircraft
Record Source: United Airlines
Load Date: 2018-01-17
Source Id: 2412c
Tabs: Base, Service, FAA, NTSB
Record Source   LoadDate     Model   Tailno
United          2017-02-11   767     1477
Delta           2015-11-04   A6      2381
Alaska          2013-08-28   747     8312
Frontier        2016-07-19   182     1438

SERVICED_BY
Record Source: United Airlines
Load Date: 2018-09-17
Tabs: Base, Dest, Manifest
Record Source   LoadDate     Begin        End
United          2017-02-11   2017-04-23   2017-09-23
Delta           2015-11-04   2015-12-01   2017-04-22
Alaska          2013-08-28   2013-09-14   2016-05-04
Frontier        2016-07-19   2016-08-02   2018-04-11

Legend: Hubs, Links, Satellites, Tab
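The correspondence this slide draws between a graph and a Data Vault can be sketched in a few tables: nodes become hubs, edges become links, and node properties become satellites. The table and column names below are illustrative assumptions, and the pairing of the sample satellite row with the sample hub row is made up for the example; only the values themselves come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hubs: one row per graph node, identified by its business key.
    CREATE TABLE hub_flight   (flight_key TEXT PRIMARY KEY, source_id TEXT,
                               load_date TEXT, record_source TEXT);
    CREATE TABLE hub_aircraft (aircraft_key TEXT PRIMARY KEY, source_id TEXT,
                               load_date TEXT, record_source TEXT);
    -- Link: one row per graph edge (SERVICED_BY) between two hubs.
    CREATE TABLE link_serviced_by (
        link_key TEXT PRIMARY KEY,
        flight_key TEXT REFERENCES hub_flight(flight_key),
        aircraft_key TEXT REFERENCES hub_aircraft(aircraft_key),
        load_date TEXT, record_source TEXT);
    -- Satellite: descriptive properties of a node, versioned by load date.
    CREATE TABLE sat_aircraft_base (
        aircraft_key TEXT REFERENCES hub_aircraft(aircraft_key),
        load_date TEXT, record_source TEXT, model TEXT, tailno TEXT,
        PRIMARY KEY (aircraft_key, load_date));
    """
)
conn.execute("INSERT INTO hub_flight VALUES ('F1', '20181117-32-983', '2018-11-17', 'Airport CAE')")
conn.execute("INSERT INTO hub_aircraft VALUES ('A1', '2412c', '2018-01-17', 'United Airlines')")
conn.execute("INSERT INTO link_serviced_by VALUES ('L1', 'F1', 'A1', '2018-09-17', 'United Airlines')")
conn.execute("INSERT INTO sat_aircraft_base VALUES ('A1', '2017-02-11', 'United', '767', '1477')")

# Traversing the SERVICED_BY edge becomes a join: flight -> link -> aircraft -> satellite.
edge = conn.execute(
    """
    SELECT f.source_id, a.source_id, s.model, s.tailno
    FROM link_serviced_by l
    JOIN hub_flight f        ON f.flight_key = l.flight_key
    JOIN hub_aircraft a      ON a.aircraft_key = l.aircraft_key
    JOIN sat_aircraft_base s ON s.aircraft_key = a.aircraft_key
    """
).fetchone()
```

A graph traversal of one hop turns into a join through the link table, which is the trade-off the relational form accepts in exchange for standard SQL tooling.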
• Organizations which design systems ...
are constrained to produce designs
which are copies of the communication
structures of these organizations
- Mel Conway
FLIGHT
Tabs: Base, Dest, Forecast
Record Source   LoadDate     Depart   Gate
LGA             2018-10-11   1:25PM   B27
CAE             2018-10-24   3:30PM   A14

FLIGHT
Record Source: Airport CAE
Load Date: 2018-11-17
Source Id: 20181117-32-983

Aircraft
Tabs: Base, Service, FAA, NTSB
Record Source   LoadDate     Model   Tailno
United          2017-02-11   767     1477
Delta           2015-11-04   A6      2381
Alaska          2013-08-28   747     8312
Frontier        2016-07-19   182     1438
Record Source: United Airlines
Load Date: 2018-01-17
Source Id: 2412c

Airport
Tabs: Base, Dest, Manifest
Record Source   LoadDate     Begin        End
United          2017-02-11   2017-04-23   2017-09-23
Delta           2015-11-04   2015-12-01   2017-04-22
Alaska          2013-08-28   2013-09-14   2016-05-04
Frontier        2016-07-19   2016-08-02   2018-04-11
Record Source: United Airlines
Load Date: 2018-09-17

Airline
Tabs: Base, Service, FAA, NTSB
Record Source   LoadDate     Model   Tailno
United          2017-02-11   767     1477
Delta           2015-11-04   A6      2381
Record Source: United Airlines
Load Date: 2018-01-17
Source Id: 2412c

Legend: Hubs, Links, Satellites, Tab
Source: https://www.wherescape.com/solutions/project-types/data-vault-automation/
• Modeled after self-organizing networks
• A Business Key identifies a key concept in the business.
• They have a business meaning
• They are unique and have a very low propensity to change
• Business keys change only when the business changes
• Enables (forces) cross-source modeling
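The cross-source point above can be illustrated with a small sketch. Data Vault 2.0 commonly derives a hub's key by hashing the normalized business key, so two systems that know the same business concept land on the same hub row. The function names, the normalization rule (trim and upper-case), the in-memory hub, and the second source and its load date are assumptions made for this example.

```python
import hashlib


def hub_hash_key(*business_key_parts: str) -> str:
    """Derive a deterministic hub key from the normalized business key."""
    normalized = "|".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


hub_aircraft = {}  # hash key -> hub row; stands in for a hub table


def load_hub(tail_number: str, record_source: str, load_date: str) -> str:
    """Insert the business key once; later sources reuse the existing row."""
    key = hub_hash_key(tail_number)
    hub_aircraft.setdefault(
        key,
        {"tailno": tail_number.strip().upper(),
         "record_source": record_source,
         "load_date": load_date},
    )
    return key


# Two hypothetical sources deliver the same tail number with different formatting.
k1 = load_hub("1477", "United Airlines", "2018-01-17")
k2 = load_hub(" 1477 ", "FAA", "2018-02-01")
```

Both loads resolve to one hub row, and the row keeps the record source that first introduced the key. Each source can then hang its own satellite off that shared hub.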
Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
DATA VAULT 2.0 MODELING:
HUBS, LINKS & SATELLITES
@wiresoft/Pathfinder
THE DATA
Impressions vs Business Data
ENTERPRISE DATA SILOS
Small Data
Large Data
Big Data
Describes the user base
Describes the Enterprise
Describes the Product
Instance Grain
Transaction Grain
Audit Grain
Impression Grain
Big Data
Enterprise Data Warehouse
Operational Data Stores
Impression Analytics
Business Analytics
External Data Sources
DATA GRANULARITY FUNNEL
DATA INGESTION
ETL vs ELT vs SerDe
ETL VS ELT VS SerDe
• Beware the Turing tar-pit, in which
everything is possible, but nothing
of interest is easy
- Alan Perlis
DATA CLASSIFICATION MATRIX: DECLARATIVE VS INTERPRETIVE
[Matrix, labels recovered from the slide: columns Declarative and Interpretive; entries Hadoop, RDBMS, Web Events, Media Player]
DATA WAREHOUSING
• Deep Topic
• 60’s, 70’s, 80’s:
• E.F. Codd => 3NF
• Bill Inmon invents Data Warehousing concept
• Dr. Ralph Kimball popularizes Star Schema design
• 90’s, 00’s:
• Dan Linstedt creates Data Vault Model @ DOD
• 2014:
• Dan Linstedt introduces Data Vault 2.0
• Data Warehouse vs Operational Data Stores
• Data Warehouse as Version Control System
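The version-control analogy can be made concrete with an "as of" lookup: because history is appended rather than overwritten, any past state can be checked out by date, much like checking out an old commit. The sketch below is a simplified illustration; the history list, its values, and the as_of helper are assumptions for the example, and ISO-formatted date strings are used so that string order matches date order.

```python
from bisect import bisect_right

# Append-only history for one flight, ordered by load date (illustrative values).
history = [
    ("2018-10-11", {"depart": "1:25PM", "gate": "B27"}),
    ("2018-10-24", {"depart": "3:30PM", "gate": "A14"}),
]


def as_of(rows, date):
    """Return the attributes that were current on the given date, or None."""
    load_dates = [load_date for load_date, _ in rows]
    idx = bisect_right(load_dates, date)  # rows loaded on or before `date`
    return rows[idx - 1][1] if idx else None


before_first_load = as_of(history, "2018-01-01")
mid_october = as_of(history, "2018-10-15")
latest = as_of(history, "2018-12-31")
```

An operational store would only be able to answer the last of these three questions, since it keeps the current state and discards the rest.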
BIG DATA
• MapReduce, 2004, Google: Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”; GFS
• Nutch 2005, Hadoop 2006, 2007 - Doug Cutting
• What exactly is “Big Data”?
UNSTRUCTURED USER EXPERIENCE
[Diagram: Client, User, Interpreter, Analysis; labels L, Ln, Li; marked lossy]
STRUCTURED USER EXPERIENCE
[Diagram: Client, User, Time Series Event Record, Analysis; labels Lp, Lp, Le; marked lossless]
ETL OR SERDE?
[Diagram: Client, User, Serializer, Internet, Kafka/Kinesis, S3, Hadoop, Deserializer, Time Series Event Record, Analysis; files Eventlog.e and Eventlog.d; Single Source (Version Locked); labels Lp, Le, Ld, Lm]
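One reading of the "Single Source (Version Locked)" label is that the serializer on the client and the deserializer on the analytics side are both derived from one schema definition, so an event survives the trip without loss. The sketch below illustrates that reading only; EVENT_SCHEMA, its field names, the version number, and the JSON wire format are assumptions, not details taken from the slide.

```python
import json

# Single source of truth shared by both ends (hypothetical schema).
EVENT_SCHEMA = {"version": 3, "fields": ("user_id", "event", "ts")}


def serialize(event: dict) -> str:
    """Client side: emit a compact, versioned record in schema field order."""
    missing = [name for name in EVENT_SCHEMA["fields"] if name not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    values = [event[name] for name in EVENT_SCHEMA["fields"]]
    return json.dumps([EVENT_SCHEMA["version"]] + values)


def deserialize(line: str) -> dict:
    """Analytics side: rebuild the event, refusing records of another version."""
    version, *values = json.loads(line)
    if version != EVENT_SCHEMA["version"]:
        raise ValueError(f"schema version mismatch: {version}")
    return dict(zip(EVENT_SCHEMA["fields"], values))


event = {"user_id": "u42", "event": "play", "ts": "2018-11-17T13:25:00Z"}
wire = serialize(event)
roundtrip = deserialize(wire)
```

Since both functions read the same schema object, a field added on one side cannot silently go missing on the other; a version mismatch fails loudly instead.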
ETL vs ELT (SerDe)
Source: https://www.ironsidegroup.com/2015/03/01/etl-vs-elt-whats-the-big-difference/
Schema On Write
Schema On Read
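The difference between the two can be sketched in a few lines. With schema-on-write, a record is validated before it is stored and bad records are rejected at load time. With schema-on-read, raw records are stored untouched and the schema is applied only when they are queried. The SCHEMA definition, function names, and sample records below are assumptions for illustration.

```python
import json

SCHEMA = {"tailno": str, "model": str}  # hypothetical target schema


def write_with_schema(table: list, record: dict) -> None:
    """Schema-on-write: validate first, store only conforming records."""
    for name, expected_type in SCHEMA.items():
        if not isinstance(record.get(name), expected_type):
            raise TypeError(f"{name} must be {expected_type.__name__}")
    table.append({name: record[name] for name in SCHEMA})


def read_with_schema(raw_lines):
    """Schema-on-read: interpret raw records at query time, tolerating gaps."""
    for line in raw_lines:
        doc = json.loads(line)
        yield {name: str(doc[name]) if name in doc else None for name in SCHEMA}


warehouse = []
write_with_schema(warehouse, {"tailno": "1477", "model": "767"})
try:
    write_with_schema(warehouse, {"tailno": 1477})  # wrong type, no model
    rejected = False
except TypeError:
    rejected = True

# The lake accepts anything; structure is imposed only when reading.
lake = ['{"tailno": 1477, "model": "767"}', '{"tailno": "2381", "extra": true}']
rows = list(read_with_schema(lake))
```

The write path pays the validation cost once and guarantees clean rows; the read path never rejects data but pushes interpretation, and its cost, onto every query.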
OTHER CHALLENGES
• Satellites must be loaded chronologically
• Time-based scheduling vs data-availability scheduling
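Both challenges can be sketched together. Under data-availability scheduling a load date is released only once every expected source has delivered, and releasing stops at the first incomplete date so that satellites are still appended in chronological order. The function names, the structure of the arrived mapping, and the sample rows are assumptions for this example.

```python
def ready_batches(expected_sources: set, arrived: dict) -> dict:
    """Release load dates in order, stopping at the first incomplete one."""
    ready = {}
    for load_date in sorted(arrived):
        by_source = arrived[load_date]
        if not expected_sources.issubset(by_source):
            break  # an earlier date is incomplete, so later dates must wait
        ready[load_date] = [row for source in sorted(by_source)
                            for row in by_source[source]]
    return ready


def load_in_order(satellite: list, batches: dict) -> None:
    """Append batches chronologically; refuse anything older than what is loaded."""
    for load_date, rows in sorted(batches.items()):
        if satellite and load_date <= satellite[-1][0]:
            raise ValueError(f"batch {load_date} is not newer than {satellite[-1][0]}")
        satellite.extend((load_date, row) for row in rows)


# Hypothetical arrivals: the last date is still waiting on Delta.
arrived = {
    "2018-10-24": {"United": ["u2"], "Delta": ["d2"]},
    "2018-10-11": {"United": ["u1"], "Delta": ["d1"]},
    "2018-11-17": {"United": ["u3"]},
}
satellite = []
load_in_order(satellite, ready_batches({"United", "Delta"}, arrived))
```

A purely time-based schedule would have fired on 2018-11-17 regardless and loaded a partial batch; here that date simply waits until the missing source arrives.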
QUESTIONS?
• Contact:
• Dean Hallman
• rdhallman@gmail.com
• Linkedin: https://www.linkedin.com/in/dean-hallman/