When and How Data
Lakes Fit into a Modern
Data Architecture
Presented by: William McKnight
“#1 Global Influencer in Data Warehousing” Onalytica
President, McKnight Consulting Group
An Inc. 5000 Company in 2018 and 2017
@williammcknight
www.mcknightcg.com
(214) 514-1444
Second Thursday of Every Month, at 2:00 ET
#AdvAnalytics
William McKnight
President, McKnight Consulting Group
• Frequent keynote speaker and trainer internationally
• Consulted to Pfizer, Scotiabank, Fidelity, TD Ameritrade, Teva
Pharmaceuticals, Verizon, and many other Global 1000 companies
• Hundreds of articles, blogs, benchmarks and white papers in
publication
• Focused on delivering business value and solving business problems
utilizing proven, streamlined approaches to information management
• Former Database Engineer, Fortune 50 Information Technology
executive and Ernst & Young Entrepreneur of the Year Finalist
• Owner/consultant: 2018 & 2017 Inc. 5000 Data Strategy and
Implementation consulting firm
• Brings 25+ years of information management and DBMS experience
McKnight Consulting Group Offerings
Strategy
§ Trusted Advisor
§ Action Plans
§ Roadmaps
§ Tool Selections
§ Program Management
Training
§ Classes
§ Workshops
Implementation
§ Data/Data Warehousing/Business Intelligence/Analytics
§ Master Data Management
§ Governance/Quality
§ Big Data
Analytic Data Stores
3 Major Decisions
• Decision #1: The Data Store Type
– The largest factor for distinguishing between databases and file-based scale-out system
utilization is the data profile. The latter is best for data that fits the loose label of 'unstructured'
(or semi-structured) data, while more traditional data -- and smaller volumes of all data -- still
belong in a relational database.
• Decision #2: Data Store Placement
– You must also decide where to place your data store -- on-premises or in the cloud (and which
cloud). In the past, the only clear choice for most organizations was on-premises data. However,
the costs of scale are gnawing away at the notion that this remains the best approach for a data
platform. For more on why databases are moving to the cloud, please read this article.
• Decision #3: The Workload Architecture
– Finally, you must keep in mind the distinction between operational and analytical workloads.
Short transactional requests and more complex (often longer) analytics requests demand
different architectures. Analytics databases, though quite diverse, are the preferred platforms
for the analytics workload.
Whither the idea of the Data Warehouse?
[Diagram: three-tier data warehouse architecture.
TIER 1, Intake: Export Files, Txn App Data; Full, Delta, Stream; Structured, Big Data.
TIER 2, Distribution: Conformed Dimensions, Common Summary and Derived Values, Master Data, Reference Data Hub, Transaction Data Hub.
TIER 3, Access 1..n: Regional and Departmental Views, ADS, Applications & Engines, Operational Analytics & Hot Views, Data Marts (Independent, Dependent), Relational Data.]
Data Warehousing
• Data Warehouses (still) have a
lower total cost of ownership than
data marts
• A data warehouse is a SHARED
platform
– Build once, use many
– Access at Data Warehouse
– Access by creating a mart off the DW
• Still A LOT cheaper than building from
scratch
“… a subject-oriented, integrated, non-volatile, time-variant collection of
data, organized to support management needs.” (Bill Inmon)
Reasons for Analytic Architecture Change
• Take Advantage Of…
– Cloud Databases
– Get into a Columnar Data Orientation
– Get into the Data Architecture you want
– Cloud Storage
• Projects Requiring Consolidated Data
The Key is Right-Fitting Platforms
• THE Data Warehouse
– Value-Added Components: Modeling for Access,
Data Quality, Tooling, Conformed Dimensions,
Data Governance, Etc.
• A Dependent Data Mart (Fed from the Data
Warehouse)
• A Data Lake
• A Big Data Cluster
• An Independent Data Mart
• An Operational Hub
• An Operational Data Lake
Sensible Divisions of Analytic Platforms
[Chart: Data Lake, Data Warehouse, and Data Mart placed on two axes, Usage Understanding by the Builders and Data Cultivation.]
The Post-Operational Ecosystem
[Diagram labels: Data Lake; data warehouse (DW); two data marts (DM).]
What If?
[Chart: a combined Data Warehouse/Lake and a Data Mart placed on two axes, Usage Understanding by the Builders and Data Cultivation.]
Deploying the Data Lake
Data Lake: Data Scientist Workbench and Data Warehouse Staging
[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain, MDM, …); Stream or Batch Updates; DI; Data Lake; Data Scientists; Data Warehouse; Data Mart; Real-Time, Event-Driven Apps.]
Data Lake Patterns
• Data Refinery
– Do Data Warehouse ETL in the Data Lake
• Archive Storage
• Data Science Lab
• [Data Lake as the Data Warehouse]
Data Lake Example Components
[Diagram labels, grouped by stage:
Data Sources: Files, RDBMS, Streaming
Ingest: Kafka, Pulsar, Snowball, Kinesis
Central Data Store: Hadoop, Cloud Storage
Process: EMR, Glue
Governance
Catalog & User Interface: DynamoDB, ElasticSearch, Web Interface
Access Management: API Gateway, IAM & Cognito
Analyze: Python, R, Machine Learning, QuickSight]
Data Lake Setup
• Managed deployments in the Hadoop
family of products
• External tables in Hive metastore that point
at cloud storage (Amazon S3, Google
Cloud Storage, Azure Data Lake Storage
Gen 2)
– To run SQL against the data
– HiveQL and Spark SQL require entries in the
metastore
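A minimal sketch of what such a metastore entry could look like, generated from Python. The table name, columns, and bucket path are hypothetical, and the `s3://` scheme is an assumption (some Hadoop distributions expect `s3a://`). The function only builds the HiveQL text; running it against Hive or Spark SQL is left to whichever client you use.

```python
def external_table_ddl(table, columns, location,
                       partition_cols=None, file_format="PARQUET"):
    """Build a HiveQL CREATE EXTERNAL TABLE statement as a string.

    columns and partition_cols are lists of (name, hive_type) pairs.
    location is the cloud storage path that holds the data files.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)"
    if partition_cols:
        parts = ", ".join(f"`{name}` {dtype}" for name, dtype in partition_cols)
        ddl += f"\nPARTITIONED BY ({parts})"
    ddl += f"\nSTORED AS {file_format}\nLOCATION '{location}'"
    return ddl


# Hypothetical table and bucket, for illustration only.
ddl = external_table_ddl(
    "sales.orders",
    [("order_id", "BIGINT"), ("amount", "DECIMAL(12,2)")],
    "s3://example-bucket/lake/orders/",
    partition_cols=[("order_date", "STRING")],
)
print(ddl)
```

Because the table is EXTERNAL, dropping it removes only the metastore entry and leaves the files in cloud storage untouched.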
Object Storage Instances
• Object Storage instances/clusters have local
storage (HDFS and Hive) on the physical drives mounted to
the instances themselves
• Object Storage technologies access their
cloud vendor’s respective cloud storage, viz.:
– Amazon EMR accesses S3
– Dataproc accesses Google Cloud Storage
– HDInsight (HDI) accesses Azure Data Lake Storage Gen2
• Local storage is used by the Object Storage
platform for housekeeping
The Data Warehouse of the Future
• Pair a lake with an analytical engine that
charges only by what you use
• If you have a ton of data that can sit in cold
storage and only needs to be accessed or
analyzed occasionally, store it in Amazon
S3/Azure Blob Storage/Google Cloud Storage
– Use a database (on-premises or in the cloud) that
can create external tables that point at the storage
– Analysts can query directly against it, or draw down
a subset for some deeper/intensive analysis
– The GB/month storage fee plus data
transfer/egress fees will be much cheaper than
leaving it in a data warehouse
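The cost comparison above can be checked with simple arithmetic. The sketch below uses placeholder rates and volumes that are assumptions for illustration, not vendor prices; substitute figures from your own price list. It also computes the egress volume at which the lake would stop being the cheaper option.

```python
def lake_monthly_cost(stored_gb, egress_gb, storage_rate, egress_rate):
    """Monthly storage fee plus data transfer/egress fee."""
    return stored_gb * storage_rate + egress_gb * egress_rate


def warehouse_monthly_cost(stored_gb, warehouse_rate):
    """Monthly cost of keeping the same data inside the warehouse."""
    return stored_gb * warehouse_rate


def break_even_egress_gb(stored_gb, storage_rate, egress_rate, warehouse_rate):
    """Egress volume per month at which both options cost the same."""
    return stored_gb * (warehouse_rate - storage_rate) / egress_rate


# Placeholder figures (hypothetical), not taken from any price list.
STORED_GB = 50_000
EGRESS_GB = 500         # subset drawn down for deeper analysis
STORAGE_RATE = 0.02     # per GB per month in cloud storage
EGRESS_RATE = 0.10      # per GB transferred out
WAREHOUSE_RATE = 0.25   # per GB per month in the warehouse

lake = lake_monthly_cost(STORED_GB, EGRESS_GB, STORAGE_RATE, EGRESS_RATE)
warehouse = warehouse_monthly_cost(STORED_GB, WAREHOUSE_RATE)
limit = break_even_egress_gb(STORED_GB, STORAGE_RATE, EGRESS_RATE, WAREHOUSE_RATE)

print(f"lake: {lake:.2f}  warehouse: {warehouse:.2f}")
print(f"lake stays cheaper while monthly egress is below {limit:.0f} GB")
```

The break-even figure matters because the claim depends on occasional access: data that is read out in bulk every month can erase the storage savings.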
Notes on the Data Warehouse of the Future
• A separate compute and storage architecture is more achievable
• Compute resources (MapReduce, Hive, Spark, etc.) can be
taken down, scaled up or out, or interchanged without data
movement
• Storage can be centralized, but compute can be distributed
• Major players have mechanisms to ensure consistency and achieve
ACID-like compliance
• Remote data replication ensures redundancy and recovery
• Most query execution time is processing, not data
transport, so if cloud compute and storage are in the same
cloud vendor region, performance is hardly impacted
Sample Cluster Configuration
Google BigQuery
Cloud Provider: Google Cloud
Platform Version: 3.6
Hadoop Version: 2.7.3
Hive Version: 1.2.1
Spark Version: 2.3.2
Instance Type: n1-highmem-16
Head/Master Nodes: 1
Worker Nodes: 16 and 32
vCPUs (per node): 16
RAM (per node): 104 GB
Compute Cost (per node per hour): $0.947
Platform Premium (per node per hour): $0.160
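As a rough worked example, the per-node rates above can be rolled up into an hourly cluster cost. The sketch assumes that both the compute cost and the platform premium apply to every node, head and worker alike; the slide does not say so, and billing may differ in practice.

```python
COMPUTE_COST = 0.947      # USD per node per hour, from the configuration above
PLATFORM_PREMIUM = 0.160  # USD per node per hour, from the configuration above
HEAD_NODES = 1


def cluster_cost_per_hour(worker_nodes, head_nodes=HEAD_NODES):
    """Hourly cost, assuming both charges apply to every node in the cluster."""
    nodes = worker_nodes + head_nodes
    return round(nodes * (COMPUTE_COST + PLATFORM_PREMIUM), 3)


# The two worker counts listed in the sample configuration.
for workers in (16, 32):
    print(f"{workers} workers: ${cluster_cost_per_hour(workers):.3f} per hour")
```

Under this assumption, doubling the worker count does not quite double the bill, because the single head node is a fixed cost in both configurations.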
Tips
• If possible, configure remote data to be stored in Parquet format, as
opposed to comma-separated or other text formats
• As new data sources are added to cloud storage, use a code
distribution system, such as GitHub, to distribute new table definitions
to distributed teams
• Use data partitioning to improve performance, but don’t forget that new
partitions have to be declared to the Hive metastore when they are
added to the data
• Co-locate compute and storage in the same region
• Use AES-256 encryption on cloud storage buckets to ensure encryption
at rest
• Hold the remotely stored data to the same governance and data
quality standards you would if it were on-premises. Consider a data
catalog or other metadata technique to keep the data organized and
easy to find for new compute engines
• Drop commonly used data in the lake, like master data from MDM
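The partitioning tip is easy to overlook in practice, so here is a minimal sketch of declaring new partitions to the Hive metastore. The table name, partition column, and bucket path are hypothetical, and the `column=value` directory layout is an assumption about how the files were written. The function only generates HiveQL text.

```python
def add_partition_statements(table, partition_col, values, base_location):
    """Build one ALTER TABLE ... ADD PARTITION statement per new partition value.

    Assumes files are laid out as <base_location>/<partition_col>=<value>/.
    """
    base = base_location.rstrip("/")
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition_col}='{value}') "
        f"LOCATION '{base}/{partition_col}={value}/'"
        for value in values
    ]


# Hypothetical table and bucket, for illustration only.
stmts = add_partition_statements(
    "sales.orders",
    "order_date",
    ["2019-01-01", "2019-01-02"],
    "s3://example-bucket/lake/orders/",
)
for stmt in stmts:
    print(stmt + ";")
```

Hive also offers `MSCK REPAIR TABLE` to discover partitions that follow this directory layout, which may be simpler when many partitions arrive at once. Keeping generated statements like these in the same repository as the table definitions gives distributed teams one place to pick up both.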
The Data Science Lab Role of
the Data Lake
Artificial Intelligence and Machine Learning
• Looming on the horizon is an injection of
AI/ML into every piece of software
• Consider the domain of data integration
– Predicting with high accuracy the steps ahead
– Fixing its bugs
• Machine learning is being built into databases
so the data will be analyzed as it is loaded
– E.g., Python with TensorFlow and Scala on Spark
• The split of the necessary AI/ML between the
"edge" of corporate users and the software
itself is still to be determined
Training Data for Machine Learning &
Artificial Intelligence
• You must have enough data to analyze to
build models
• Your data determines the depth of AI you
can achieve -- for example, statistical
modeling, machine learning, or deep
learning -- and its accuracy
AI Data
• Call center recordings and chat logs
• Streaming sensor data, historical maintenance records and
search logs
• Customer account data and purchase history
• Email response metrics
• Product catalogs and data sheets
• Public references
• YouTube video content audio tracks
• User website behaviors
• Sentiment analysis, user-generated content, social graph
data, and other external data sources
