Week 2: Data Roles and Data Platform Use Cases (v1, S25)
Unless otherwise stated, this presentation draws on study material from Microsoft Azure Learn, AWS documentation, and Snowflake Academic Courses.
• Recap of last week
• A closer look at data lakes
• Azure Data Engineering: Lab 1
Data architectures
Need for Design Principles for Data Pipelines
Changing business and technology landscape
Application Evolution
• Data silos
• Relational databases
• Non-relational databases
• Data lakes
• Cloud-based purpose-built data stores
Data Lakes
Data Lakehouse
Data Mesh
Data Fabric
Data Evolution
• Hierarchical databases are too rigid for complex data relationships
• Application databases are overburdened (response: data warehouses and OLTP vs. OLAP databases)
• The internet's data variety doesn't perform well in relational schemas
• Relational databases cannot scale effectively for analytics and AI/ML (response: big data systems)
• Big data and AI/ML need to store huge volumes of unstructured and semistructured data
• Big data systems can't keep up with demands for real-time analysis (response: Lambda architecture and streaming solutions)
• Cloud microservices increase demand for data stores that are matched to data type and function
Sources of Data
Business or Enterprise Application | HRMS, ERP, CRM, PPM, EMR | Structured | Low | Mid | Mid | Low
Documents | PDF, XLS, JSON and so on | Unstructured | Mid | Low | Low | Mid
Data Storage | File streams, NoSQL, R-ORDBMS | Structured/Hybrid | Low | Mid | High | Mid
© 2022, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
[Figure: cycle linking data sources, predictions and decisions, evaluating results and iterating, and additional data]
Data types
[Figure labels: more flexible; "colder"]
Operational data workloads
1. Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis
2. Data is loaded into a schema of tables, typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or a data warehouse with a fully relational SQL engine
3. Data in tables may be aggregated and loaded into an online analytical processing (OLAP) model, or cube
4. The files in the data lake, relational tables, and analytical model can be queried to produce reports and dashboards
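The four steps above can be tried on a toy scale. The following is a minimal sketch, assuming an in-memory SQLite database stands in for the warehouse and a Python list stands in for files in the data lake. The sample records and the names `orders` and `sales_by_product` are hypothetical and are not taken from the course material.

```python
import sqlite3

# Step 1: operational records as they might arrive from a source system
# (hypothetical sample data standing in for files in a data lake).
raw_orders = [
    {"order_id": 1, "product": "widget", "qty": "2", "unit_price": "9.50"},
    {"order_id": 2, "product": "gadget", "qty": "1", "unit_price": "20.00"},
    {"order_id": 3, "product": "widget", "qty": "4", "unit_price": "9.50"},
]


def transform(record):
    """Cast text fields to numeric types and derive the order total."""
    qty = int(record["qty"])
    price = float(record["unit_price"])
    return (record["order_id"], record["product"], qty, qty * price)


def build_warehouse(records):
    conn = sqlite3.connect(":memory:")
    # Step 2: load the transformed data into a schema of tables.
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, product TEXT, qty INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [transform(r) for r in records],
    )
    # Step 3: pre-aggregate into a summary table (a stand-in for an OLAP model).
    conn.execute(
        "CREATE TABLE sales_by_product AS "
        "SELECT product, SUM(qty) AS units, SUM(total) AS revenue "
        "FROM orders GROUP BY product"
    )
    return conn


conn = build_warehouse(raw_orders)
# Step 4: query the aggregate to feed a report or dashboard.
report = conn.execute(
    "SELECT product, units, revenue FROM sales_by_product ORDER BY product"
).fetchall()
print(report)
```

In a real platform each step would run on a separate service; the sketch only shows how the stages hand data to one another.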
Pipeline stages: Data sources, Ingest, Store, Process, Analyze
• Clean and transform data that enters the pipeline
• Preserve the integrity of the data as it's combined with other sources, transformed, processed, and analyzed
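Cleaning data on entry and preserving its integrity between stages can be illustrated with a short example. This is a sketch under stated assumptions: the field names (`customer`, `amount`), the cleaning rules, and the checksum approach are all hypothetical choices for illustration, not a method prescribed by the source material.

```python
import hashlib


def clean(record):
    """Normalize one incoming record; return None if it cannot be repaired."""
    name = (record.get("customer") or "").strip().title()
    try:
        amount = round(float(record.get("amount")), 2)
    except (TypeError, ValueError):
        return None  # amount is missing or not numeric
    if not name:
        return None  # customer name is missing
    return {"customer": name, "amount": amount}


def fingerprint(records):
    """Order-independent checksum of a batch of records."""
    digests = sorted(
        hashlib.sha256(repr(sorted(r.items())).encode("utf-8")).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()


# Hypothetical incoming batch with two bad records.
incoming = [
    {"customer": "  ada lovelace ", "amount": "120.50"},
    {"customer": "grace hopper", "amount": "n/a"},
    {"customer": "", "amount": "10"},
    {"customer": "alan turing", "amount": "75"},
]

cleaned = [c for c in (clean(r) for r in incoming) if c is not None]
rejected = len(incoming) - len(cleaned)

# Record a checksum after ingest, then verify it after a later stage.
checksum_after_ingest = fingerprint(cleaned)
stored = list(reversed(cleaned))  # stand-in for a store/process step
integrity_ok = fingerprint(stored) == checksum_after_ingest
print(cleaned, rejected, integrity_ok)
```

Comparing a checksum taken at ingest with one taken downstream is one simple way to detect records that were dropped or altered along the way.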
Source: https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
Data Platforms
Understanding foundations
A data platform should support data engineers, data analysts, and data operations professionals
Data professionals work with multiple types of data to perform a variety of data operations using a range of tools and scripting languages.
• Types of data: Structured, Semi-structured, Unstructured
• Data operations: Integration, Transformation, Consolidation, Analysis
• Languages: SQL (SELECT…), Python (df=spark.read(…)), R, Java, Scala, .NET, Others
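To make the difference between structured and semi-structured data concrete, here is a minimal sketch that queries a table with SQL, flattens JSON documents with Python, and then consolidates the two. It uses only the Python standard library rather than Spark, and the table, fields, and sample values are hypothetical.

```python
import json
import sqlite3

# Structured data: fixed columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)
structured_rows = conn.execute(
    "SELECT id, name FROM customers WHERE city = 'London'"
).fetchall()

# Semi-structured data: JSON documents whose fields vary per record.
documents = json.loads(
    '[{"id": 1, "tags": ["vip"]},'
    ' {"id": 2, "contact": {"email": "grace@example.com"}}]'
)


def flatten(doc):
    """Transformation: map a variable-shape document onto fixed fields."""
    return {
        "id": doc["id"],
        "tags": ",".join(doc.get("tags", [])),
        "email": doc.get("contact", {}).get("email"),
    }


flattened = [flatten(d) for d in documents]

# Consolidation: combine both sources on the shared id.
names = dict(conn.execute("SELECT id, name FROM customers"))
consolidated = [{**f, "name": names[f["id"]]} for f in flattened]
print(structured_rows)
print(consolidated)
```

Fields that are absent from a document come through as empty or `None` after flattening, which is typically where data quality rules have to be decided.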
• Data lake: analytical data stored in files; distributed storage for massive scalability
• Data warehouse: analytical data stored in a relational database; typically modeled as a star schema to optimize summary analysis
• Apache Spark: open-source engine for distributed data processing
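A star schema places a central fact table of measurements alongside dimension tables that describe them, so summary queries reduce to joins plus a GROUP BY. The following is a minimal sketch using SQLite; the table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the "who/what/when" of each measurement.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
    -- The fact table holds the numeric measures plus keys into each dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Helmets');
    INSERT INTO dim_date VALUES (20240101, 2024), (20250101, 2025);
    INSERT INTO fact_sales VALUES
        (1, 20240101, 500.0),
        (1, 20250101, 700.0),
        (2, 20250101, 50.0),
        (2, 20250101, 25.0);
    """
)

# Summary analysis: total sales by category and year.
summary = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.amount) AS total
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
print(summary)
```

Adding a new way to slice the data usually means adding a dimension table and a key column on the fact table, without reshaping the existing measures.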
Break (5 min)
Key design considerations
• Scalable data lake
• Performant and cost-effective components
• Seamless data movement
• Unified governance
Key design considerations
• Seamless data movement
• Unified governance
[Figure labels: Lake Formation, AWS Glue]
• Depends on the application and business use cases
• Depends on OLTP and OLAP data workloads
Microsoft Azure
• Operational data
• Data ingestion/ETL
• Analytical data storage and processing
• Data modeling and visualization
Azure Databricks
Azure Data Factory
Source: Analytics end-to-end with Azure Synapse - Azure Architecture Center | Microsoft Learn
Matching ingestion services to variety, volume, and velocity
Ingest
• Sources: SaaS apps; OLTP, ERP, and CRM business applications; file shares; on-premises data with limited connectivity
• Services: Azure Data Factory; Azure Synapse Analytics pipelines; Azure File Sync; Azure Event Hubs
Store and Manage Enterprise Data
Azure Synapse Analytics
Built-in integration
Ingest, Process
Azure Data Lake Storage Gen2
Source: Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn
Data Lake Zones/Layers (continued)
• Azure Stream Analytics
• Elastic Jobs on Azure
• Microsoft Purview
• Azure Data Factory
Break (5 min)
First Lab
Lab 1: Explore Synapse
Azure Synapse is an enterprise analytics service that accelerates time to insight across
data warehouses and big data systems. Azure Synapse brings together the best of SQL
technologies used in enterprise data warehousing, Spark technologies used for big
data, Data Explorer for log and time series analytics, Pipelines for data integration and
ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB,
and AzureML.
Let’s go to Skillable!
Next Week
Thank you