
Week 2: Data Roles, Data Platform Use Cases v1 S25

The document outlines the structure and principles of business analytics platforms, focusing on data architecture, data lakes, and the evolution of application architectures. It emphasizes the importance of designing data pipelines to support data-driven decision-making and highlights various data types and storage solutions. Additionally, it introduces key concepts related to operational and analytical data, as well as modern data architecture considerations using platforms like Azure and AWS.


BUAN6335

Organizing for Business Analytics Platforms


Week 3
Prof. Mandar Samant

Unless otherwise stated, this presentation refers to study material from Microsoft Azure Learn, AWS documentation, and Snowflake academic courses.
• Recap from last week
• Close look at the data lake
• Azure Data Engineering – Lab 1

Recap: Last Week

Data architectures
Need for design principles for data pipelines
Changing business and technology landscape

• Data analysis and forecasting are needed NOW!
• More layered, pay-as-you-go options were sought, to expand and compress based on the needs of an enterprise.
• Everything is connected! A connected world created more need to handle, manage, and analyze data.
• All such factors contributed to bringing cloud and data closer, and even making them inseparable, in the past few years.
Application architecture evolution led to a data revolution over the decades

Application Evolution
• 70s-80s: Mainframe – centralized
• 90s: Client server – distributed
• 2000: Internet – global, 3-tier
• 2010: Cloud – IaaS/PaaS/SaaS
• 2020: Microservices – granular/reusable

Data Evolution
• Data silos: hierarchical databases are too rigid for complex data relationships
• Relational databases, data warehouses, and OLTP vs. OLAP: application databases are overburdened
• Non-relational databases and big data systems: the internet's data variety doesn't perform well in relational schemas, and relational databases cannot scale effectively for analytics and AI/ML
• Data lakes, Lambda architecture, and streaming solutions: big data and AI/ML increase demand for huge volumes of unstructured and semistructured data, and big data systems can't keep up with demands for real-time analysis
• Cloud-based purpose-built data stores, data lakehouse, data mesh, data fabric: cloud microservices need data stores that are matched to data type and function
Sources of Data

| Source | Example | Type | Complexity | Velocity | Volume | Variety |
|---|---|---|---|---|---|---|
| Business or Enterprise Application | HRMS, ERP, CRM, PPM, EMR | Structured | Low | Mid | Mid | Low |
| Documents | PDF, XLS, JSON, and so on | Unstructured | Mid | Low | Low | Mid |
| Collaboration Systems/Public Web sites | Emails, Slack, Teams, govt sites, business sites | Unstructured | Mid | Mid | High | Mid |
| Media | Videos, audio files, images | Unstructured | High | High | High | High |
| Social Networks | LinkedIn, Twitter, TikTok, Instagram | Unstructured | High | High | High | Mid |
| Data Storage | File streams, NoSQL, RDBMS/ORDBMS | Structured/Hybrid | Low | Mid | High | Mid |
| Log files | Application, events, transactions, clickstream | Unstructured | Mid | High | High | High |
| Sensor Data/IoT Data | Medical devices, household devices, security systems, flight systems | Unstructured | Low | High | High | High |
The Data Pipeline

Infrastructure for data-driven decisions for the data-driven organization

© 2022, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Work BACKWARDS to design your infrastructure

Weigh the trade-offs of cost, speed, and accuracy.

1. What decision are you trying to make?
2. What data do you need to support this?

© 2022, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Iterative processing through the pipeline

1. Data sources
2. Predictions and decisions
3. Evaluate results and iterate, bringing in additional data as needed

© 2022, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Data – A closer look

Strategies that support getting the best value from data

• Confirm available data meets the need.
• Do not aim for all the data; start with what is needed for your intended use case.
• Evaluate the feasibility of acquiring data.
• Match pipeline design to data.
• Balance throughput and cost.
• Catalog data and metadata.
• Decide on and implement governance.
• Let users focus on the business.

Data types

Ordered from easier to use ("hotter") to more flexible ("colder"):

Structured
• Rows and columns
• Well-defined schema
• Example: relational database tables

Semistructured
• Elements and attributes
• Self-describing structure
• Examples: CSV, JSON, XML

Unstructured
• Files with no predefined structure
• Examples: images, movies, clickstream data
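The distinctions above can be sketched in a few lines of Python. This is an illustrative example using stdlib modules only; the record contents are invented for the demo.

```python
import csv
import io
import json

# Structured: rows and columns with a well-defined schema,
# like a relational table (modeled here as a list of tuples).
orders = [(1, "widget", 9.99), (2, "gadget", 19.99)]

# Semistructured: self-describing elements and attributes —
# each JSON record carries its own field names with it.
record = json.loads('{"order_id": 1, "item": "widget", "price": 9.99}')
print(record["item"])  # widget — field names travel with the data

# CSV is also semistructured: positional columns, but the header
# row makes each record self-describing when read as a dict.
text = "order_id,item,price\n1,widget,9.99\n"
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["price"])  # 9.99
```

Unstructured data (images, video, clickstreams) has no such schema at all; you would read it as raw bytes and impose structure downstream.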
Operational data workloads 14

Order
… … …

… … …

* * *

Data is stored in a database that is optimized for online transactional processing


(OLTP) operations that support applications Examples of using OLTP uses:

A mix of read and write activity • Online banking


For example: • Shopping carts
• Read the Product table to display a catalog • Booking a ticket
• ATM machines
• Write to the Order table to record a purchase
• Credit card payment processing
Data is stored using transactions (both online and in-store)
Transactions are "ACID" based: • Record keeping (including health
• Atomicity – Each transaction is treated as a single unit of work, which succeeds completely records, inventory control,
or fails completely production scheduling, claims
• Consistency – Transactions can only take the data in the database from one valid state to processing, customer service
another ticketing, and many other
• Isolation – Concurrent transactions cannot interfere with one another applications)
• Durability – When a transaction has succeeded, the data changes are persisted in the
database

Popular databases: MySQL, Oracle, MS SQL Server, MongoDB, PostgreSQL


© Copyright Microsoft Corporation. All rights reserved.
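Atomicity is easy to demonstrate with Python's built-in sqlite3 module. This is a minimal sketch, not a production pattern; the table and column names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")

# Atomicity: both inserts succeed together or fail together.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO orders (item) VALUES ('widget')")
        conn.execute("INSERT INTO orders (item) VALUES (NULL)")  # violates NOT NULL
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 0 — the first insert did not persist on its own
```

The `with conn:` block is sqlite3's transaction context manager: the failed second insert undoes the first one too, so the database moves only between valid states.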
Analytical data workloads 15

2
3
1
4
▲----
▼----
▲----

1 Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis

2 Data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular abstractions over files in
the data lake, or a data warehouse with a fully relational SQL engine

3 Data in tables may be aggregated and loaded into an online analytical processing (OLAP) model, or cube

4 The files in the data lake, relational tables, and analytical model can be queried to produce reports and dashboards

© Copyright Microsoft Corporation. All rights reserved.
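Steps 1–3 can be sketched as a toy ETL flow in plain Python. The field names and values below are invented for illustration; a real pipeline would use Spark or a warehouse engine rather than dicts.

```python
import json
from collections import defaultdict

# 1. Extract: operational data arrives as raw JSON lines
raw = [
    '{"region": "east", "amount": 10}',
    '{"region": "east", "amount": 5}',
    '{"region": "west", "amount": 7}',
]

# 2. Transform and load into a tabular shape (a list of dicts,
#    standing in for a lakehouse or warehouse table)
table = [json.loads(line) for line in raw]

# 3. Aggregate into an OLAP-style summary keyed by region
cube = defaultdict(int)
for row in table:
    cube[row["region"]] += row["amount"]

print(dict(cube))  # {'east': 15, 'west': 7}
```

Step 4 then corresponds to querying `cube` (or the underlying `table`) to feed reports and dashboards.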


Veracity across the data pipeline

Clean and transform data that enters the pipeline. Preserve the integrity of the data as it's combined with other sources, transformed, processed, and analyzed.

Data sources -> Ingest -> Store -> Process -> Analyze

At the sources:
• Discover the maturity of sources.
• Understand the state of source data.
• Have a process to maintain the integrity of the data sources you control, or at least have traceability to reflect the changes.

Through the pipeline:
• Keep a golden copy of the data.
• Ensure consistency of all occurrences of the data element.
• Keep changes to a minimum and only provide access when it is necessary.
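One simple way to guard the "golden copy" idea above is to fingerprint records so downstream copies can be verified. A minimal sketch using the stdlib; the record contents and the `fingerprint` helper are illustrative assumptions, not part of the course material.

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    # Serialize with sorted keys so logically equal records hash equally
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

golden = {"customer_id": 42, "email": "a@example.com"}
golden_hash = fingerprint(golden)

copy_ok = {"email": "a@example.com", "customer_id": 42}   # same data, reordered
copy_bad = {"customer_id": 42, "email": "b@example.com"}  # silently mutated

print(fingerprint(copy_ok) == golden_hash)   # True
print(fingerprint(copy_bad) == golden_hash)  # False
```

Comparing hashes rather than full records makes consistency checks cheap even when occurrences of a data element are scattered across stores.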
Data Lake vs. Data Warehouse

Source: What is a data lake? - Azure Architecture Center | Microsoft Learn

Data Lakehouse

Source: https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html

Data Platforms
Understanding foundations

A data platform should support data engineers, data analysts, and data operations professionals.

Data professionals work with multiple types of data to perform a variety of data operations using a range of tools and scripting languages.

Types of data: structured, semi-structured, unstructured
Data operations: integration, transformation, consolidation, analysis
Languages: SQL (SELECT…), Python (df=spark.read(…)), R, Java, .NET, Scala, and others

© Copyright Microsoft Corporation. All rights reserved.
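The `SELECT…` snippet above can be fleshed out with Python's built-in sqlite3 module, showing SQL and Python working together on structured data. The table and column names here are assumptions for the example; the slide's `df=spark.read(…)` line is the PySpark equivalent for reading data into a DataFrame at cluster scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("widget", "tools"), ("gadget", "toys"), ("sprocket", "tools")],
)

# SQL: a declarative SELECT over structured data
rows = conn.execute(
    "SELECT name FROM products WHERE category = 'tools' ORDER BY name"
).fetchall()

# Python: procedural post-processing of the result set
print([r[0] for r in rows])  # ['sprocket', 'widget']
```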


Important data platform concepts

Operational and analytical data
• Operational: transactional data used by applications
• Analytical: optimized for analysis and reporting

Streaming data
• Perpetual, real-time data feeds

Data pipeline
• Orchestrated activities to transfer and transform data
• Used to implement extract, transform, and load (ETL) or extract, load, and transform (ELT) operations

Data lake
• Analytical data stored in files
• Distributed storage for massive scalability

Data warehouse
• Analytical data stored in a relational database
• Typically modeled as a star schema to optimize summary analysis

Apache Spark
• Open-source engine for distributed data processing

© Copyright Microsoft Corporation. All rights reserved.


Break (5 min)

Modern Data Platform
Design Principles
Example: Azure

Modern Data Analytics Architecture – Conceptual

Source: Big data architectures - Azure Architecture Center | Microsoft Learn

Modern Data Architecture Must Store:

Key design considerations
• Scalable data lake
• Performant and cost-effective components
• Seamless data movement
• Unified governance

Source: AWS Documentation

AWS services to manage data movement and governance (Collect -> Store -> …)

Key design considerations
• Seamless data movement: AWS Glue
• Unified governance: AWS Lake Formation

For Microsoft, the counterpart is Purview.

Source: AWS Documentation

Data Storage Types on Azure

• Depends on the application and business use cases
• Depends on OLTP and OLAP data workloads

Source: Databases architecture design - Azure Reference Architectures | Microsoft Learn


A quick high-level glimpse at the critical data services in Azure

Categories span operational data, data ingestion/ETL, analytical data storage and processing, and data modeling and visualization.

• Data ingestion/ETL: Azure Data Factory, Azure Stream Analytics
• Analytical data storage and processing: Azure Data Lake Storage Gen2, Azure Synapse Analytics, Azure Databricks
• Data modeling and visualization: Microsoft Power BI

© Copyright Microsoft Corporation. All rights reserved.

End-to-End Azure Synapse Architecture – Data Analytics Pipelines: A Deeper View

Source: Analytics end-to-end with Azure Synapse - Azure Architecture Center | Microsoft Learn
Matching ingestion services to variety, volume, and velocity

• SaaS apps -> Azure Data Factory
• OLTP, ERP, and CRM business applications -> Azure Synapse Analytics pipelines
• File shares -> Azure File Sync
• Web, devices, IoT sensors, social media -> Azure Event Hubs, Azure IoT Hub
• On-premises data with limited connectivity -> Azure Data Box
Store and Manage Enterprise Data

Volume, velocity, veracity, and variety (hint: data lake)

Services: Azure Data Lake Storage Gen2, Blob Storage access tiers, Azure Data Share, Microsoft Purview, Azure Synapse Analytics

Modern data architecture storage layer

Storage layer: Catalog
• Microsoft Purview, Azure Data Share

Storage layer: Storage
• Azure Blob Storage – Data Lake
• Built-in integration with Azure Synapse Analytics

Storage for variety, volume, and velocity

Storage layer: Storage

Azure Synapse Analytics
• Structured data is loaded into classic DWH schemas (use case: BI dashboards)
• Semistructured data is loaded into staging tables

Azure Data Lake Storage Gen2
• Unstructured, semistructured, and structured data is stored as objects (use case: big data AI/ML)

Data Lake Zones/Layers for data in different states

Raw layer, or data lake one

Think of the raw layer as a reservoir that stores data in its natural and original state. It's unfiltered and unpurified. You might store the data in its original format, such as JSON or CSV.

It might be cost-effective to store the file contents as a column in a compressed file format, like Avro, Parquet, or Databricks Delta Lake. This raw data is immutable.

Keep your raw data locked down, and if you give permissions to any consumers, automated or human, ensure that they're read-only.

Source: Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn
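The raw-zone guidance above — land the file in its original format, then lock it down read-only — can be sketched locally with stdlib file permissions. The paths and record contents are illustrative assumptions; in a real lake you would use ADLS access control lists rather than POSIX bits.

```python
import json
import os
import stat
import tempfile

# Land a raw file in its original format (JSON here)
raw_zone = tempfile.mkdtemp(prefix="raw-zone-")
path = os.path.join(raw_zone, "events-2025-01-01.json")

with open(path, "w") as f:
    json.dump({"event": "page_view", "user": 42}, f)

# Lock the raw file down: owner/group/other get read-only access
os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # 0o444 — no write bits set for any consumer
```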

Data Lake Zones/Layers, continued

Enriched layer, or data lake two

Think of the enriched layer as a filtration layer. It removes impurities and can also involve enrichment.

Your standardization container holds systems of record and masters. Folders are segmented first by subject area, then by entity.

Data is available in merged, partitioned tables that are optimized for analytics consumption.

Source: Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn
Data Lake Zones/Layers, continued

Curated layer, or data lake two

Your curated layer is your consumption layer. It's optimized for analytics rather than data ingestion or processing. The curated layer might store data in denormalized data marts or star schemas.

Data from your standardized container is transformed into high-value data products that are served to your data consumers. This data has structure. It can be served to the consumers as-is, such as in data science notebooks, or through another read data store, such as Azure SQL Database.

Source: Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn
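The star-schema idea behind the curated layer — facts joined to dimensions, then denormalized into a consumption-ready mart — can be sketched in a few lines. All table contents below are invented for illustration.

```python
# Dimension table: descriptive attributes keyed by surrogate key
dim_product = {
    1: {"name": "widget", "category": "tools"},
    2: {"name": "gadget", "category": "toys"},
}

# Fact table: measurable events referencing the dimension
fact_sales = [
    {"product_id": 1, "qty": 3},
    {"product_id": 2, "qty": 5},
    {"product_id": 1, "qty": 2},
]

# Denormalized data mart: join facts to dimensions so consumers
# can query by business attributes without further joins
mart = [
    {**dim_product[f["product_id"]], "qty": f["qty"]} for f in fact_sales
]

total_tools = sum(r["qty"] for r in mart if r["category"] == "tools")
print(total_tools)  # 5
```

Denormalizing trades storage for query simplicity, which suits a consumption layer where reads vastly outnumber writes.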
Data Lake Zones/Layers, continued

Development layer, or data lake three

Your data consumers can bring other useful data products along with the data ingested into your standardized container.

Source: Data lake zones and containers - Cloud Adoption Framework | Microsoft Learn
Extracting Insights from the Data
Ad hoc, structured, irrespective of data load

Services: Azure Synapse Analytics, Azure Databricks, HDInsight, Data Lake Analytics, Azure Data Explorer, Azure Stream Analytics, Elastic Jobs on Azure, Microsoft Purview, Azure Data Factory

Visualization and machine learning services for advanced analytics and model predictions

Power Platform, Power BI, Azure Machine Learning

Break (5 min)

Azure Data Engineering Labs
Labs in total

| Lab # | Lab Name | Mandatory | Due Date, if any |
|---|---|---|---|
| 1 | Explore Synapse | Yes | 4/1/2025 |
| 2 | Serverless SQL | Yes | 4/1/2025 |
| 3 | Transform Data with SQL | Yes | 4/1/2025 |
| 4 | Lake Database | Yes | 4/1/2025 |
| 5 | Synapse Spark | Yes | 4/1/2025 |
| 6 | Transform Spark | Yes | 4/1/2025 |
| 7 | Spark Delta Lake | No | NA |
| 8 | Data Warehouse | Yes | 4/1/2025 |
| 9 | Load Data Warehouse | Yes | 4/1/2025 |
| 10 | Synapse Pipeline | Yes | 4/1/2025 |
| 11 | Pipeline Notebook | No | NA |
| 12 | Synapse Link (Cosmos DB) | Yes | 4/1/2025 |
| 13 | Synapse Link (SQL) | No | NA |
| 14 | Stream Analytics | Yes | 4/1/2025 |
| 15 | Stream Analytics and Synapse | No | NA |
| 16 | Stream Analytics and Power BI | Yes | 4/1/2025 |
| 17 | Synapse and Purview | Yes | 4/1/2025 |
| 18 | Explore Databricks | Yes | 4/1/2025 |
| 19 | Databricks Spark | Yes | 4/1/2025 |
| 20 | Databricks Delta Lake | Yes | 4/1/2025 |
| 21 | Databricks SQL | No | NA |
| 22 | Databricks ADF | No | NA |
| 23 | Practice Assessment: DP-203T00-A Data Engineering on Microsoft Azure | No | NA |
Before we start today's lab

Have you created your Azure Skillable account?

First Lab
Lab 1: Explore Synapse

Azure Synapse is an enterprise analytics service that accelerates time to insight across
data warehouses and big data systems. Azure Synapse brings together the best of SQL
technologies used in enterprise data warehousing, Spark technologies used for big
data, Data Explorer for log and time series analytics, Pipelines for data integration and
ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB,
and AzureML.

What is Azure Synapse Analytics: https://learn.microsoft.com/en-us/azure/synapse-analytics/overview-what-is



First Lab

Lab 1: Explore Synapse — what will we accomplish today?

1. Explore the GitHub repo for the labs
2. Set up the lab instance
3. Provision/set up the environment for our exploration with the setup file
4. Explore Synapse Workspace Studio
5. Ingest data from an external source and play around with SQL querying
6. Create a Spark pool and use a PySpark notebook (PySpark: Python API for Apache Spark)
7. Create ad-hoc/quick visualizations of the data at hand
8. Understand data/SQL pools and data lake semantics

What is Azure Synapse Analytics: https://learn.microsoft.com/en-us/azure/synapse-analytics/overview-what-is

First Lab

Things to remember to avoid frustration:

1. The lab is a free and frugal resource (small instances).
2. Due to 1, any lab can fail during setup while provisioning resources.
3. In case of 2, exit the setup or let the setup exit, and then try again after a few minutes.
4. You have 10 instances per lab (no more), so use them wisely.
5. Take screenshots (important ones) for every lab and add them to the lab-specific document, such as yourname_lab1.docx.

What is Azure Synapse Analytics: https://learn.microsoft.com/en-us/azure/synapse-analytics/overview-what-is

Let's go to Skillable!

Next Week

Storage on Modern Data Platforms: Examples from AWS and Azure
Thank you