Documentation Project
Table of Contents
Project overview
Key Objectives
Project Methodology
Project Benefits
Used Services/Tools in Azure
Architecture of the project
Environment Setup
Resource group
What is a resource group in Azure?
How to create a resource group in Azure
Best practices for using resource groups
Resource group - service
Azure Data Factory (ADF)
Key Features of ADF
Benefits of Using ADF
Resource group - Project
Azure Databricks
Key Features of Azure Databricks
Benefits of Using Azure Databricks
Resource group - Project
Azure Key Vault
Why use Azure Key Vault?
Key Vault features
How to use Azure Key Vault
Resource group - Project
Storage account
Key features of Azure Storage Account
Types of Azure Storage Account
Benefits of using Azure Storage Account
Resource group - Project
Data source - on-premises database
Data source: SSMS - SQL Server
Create Account and user
Data Ingestion: Azure Data Factory - setup
Configuration run Integration Service
Configuration Linked services
Configuration Pipeline
Lookup
ForEach
Data Transformation: Azure Databricks - setup
Databricks in ADF (Azure Data Factory)
Configuration Linked service
Pipeline Configuration
Power BI - load transformed data
Azure Blob Storage
Azure Data Lake Storage Gen2
Glossary
Azure Blob Storage
Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen1
Parquet
Blob
Avro
Delta Format
Project overview
This data engineering project aims to migrate a company's on-premises database to
Azure, leveraging Azure Data Factory for data ingestion, transformation, and
storage. The project implements a three-stage storage strategy consisting of
bronze, silver, and gold data layers (Medallion architecture). The bronze layer
holds raw data extracted from the source database, the silver layer contains data
that has undergone cleansing, transformation, and enrichment, and the gold layer
serves as the aggregated and standardized data source for Power BI analytics.
Azure Databricks is employed for the data transformation tasks.
Key Objectives
1. Migrate data from on-premises database to Azure: Utilize Azure Data Factory
to seamlessly transfer data from the on-premises database to Azure storage
accounts.
2. Implement a three-stage data storage strategy: Establish a bronze, silver, and
gold data layer to handle raw, transformed, and aggregated data, respectively.
3. Leverage Azure Databricks for data transformation: Employ Azure Databricks'
Apache Spark engine to perform data cleansing, transformation, and
enrichment tasks.
4. Prepare data for Power BI analytics: Ensure that the gold data layer is in a
format suitable for loading into Power BI dashboards and reports.
Project Methodology
Project Benefits
Resource group
A resource group is a logical container that groups related Azure resources together.
This includes resources such as virtual machines (VMs), storage accounts,
databases, web apps, and more. Resource groups are essential for organizing and
managing your Azure infrastructure, and they offer several benefits, including:
Creating a resource group in Azure is a simple process that can be done from the
Azure portal. Here are the steps to create a resource group:
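Besides the portal, a resource group can also be created programmatically. Below is a
minimal sketch using the Python management SDK; the subscription ID, group name, and
region are placeholder assumptions, and the azure-identity and azure-mgmt-resource
packages are required.
Python (sketch):
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Placeholder values - replace with your own subscription and naming
subscription_id = "<your-subscription-id>"
resource_group_name = "rg-data-engineering-project"  # hypothetical name
location = "westeurope"  # hypothetical region

# DefaultAzureCredential picks up az login, environment variables, or a managed identity
credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, subscription_id)

# Create (or update) the resource group
result = client.resource_groups.create_or_update(resource_group_name, {"location": location})
print(f"Resource group '{result.name}' provisioned in {result.location}")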
Here are some best practices for using resource groups in Azure:
Overall, Azure Data Factory is a powerful and versatile data integration service that
can help organizations of all sizes to streamline their data management processes.
Azure Databricks is a fully managed cloud service for data engineering, data
science, and machine learning. It combines the powerful Apache Spark engine with
a streamlined user experience to help organizations of all sizes harness the power of
their data.
Overall, Azure Databricks is a powerful and versatile data analytics platform that can
help organizations of all sizes harness the power of their data to gain insights, make
better decisions, and drive business innovation.
Azure Key Vault is a cloud-based service for securely storing and managing secrets,
including passwords, certificates, and cryptographic keys. It is a highly secure
service that uses hardware security modules (HSMs) to protect your secrets.
● Improved security: Azure Key Vault uses HSMs to protect your secrets, which
is more secure than storing them in your application code or database.
● Centralized management: Azure Key Vault provides a centralized location to
store and manage all of your secrets, which makes it easier to track and audit
them.
● Delegated access: You can control who has access to your secrets by using
Azure Active Directory (Azure AD). This can help to prevent unauthorized
access to your secrets.
● Reduced risk of breaches: By using Azure Key Vault, you can reduce the risk
of data breaches caused by stolen or compromised secrets.
Azure Key Vault offers a number of features that make it a powerful and versatile tool
for managing secrets. These features include:
● Secret storage: Azure Key Vault can store a variety of secrets, including
passwords, certificates, and cryptographic keys.
● Secret rotation: Azure Key Vault helps you rotate your secrets on a regular
schedule to protect against key compromise.
● Access control: Azure Key Vault uses Azure AD to control who has access to
your secrets. You can define granular access control policies to control who
can read, write, and delete your secrets.
● Auditing: Azure Key Vault provides comprehensive auditing logs that you can
use to track who accessed your secrets and when.
● Integration with other Azure services: Azure Key Vault can be integrated with
other Azure services, such as Azure App Service and Azure Functions. This
makes it easy to use your secrets in your applications.
How to use Azure Key Vault
You can use Azure Key Vault to store and manage secrets in a number of ways,
including:
● The Azure portal: You can use the Azure portal to create, manage, and
access your secrets.
● The Azure CLI: You can use the Azure CLI to automate tasks related to Azure
Key Vault.
● The .NET SDK: You can use the .NET SDK to integrate Azure Key Vault with
your .NET applications.
● The REST API: You can use the REST API to interact with Azure Key Vault
programmatically.
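As an illustration of the SDK route, the sketch below stores and retrieves a secret with
the Python SDK; the vault name and the secret name/value are placeholder assumptions, and
the azure-identity and azure-keyvault-secrets packages are required.
Python (sketch):
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholder vault URL - replace with your own Key Vault name
vault_url = "https://<your-key-vault-name>.vault.azure.net"

credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)

# Store the on-premises SQL password as a secret (hypothetical name and value)
client.set_secret("onprem-sql-password", "<password>")

# Retrieve it later, for example when configuring a linked service
retrieved = client.get_secret("onprem-sql-password")
print(retrieved.name, "retrieved (value not printed for safety)")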
Overall, Azure Key Vault is a powerful and versatile service for securely storing and
managing secrets. It is an essential part of any cloud security strategy.
● Blob storage: Designed for storing unstructured data, such as images, videos,
and documents.
● File storage: Provides a managed file share solution for cloud-based
applications and on-premises file access.
● Queue storage: Optimized for storing and processing large numbers of
messages reliably between application components.
● Table storage: Efficiently stores structured data in a NoSQL format, ideal for
applications that require fast access to large datasets.
Overall, Azure Storage Account is a versatile and scalable cloud storage solution
that can meet the needs of a wide range of organizations. Its durability, scalability,
security, and cost-effectiveness make it an ideal choice for storing and managing
essential data.
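In this project the storage account holds the bronze, silver, and gold containers that the
pipeline uses later on. They can be created in the portal (as referenced in the ForEach
section), or programmatically as in the minimal sketch below; the connection string is a
placeholder assumption and the azure-storage-blob package is required.
Python (sketch):
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

# Placeholder connection string - taken from the storage account's "Access keys" blade
connection_string = "<your-storage-account-connection-string>"
service = BlobServiceClient.from_connection_string(connection_string)

# Create the three medallion-layer containers used in this project
for container in ["bronze", "silver", "gold"]:
    try:
        service.create_container(container)
        print(f"Created container: {container}")
    except ResourceExistsError:
        print(f"Container already exists: {container}")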
1. ADF
2. Azure Databricks
3. Key vault
4. Storage account
5. Synapse workspace
The data source is the Microsoft sample database AdventureWorks (AdventureWorksLT2022.bak).
Link below:
https://learn.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver16&tabs=ssms
Load the backup of the downloaded database (before you restore the backup, it may be
necessary to move the downloaded .bak file to the SQL Server backup folder, for example
MSSQL16.SQLEXPRESS\MSSQL\Backup).
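After restoring the backup, a quick sanity check is to connect to the instance and list the
SalesLT tables that the pipeline will later ingest. The sketch below rests on assumptions: a
local SQLEXPRESS instance, Windows authentication, the database name AdventureWorksLT2022,
the pyodbc package, and ODBC Driver 17 for SQL Server being installed.
Python (sketch):
import pyodbc

# Assumed local instance and database name - adjust to your environment
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost\\SQLEXPRESS;"
    "DATABASE=AdventureWorksLT2022;"
    "Trusted_Connection=yes;"
)

# List all tables in the SalesLT schema (same query as used later in the Lookup activity)
cursor = conn.cursor()
cursor.execute(
    "SELECT s.name AS SchemaName, t.name AS TableName "
    "FROM sys.tables t INNER JOIN sys.schemas s ON t.schema_id = s.schema_id "
    "WHERE s.name = 'SalesLT'"
)
for schema_name, table_name in cursor.fetchall():
    print(schema_name, table_name)
conn.close()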
Create Account and user
To connect Azure Data Factory to the on-premises database, you need to create an integration
runtime, which allows the two services to communicate.
3. Specify:
a. Name
b. Connect via integration runtime: set to AutoResolveIntegrationRuntime
c. Authentication type: Account key
d. Azure subscription: select your subscription
e. Storage account name: select your storage account
f. Test connection: To linked service
g. Click “Test connection” to check whether you are able to connect to the service
Configuration Pipeline
The prepared pipeline will iterate through all tables in our database and select those with
the schema “SalesLT”. These tables will be stored in our Azure storage account in the bronze
folder, and Databricks will then perform the transformations for the silver and gold layers.
To create a pipeline you need to:
1. Go to Author
2. Click “+”
3. Pipeline → Pipeline
4. Name the pipeline (in this project it is entitled “Copy_table_onPremise”)
In the Lookup activity, we will specify the tables that we would like to copy to our storage
account. To do that you need to:
1. Click on the Lookup activity
2. In General, type the name of the activity (in the project I used “Look up SQL
Tables”)
3. Go to Settings
4. Create a new dataset (click “+ New”)
6. Type the name of the dataset (in my case I used TablesSQLDB) and, importantly, do not
specify a table name
7. After creating the dataset you should be able to select the source in the Lookup activity
8. In the Query panel, use the following query to list all tables with the schema “SalesLT”
SQL Query:
SELECT
s.name AS SchemaName,
t.name AS TableName
FROM sys.tables t
INNER JOIN sys.schemas s
ON t.schema_id = s.schema_id
WHERE s.name = 'SalesLT'
ForEach
3. In Items, select “Activity outputs” → the value array of the “Look up SQL Tables”
activity (you can also type the expression manually)
14. Select the bronze folder (beforehand, you need to create three containers in the
Storage account service: bronze / silver / gold); leave directory and file name blank.
The file structure used in the project is
bronze/Schema/Tablename/Tablename.parquet, for example:
bronze/SalesLT/Address/Address.parquet
15. Click “OK”, then go to datasets, select the parquet dataset that you created, and
in the “File path” specify the following expressions:
Directory: @{concat(dataset().schemaname,'/',dataset().tablename)}
File name: @{concat(dataset().tablename,'.parquet')}
16. Go to Parameters and create two parameters: schemaname and tablename
17. Go back to the pipeline → Copy data activity (Sink section) and select your parquet
dataset
18. In the dataset properties, for schemaname and tablename use the following
expressions:
schemaname: @item().SchemaName
tablename: @item().TableName
Data Transformation: Azure Databricks - setup
1. Firstly you need to configure Azure Data Lake Storage Gen2 so that it can be mounted in
Databricks (if you have Azure Databricks Premium you do not have to do this); see the
link below and the sketch after this list:
https://www.linkedin.com/pulse/how-mount-adls-gen-2-storage-account-databricks-ananya-nayak/
3. Set up a single-node cluster, the Databricks runtime version, the node type, and
termination (it is important to configure termination, because it shuts down the machine
when it is not in use)
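A minimal sketch of such a mount, run from a Databricks notebook, is shown below. The
service principal client ID, tenant ID, secret scope, and key names are placeholder
assumptions following the approach from the linked article; the container names are the
ones used in this project.
Python (sketch, Databricks notebook):
# Hypothetical service principal and secret scope names - adjust to your setup
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-client-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<secret-scope>", key="<client-secret-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount each medallion-layer container under /mnt/<container>
for container in ["bronze", "silver", "gold"]:
    dbutils.fs.mount(
        source=f"abfss://{container}@<storage-account>.dfs.core.windows.net/",
        mount_point=f"/mnt/{container}",
        extra_configs=configs,
    )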
You need to specify a connection in Azure Data Factory to Azure Databricks in order to
trigger the notebooks that have been written in Azure Databricks.
1. Create two new Databricks Notebook activities: “bronze to silver” and “silver to gold”
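The notebook code itself is not reproduced in this document. As a rough sketch of what the
“bronze to silver” notebook might do, assuming the mounts created earlier and a simple pass
that rewrites each bronze Parquet table in Delta format in the silver layer (the actual
cleansing logic is left as a placeholder):
Python (sketch, Databricks notebook):
# List the SalesLT tables landed in the bronze layer by the ADF pipeline
table_names = [item.name.rstrip("/") for item in dbutils.fs.ls("/mnt/bronze/SalesLT/")]

for table in table_names:
    # Read the raw Parquet file written by the copy activity
    df = spark.read.parquet(f"/mnt/bronze/SalesLT/{table}/{table}.parquet")

    # Cleansing / transformation / enrichment steps for the silver layer would go here

    # Write the result in Delta format to the silver layer
    df.write.format("delta").mode("overwrite").save(f"/mnt/silver/SalesLT/{table}/")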
Power BI - load transformed data
Azure Blob Storage
Following these steps will enable you to create the data model in Power BI by loading only
the necessary tables.
Azure Data Lake Storage Gen2
Following these steps will enable you to create the data model in Power BI by loading only
the necessary tables.
Glossary
Azure Blob Storage is a highly scalable, durable, and secure object storage service
that stores unstructured data objects, such as text, images, and videos. It is a
popular choice for storing large datasets that need to be accessed frequently. Blob
Storage supports several different data formats, including Avro, Parquet, and CSV.
Azure Data Lake Storage Gen2 is a high-performance, secure, and scalable data
lake storage service that stores large volumes of structured, semi-structured, and
unstructured data. It is designed to handle petabytes of data and can be used for a
variety of purposes, including data ingestion, data processing, and data analytics.
Data Lake Storage Gen2 supports a variety of data formats, including Parquet, Avro,
and JSON.
Azure Data Lake Storage Gen1 was the previous version of Azure Data Lake Storage
Gen2. It was less scalable and less secure than Gen2, but it was still a popular
choice for storing large datasets. Gen1 is no longer supported, and all new
deployments should use Gen2.
Difference between Azure Data Lake Storage Gen1 and Azure Data Lake Storage Gen2
Here is a table that summarizes the key differences between Azure Data Lake
Storage Gen1 and Azure Data Lake Storage Gen2:
Feature | Azure Data Lake Storage Gen1 | Azure Data Lake Storage Gen2
Parquet
Parquet is a columnar data storage format that is optimized for analytical workloads.
It is commonly used in data lakes and data warehouses. Parquet files are
compressed and organize data column by column, which makes it efficient to read and
write analytical data.
Blob
A blob is a general-purpose data storage object that can store unstructured data,
such as text, images, and videos. Blobs are organized into containers within a
storage account. Blob storage is a scalable and durable storage solution that is
commonly used for storing large volumes of data.
Avro
Avro is a data serialization format that is designed for efficiency and flexibility. It is a
self-describing format, which means that the schema of the data is stored in the file
itself. Avro is commonly used for storing structured data in data lakes and data
warehouses.
Delta Format
Delta Lake is an open-source storage layer that runs on top of data lake storage and
integrates with Apache Spark, providing a set of capabilities for managing and
analyzing large datasets. Delta Lake stores data in Parquet files and maintains a
transaction log to keep track of changes to the data. Delta Lake also provides a
number of features for managing and analyzing data, such as ACID transactions and
data versioning (time travel).
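As a small illustration of the versioning described above (a sketch assuming a
Databricks/Spark environment with Delta Lake available and a hypothetical path):
Python (sketch, Databricks notebook):
# Write a small DataFrame in Delta format (hypothetical path)
df = spark.createDataFrame([(1, "Address"), (2, "Customer")], ["id", "table_name"])
df.write.format("delta").mode("overwrite").save("/mnt/silver/demo_delta")

# Every write creates a new version; time travel reads an earlier one
first_version = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/silver/demo_delta")
first_version.show()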